In April, we experienced four incidents that resulted in degraded performance across GitHub services. This report also sheds light on three March incidents that resulted in degraded performance across GitHub services.

March 27 12:25 UTC (lasting 1 hour and 33 minutes)

On March 27 at 12:14 UTC, users began to see a degraded experience with Git Operations, GitHub Issues, pull requests, GitHub Actions, GitHub API requests, GitHub Codespaces, and GitHub Pages. We publicly statused Git Operations 11 minutes later, initially to yellow, followed by red for other impacted services. Full functionality was restored at 13:17 UTC.

The cause was traced to a change in a frequently used database query. The query alteration was part of a larger infrastructure change that had been rolled out gradually, starting in October 2022, accelerating in February 2023, and completing on March 20, 2023. The change increased the chance of lock contention, leading to increased query times and eventual resource exhaustion during brief load spikes, which caused the database to crash. An initial automatic failover resolved this seamlessly, but the slow query continued to cause lock contention and resource exhaustion, leading to a second failover that did not complete. Mitigation took longer than usual because manual intervention was required to fully recover. The query causing the lock contention was disabled via feature flag and then refactored.

We have added monitoring of the relevant database resources so that we avoid resource exhaustion and detect similar issues earlier in our staged rollout process. Additionally, we have enhanced our query evaluation procedures related to database lock contention, along with improved documentation and training material.

March 29 14:21 UTC (lasting 4 hours and 57 minutes)

On March 29 at 14:10 UTC, users began to see a degraded experience with GitHub Actions, with workflows not progressing. Engineers initially statused GitHub Actions nine minutes later. GitHub Actions started recovering between 14:57 UTC and 16:47 UTC before degrading again, and fully recovered the queue of backlogged workflows at 19:03 UTC.

We determined the cause of the impact to be a degraded database cluster. Contributing factors included a new load source from a background job querying that database cluster, maxed-out database transaction pools, and underprovisioning of the vtgate proxy instances responsible for query routing, load balancing, and sharding.

After the incident, we identified that the found_rows_pool pool, managed by the vtgate layer, was overwhelmed and unresponsive. The pool became flooded and stuck due to contention between inserting data into and reading data from the tables in the database, which left us unable to progress any new queries across the database cluster. The incident was mitigated by throttling job processing and adding capacity, including overprovisioning to speed processing of the backlogged jobs.
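The March 27 remediation hinged on being able to disable the problematic query at runtime via a feature flag. The report does not show the mechanism, so below is a minimal sketch under assumed names (`FeatureFlags`, `run_optimized_query`, and `run_legacy_query` are all hypothetical stand-ins, not GitHub's implementation) of how a risky query path can be gated behind a flag and flipped back to a known-safe path without a deploy:

```python
# Minimal sketch of gating a risky query path behind a runtime feature flag.
# All names here are hypothetical stand-ins, not GitHub's implementation.

class FeatureFlags:
    """Tiny in-memory flag store. A real system would back this with a
    dynamic configuration service so operators can flip flags mid-incident."""

    def __init__(self) -> None:
        self._enabled: set[str] = set()

    def enable(self, name: str) -> None:
        self._enabled.add(name)

    def disable(self, name: str) -> None:
        self._enabled.discard(name)

    def is_enabled(self, name: str) -> bool:
        return name in self._enabled


def run_optimized_query(owner_id: int) -> str:
    # Stand-in for the new query shape that increased lock contention.
    return f"optimized result for {owner_id}"


def run_legacy_query(owner_id: int) -> str:
    # Stand-in for the old, known-safe query shape.
    return f"legacy result for {owner_id}"


flags = FeatureFlags()
flags.enable("optimized_repo_query")  # the gradual rollout described above


def fetch_repositories(owner_id: int) -> str:
    if flags.is_enabled("optimized_repo_query"):
        return run_optimized_query(owner_id)
    return run_legacy_query(owner_id)


flags.disable("optimized_repo_query")  # incident response: no deploy needed
assert fetch_repositories(42) == "legacy result for 42"
```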
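The added monitoring of database resources is described only at a high level. As one hedged illustration of the general technique, a pool-utilization check with a warning threshold that fires well before exhaustion (the function, metric source, and thresholds below are assumptions, not GitHub's actual alerting) might look like:

```python
# Minimal sketch of threshold alerting on a database connection pool.
# The thresholds and alert levels are illustrative assumptions.

def check_pool_utilization(in_use: int, capacity: int,
                           warn_at: float = 0.8, page_at: float = 0.95) -> str:
    """Return an alert level for a pool, firing before full exhaustion."""
    utilization = in_use / capacity
    if utilization >= page_at:
        return "page"  # wake the on-call before the pool seizes up
    if utilization >= warn_at:
        return "warn"  # early signal, e.g. during a staged rollout
    return "ok"


assert check_pool_utilization(50, 100) == "ok"
assert check_pool_utilization(85, 100) == "warn"
assert check_pool_utilization(99, 100) == "page"
```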
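The March 29 mitigation throttled job processing so the saturated pool could drain. A minimal sketch of that general technique, assuming a semaphore as the throttle and sleeps standing in for real database work (the concurrency ceiling is a hypothetical knob, not GitHub's tooling):

```python
# Minimal sketch of throttling background-job concurrency to relieve a
# saturated database pool. All values and names are illustrative.
import threading
import time

MAX_CONCURRENT_DB_JOBS = 4  # assumed knob; lowered during an incident
db_slots = threading.BoundedSemaphore(MAX_CONCURRENT_DB_JOBS)


def process_job(job_id: int) -> None:
    # Each job must acquire a slot before touching the database, so the
    # number of in-flight queries never exceeds the configured ceiling.
    with db_slots:
        time.sleep(0.1)  # stand-in for the real database work
        print(f"job {job_id} done")


threads = [threading.Thread(target=process_job, args=(i,)) for i in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```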