GitHub Availability Report: June 2022
In June, we experienced four incidents resulting in significant impact and degraded state of availability to multiple GitHub.com services. This report also sheds light into an incident that impacted multiple GitHub.com services in May.
June 1 09:40 UTC (lasting 48 minutes)
During this incident, customers experienced delays in the start up of their GitHub Actions workflows. The cause of these delays was excessive load on a proxy server that routes traffic to the database.
At 09:37 UTC, Actions service noticed a marked increase in the time it takes customer jobs to start. Our on-call engineer was paged and Actions was statused red. Once we started to investigate, we noticed that the pods running the proxy server for the database were crash-looping due to out-of-memory errors. A change was created to increase the available memory to these pods, which fully rolled out by 10:08 UTC. We started to see recovery in Actions even before 10:08 UTC, and statused to yellow at 10:17 UTC. By 10:28 UTC, we were confident that the memory increase had mitigated the issue, and statused Actions green.
Ultimately, this issue was traced back to a set of data analysis queries being pointed at an incorrect database. The large load they placed on the database caused the crash loops and the broader impact. These queries have been moved to a dedicated analytics setup that does not serve production traffic.
We are adding alerts to identify increases in load to the proxy server to catch issues like this early. We are also investigating how we can put in guardrails to ensure production database access is limited to services that own the data.
June 21 17:02 UTC (lasting 1 hour and 10 minutes)
During this incident, shortly after the GA of Copilot, users with either a Marketplace or Sponsorship plan were unable to use Copilot. Users with those subscriptions received an error from our API responsible for creating authentication tokens. This impacted a little less than 20% of our active users at the time.
At approximately 16:45 UTC, we were alerted and noticed elevated error rates in the API and began investigating causes. We were able to identify the issue and statused red. Our engineers worked quickly to roll out a fix to the API endpoint and we saw API error rates begin lowering at approximately 17:45 UTC. By 18:00 UTC, we were no longer seeing this issue but decided to wait for 10 more minutes to status back to green to ensure there were no regressions.
We have increased our testing around this particular combination of subscription types, added these scenarios to our user testing and will add additional data shape testing before future rollouts.
June 28 17:16 UTC (lasting 26 minutes)
Our alerting systems detected degraded availability for Codespaces during this time. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on the causes and remediations in the July Availability Report, which will be published the first Wednesday of August.
June 29 14:48 UTC (lasting 1 hour and 27 minutes)
During this incident, services including GitHub Actions, API Requests, Codespaces, Git Operations, GitHub Packages, and GitHub Pages were impacted. As we continue to investigate the contributing factors, we will provide a more detailed update in the July Availability Report. We will also share more about our efforts to minimize the impact of similar incidents in the future.
Follow up to May 27 04:26 UTC (lasting 21 minutes) and May 27 07:36 UTC (lasting 1 hour and 21 minutes)
As mentioned in the May Availability Report, we are now providing a more detailed update on this incident following further investigation.
Both instances that occurred at 04:26 and 07:36 UTC were caused by the same contributing factors. In the first instance, an individual service team noticed higher than normal load and an increase in error rate on API requests and statused red. The load was particularly high on our login endpoint. While this did elevate error rates, it was not enough to cause a widespread outage and we should have likely statused yellow in this instance.
After follow-up that indicated the load pattern had subsided, our on-call team determined it was safe to report the situation was mitigated and began to investigate further.
However, three hours later, we again experienced a degradation of service from a sustained high load in traffic. This was again concentrated on our login endpoint. We statused all services red, since we were seeing sustained error rates for a variety of clients and situations, and then updated individual service statuses based on their SLOs. Services that were affected by the load pattern statused to yellow, while services that were not impacted statused back to green.
The duration of impact to GitHub.com from the second instance of the load pattern lasted about 15 minutes. We continued to see elevated traffic during this time and waited until a network-level mitigation was rolled out before statusing all affected services back to green.
In addition to network mitigation, we were able to use the data from this incident to add additional mitigations on the application side for a sustained load of this type, as well as inform architectural changes we can make in the future to make our services more resilient.
Following this incident, we are improving our on-call procedures to ensure we always report the correct status level based on SLO review. While we always want to over-communicate issues with customers for awareness, we want to only status red when necessary.
We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To receive real-time updates on status changes, please follow our status page. You can also learn more about what we’re working on on the GitHub Engineering Blog.