Inside a CODE RED: Network Edition

Written by Signal vs. Noise / Original link on Sep. 4, 2020

I wanted to follow up on Jeremy’s post about our recent outages with a deeper, more personal look behind the scenes. We call our major incident response efforts “CODE REDs” to signify that it is an all-hands-on-deck event and this definitely qualified. I want to go beyond the summary and help you see how an event like this unfolds over time. This post is meant for people who want a deeper, technical understanding of the outage as well as some insight into the human side of incident management at Basecamp.

The Prologue

The seeds of our issues this week started a few months ago. Two unrelated events started the ball rolling. The first event was a change in our networking providers. We have redundant metro links between our primary datacenter in Ashburn, VA and our other DC in Chicago, IL. Our prior vendor had been acquired and the new owner wanted us to change our service over to their standard offering. We used this opportunity to resurvey the market and decided to make a change. We ran the new provider alongside the other for several weeks. Then, we switched over entirely in late June.

The second event occurred around this same time when a security researcher notified us of a vulnerability. We quickly found a workaround for the issue by setting rules on our load balancers. These customizations felt sub-optimal and somewhat brittle. With some further digging, we discovered a new version of load balancer firmware that had specific support for eliminating the vulnerability and we decided to do a firmware upgrade. We first upgraded our Chicago site and ran the new version for a few weeks. After seeing no issues, we updated our Ashburn site one month ago. We validated the vulnerability was fixed and things looked good.

Incident #1

Our first incident began on Friday, August 28th at 11:59AM CDT. We received a flood of alerts from PagerDuty, Nagios and Prometheus. The Ops team quickly convened on our coordination call line. Monitoring showed we lost our newer metro link for about 20-30 seconds. Slow BC3 response times continued despite the return of the network. We then noticed chats and pings were not working at all. Chat reconnections were overloading our network and slowing all of BC3. Since the problem was clearly related to chat, we restarted the Cable service. This didn’t resolve the connection issues. We then opted to turn chat off at the load balancer layer. Our goal was to make sure the rest of BC3 stabilized. The other services did settle as hoped. We restarted Cable again with no effect. Finally, as the noise died down, we noticed a stubborn alert for a single Redis DB instance.

Initially, we overlooked this warning because the DB was not down. We probed it from the command line and it still responded. We kept looking and finally discovered replication errors on a standby server and saw the replica was stuck in a resynchronization loop. The loop kept stealing resources and slowing the primary node. Redis wasn’t down but it was so slow that it was only responding to monitoring checks. We restarted Redis on the replica and saw immediate improvement. BC3 soon returned to normal. Our issue was not a novel Redis problem but it was new to us. You can find much more detail here.
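The failure mode above, a Redis that still answers probes while its replica is stuck resynchronizing, suggests a check that looks at replication state rather than liveness alone. What follows is a minimal sketch in Python, not a description of Basecamp's actual monitoring. It assumes the text of `INFO replication` has already been fetched from the replica (for example with `redis-cli`). The function names and the 30-second threshold are hypothetical; the field names are the ones Redis reports for a replica.

```python
def parse_info(text):
    """Parse the 'key:value' lines of Redis INFO output into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replica_problems(info_text, max_io_gap_seconds=30):
    """Return a list of replication problems; empty means the replica looks healthy.

    max_io_gap_seconds is a hypothetical threshold chosen for illustration.
    """
    info = parse_info(info_text)
    problems = []
    if info.get("role") != "slave":
        return problems  # only replicas report the fields checked below
    if info.get("master_link_status") != "up":
        problems.append("link to primary is down")
    if info.get("master_sync_in_progress") == "1":
        problems.append("full resynchronization in progress")
    gap = int(info.get("master_last_io_seconds_ago", "-1"))
    if gap < 0 or gap > max_io_gap_seconds:
        problems.append("no recent traffic from primary")
    return problems
```

A single resynchronization is normal after a network drop. A loop like the one described here would show up as these problems persisting across several consecutive polls, so an alert built on this sketch would likely want to count repeats before paging anyone.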

The Postmortem

The big question lingering afterward was “how can a 30 second loss of connectivity on a single redundant networking link take down BC3?” It was clear that the replication problem caused the pain. But, it seemed out of character that dropping one of two links would trigger this kind of Redis failure. As we went through logs following the incident, we were able to see that BOTH of our metro links had dropped for short periods. We reached out to our providers in search of an explanation. Early feedback pointed to some sub-optimal BGP configuration settings. But, this didn’t fully explain the loss of both circuits. We kept digging.

This seems as good a time as any for the confessional part of the story. Public postmortems can be challenging because not all of the explanations look great for people involved. Sometimes, human error contributes to service outages. In this case, my own errors in judgement and lack of focus came into play. You may recall we tripped across a known Redis issue with a documented workaround. I created a todo for us to make those configuration changes to our Redis servers. The incident happened on a Friday when all but 2 Ops team members were off for the day. Mondays are always a busy, kick-off-the-week kind of day and it was also when I started my oncall rotation. I failed to make sure that config change was clearly assigned or finished with the sense of urgency it deserved. I’ve done this for long enough to know better. But, I missed it. As an Ops lead and active member of the team, every outage hurts. But this one is on me and it hurts even more so.

Incident #2

At 9:39AM on Tuesday, 9/01, the unimaginable happened. Clearly, it isn’t unimaginable and a repeat now seems inevitable. But, this was not our mindset on Tuesday morning. Both metro links dropped for about 30 seconds and Friday began to repeat itself. We can’t know if the Redis config changes would have saved us because they had not been made (you can be sure they are done now!). We recognized the problem immediately and sprang into action. We restarted the Redis replica and the Cable service. It looked like things were returning to normal 5 minutes after the network flap. Unfortunately, our quick response during peak load on a Tuesday had unintended consequences. We saw a “thundering herd” of chat reconnects hit our Ashburn DC and the load balancers couldn’t handle the volume. Our primary load balancer locked up under the load and the secondary tried to take over. The failover didn’t register with the downstream hosts in the DC and we were down in our primary DC. This meant BC3, BC2, basecamp.com, Launchpad and supporting services were all inaccessible. We attempted to turn off network connections into Ashburn but our chat ops server was impacted and we had to manually reconfigure the routers to disable anycast. Managing problems at peak traffic on a Tuesday is much different than managing them on a Friday.
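A “thundering herd” forms when many clients lose their connection at the same instant and then all retry on the same schedule. One common, general mitigation is exponential backoff with random jitter on the client side. The sketch below illustrates that technique in Python. It is not a description of how Basecamp's chat clients reconnect or of any fix applied after these incidents, and the function name and the `base` and `cap` defaults are assumptions chosen for illustration.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    Uses "full jitter": a uniform random delay between 0 and a capped
    exponential ceiling, so clients that dropped together do not all
    retry at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Simulate 10,000 clients that all lost their connection simultaneously
    # and are each on their fourth retry (ceiling of 8 seconds).
    delays = [reconnect_delay(attempt=3) for _ in range(10_000)]
    print(f"retries spread between {min(delays):.2f}s and {max(delays):.2f}s")
```

The randomness is the important part. Backoff without jitter only delays the herd, while jitter spreads the same number of reconnects over a window so that the load balancers see a ramp instead of a spike.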

We begin moving all of our services to our secondary DC in Chicago. We move BC3 completely. While preparing to move BC2 and Launchpad, we apply the manual router changes and the network in Ashburn settles. We decide to stop all service movement and focus on stability for the rest of the day. That night after traffic dies down, we move all of our services back to their normal operating locations.

One new piece of the puzzle drops into place. The second round of network drops allowed our providers to watch in real time as events unfolded. We learn that both of our metro links share a physical path in Pennsylvania, which was affected by a fiber cut. A single fiber cut in the middle of Pennsylvania could still hit us unexpectedly. This was as much a surprise to us as it was to our providers. At least we could now make concrete plans to remove this new problem from our environment.

Incident #3

We rotate oncall shifts across the Ops team. As 2020 would have it, this was my week. After a late night of maintenance, I hoped for a slow Wednesday morning. At 6:55AM CDT on 9/2, PagerDuty informed me of a different plan. Things were returning to normal by the time I got set up. We could see our primary load balancer had crashed and failed over to the secondary unit. This caused about 2 minutes of downtime across most of our Basecamp services. Thankfully, the failover went smoothly. We immediately ship the core dump file to our load balancer vendor and start combing logs for signs of unusual traffic. This felt the same as Incident #2 but the metrics were all different. While there had been a rise in CPU on the load balancers, it was nowhere near the 100% utilization of the day before. We wondered about Cable traffic – mostly because of the recent issues. There was no sign of a network flap. We looked for evidence of a bad load balancer device or other network problem. Nothing stood out.

At 10:49AM, PagerDuty reared again. We suffered a second load balancer failover. Now we are back at peak traffic and the ARP synchronization on downstream devices fails. We are hard down for all of our Ashburn-based services. We decide to disable anycast for BC3 in Ashburn and run only from Chicago. This is again a manual change that is hampered by high load but it does stabilize our services. We send the new core file off to our vendor and start parallel work streams to get us to some place of comfort.

These separate threads spawn immediately. I stay in the middle of coordinating between them while updating the rest of the company on status. Ideas come from all directions and we quickly prioritize efforts across the Ops team. We escalate crash analysis with our load balancer vendor. We consider moving everything out of Ashburn. We expedite orders for upgraded load balancers. We prep our onsite remote hands team for action. We start spinning up virtual load balancers in AWS. We dig through logs and problem reports looking for any sign of a smoking gun. Nothing emerges … for hours.

Getting through the “waiting place” is hard. On the one hand, systems were pretty stable. On the other hand, we had been hit hard with outages for multiple days and our confidence was wrecked. There is a huge bias to want to “do something” in these moments. There was a strong pull to move out of Ashburn to Chicago. Yet, we have the same load balancers with the same firmware in Chicago. While Chicago has been stable, what if it is only because it hasn’t seen the same load? We could put new load balancers in the cloud! We’ve never done that before and while we know what problem that might fix – what other problems might it create? We wanted to move the BC3 backend to Chicago – but this process guaranteed a few minutes of customer disruption when everyone was on shaky ground. We call our load balancer vendor every hour asking for answers. Our supplier tells us we won’t get new gear for a week. Everything feels like a growing list of bad options. Ultimately, we opt to prioritize customer stability. We prepare lots of contingencies and rules for when to invoke them. Mostly, we wait. It seemed like days.

By now, you know that our load balancer vendor confirms a bug in our firmware. There is a workaround that we can apply through a standard maintenance process. This unleashes a wave of conflicted feelings. I feel huge relief that we have a conclusive explanation that doesn’t require days of nursing our systems alongside massive frustration over a firmware bug that shows up twice in one day after weeks running smoothly. We set the emotions aside and plan out the remaining tasks. Our services remain stable during the day. That evening, we apply all our changes and move everything back to its normal operating mode. After some prodding, our supplier manages to air ship our new load balancers to Ashburn. Movement feels good. The waiting is the hardest part.

The Aftermath

TL;DR: Multiple problems can chain into several painful, embarrassing incidents in a matter of days. I use those words to truly express how this feels. These events are now understandable and explainable. Some aspects were arguably outside of our control. I still feel pain and embarrassment. But we move forward. As I write this, the workarounds appear to be working as expected. Our new load balancers are being racked in Ashburn. We proved our primary metro can go down without issues since the vendor had a maintenance on their problematic fiber just last night. We are prepping tools and processes for handling new operations. Hopefully, we are on a path to regain your trust.

We have learned a great deal and have much work ahead of us. A couple of things stand out. While we have planned redundancy into our deployments and improved our live testing over the past year, we haven’t done enough and have a false sense of security around that – particularly when running at peak loads. We are going to get much more confidence in our failover systems and start proving them in production at peak load. We have some known disruptive failover processes that we hope to never use and will not run during the middle of your day. But, shifting load across DCs or moving between redundant networking links should happen without issue. If that doesn’t work, I would rather know in a controlled environment with a full team at the ready. We also need to raise our sense of urgency for rapid follow up on outage issues. That doesn’t mean we just add them to our list. We need to clear room for post-incident action explicitly. I will clarify the priorities and explicitly push out other work.

I could go on about our shortcomings. However, I want to take time to highlight what went right. First off, my colleagues at Basecamp are truly amazing. The entire company felt tremendous pressure from this series of events. But, no one cracked. Calmness is my strongest recollection from all of the long calls and discussions. There were plenty of piercing questions and uncomfortable discussions, don’t get me wrong. The mood, however, remained a focused, respectful search for the best path forward. This is the benefit of working with exceptional people in an exceptional culture. Our redundancy setup did not prevent these outages. It did give us lots of room to maneuver. Multiple DCs, a cloud presence and networking options allowed us to use and explore lots of recovery options in a scenario we had not seen before. You might have noticed that HEY was not impacted this week. If you thought that is because it runs in the cloud, you are not entirely correct. Our outbound mail servers run in our DCs. So no mail actually sends from the cloud. Our redundant infrastructure isolated HEY from any of these Basecamp problems. We will keep adapting and working to improve our infrastructure. There are more gaps than I would like. But, we have a strong base.

If you’ve stuck around to the end, you are likely a longtime Basecamp customer or perhaps a fellow traveler in the operations realm. For our customers, I just want to say again how sorry I am that we were not able to provide the level of service you expect and deserve. I remain committed to making sure we get back to the standard we uphold. For fellow ops travelers, you should know that others struggle with the challenges of keeping complex systems stable and wrestle with feelings of failure and frustration. When I said there was no blaming going on during the incident, that isn’t entirely true. There was a pretty serious self-blame storm going on in my head. I don’t write this level of personal detail as an excuse or to ask for sympathy. Instead, I want people to understand that humans run Internet services. If you happen to be in that business, know that we have all been there. I have developed a lot of tools to help manage my own mental health while working through service disruptions. I could probably write an entire post on that topic. In the meantime, I want to make it clear that I am available to listen and help anyone in the business that struggles with this. We all get better by being open and transparent about how this works.
