Three Basecamp outages. One week. What happened?

Written by Signal vs. Noise on Sep. 3, 2020

Basecamp has suffered through three serious outages in the last week: on Friday, August 28, on Tuesday, September 1, and again today. It’s embarrassing, and we’re deeply sorry.

This is more than a blip or two. Basecamp has been down during the middle of your day. We know these outages have caused real problems for you and your work. We’ve put you in the position of explaining Basecamp’s reliability to your customers and clients, too.

We’ve been leaning on your goodwill and we’re all out of it.

Here’s what has happened, what we’re doing to recover from these outages, and our plan to get Basecamp reliability back on track.

What happened

Friday, August 28

Tuesday, September 1

Wednesday, September 2

All told, we’ve tickled three obscure, tricky issues in a five-day span that led to overlapping, interrelated failure modes. These are exactly the kinds of woes we plan for. We detect and avert issues like these daily, so this week was a stark wake-up call: why didn’t we catch them this time? We’re working to learn why.

What we’re doing to recover from these outages

We’re pursuing multiple options in parallel to recover, along with contingencies in case our primary recovery plans fall through.

  1. We’re getting to the bottom of the load balancer crash with our vendor. We have a preliminary assessment and bugfix.
  2. We’re replacing our hardware load balancers. We’ve been pushing them hard, and traffic overload was a driving factor in one of the outages.
  3. We’re rerouting our redundant cross-datacenter network paths to ensure proper circuit diversity, eliminating the surprise interdependency between our network providers.
  4. As a contingency, we’re evaluating a move from hardware to software load balancers to cut provisioning time. When a hardware device has an issue, a replacement is days out; new software can be deployed in minutes (see the sketch after this list).
  5. As a contingency, we’re evaluating decentralizing our load balancer architecture to limit the impact of any one failure.
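
To make the software option concrete, here’s a minimal sketch in Python of what a software load balancer does: accept connections and spread them round-robin across a pool of application servers. It’s illustrative only, not our production setup; the backend addresses and ports are placeholders, and in practice purpose-built software like HAProxy or Envoy would fill this role.

    # Minimal sketch of a software load balancer: a TCP proxy that
    # round-robins new connections across backends. Illustrative only;
    # the backend addresses below are placeholders, not real servers.
    import asyncio
    import itertools

    BACKENDS = [("10.0.0.11", 3000), ("10.0.0.12", 3000)]  # hypothetical app servers
    backend_cycle = itertools.cycle(BACKENDS)

    async def pipe(reader, writer):
        # Copy bytes one way until the peer hangs up, then close our side.
        try:
            while data := await reader.read(65536):
                writer.write(data)
                await writer.drain()
        finally:
            writer.close()

    async def handle_client(client_reader, client_writer):
        # Each new connection goes to the next backend in the rotation.
        host, port = next(backend_cycle)
        try:
            backend_reader, backend_writer = await asyncio.open_connection(host, port)
        except OSError:
            client_writer.close()  # backend unreachable; drop the client
            return
        # Shuttle bytes in both directions until either side closes.
        await asyncio.gather(
            pipe(client_reader, backend_writer),
            pipe(backend_reader, client_writer),
            return_exceptions=True,
        )

    async def main():
        server = await asyncio.start_server(handle_client, "0.0.0.0", 8080)
        async with server:
            await server.serve_forever()

    if __name__ == "__main__":
        asyncio.run(main())

The point isn’t the code; it’s the operational property. Updating that backend list and restarting is a minutes-long change, where swapping a hardware appliance is a days-long one.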

What we’re doing to get our reliability back on track

We engineer our systems with multiple levels of redundancy and resilience precisely to avoid disasters like this one, including practicing our response to catastrophic failures within our live systems.

We didn’t catch these specific incidents, and we don’t expect to catch them all. What did catch us by surprise were the cascading failures: they exposed unexpected fragility and made recovery difficult. Those, we can prepare for.

We’ll be assessing our systems for resilience, fragility, and risk, and we’ll review the assessment process itself. We’ll share with you what we learn and the steps we take.

We’re sorry. We’re making it right.

We’re really sorry for the repeated disruption this week. One thing after another. There’s nothing like trying to get your own work done while your computer glitches out on you or just won’t cooperate. This one’s on us. We’ll make it right.

We really appreciate the understanding and patience you’ve shown us. We’ll do our best to earn back the credibility and goodwill you’ve extended to us as we get Basecamp back to rock-solid reliability. Expect Basecamp to be up 24/7.

As always, you can follow live updates on the Basecamp status page, catch the play-by-play on Twitter, and get in touch with our support team anytime.
