Breaking Azure Functions with Too Many Connections

Written by Troy Hunt - Aggregated on Wednesday, October 10, 2018
Tags: have-i-been-pwned, azure

For the most part, Have I Been Pwned (HIBP) runs very smoothly, especially given how cheaply I run many parts of the service. Occasionally though, I screw up and get something wrong that interrupts the otherwise slick operation and results in an outage. Last weekend was one such occasion and I want to explain what I got wrong, how you might get it wrong too and then, of course, how to fix it.

But first, some background: A few months ago I announced that Mozilla and AgileBits were baking HIBP into Firefox and 1Password (I'll call them "subscribers" in this post). They're leveraging the anonymity model described there to ensure their users can search for their identities without me ever knowing what they are. In addition, they can also sign those identities up to receive notifications if they're pwned in the future, again, via an anonymity model. The way the notifications work is that if an identity is signed up, it's just the first 6 characters of a SHA-1 hash of their email address that's sent to HIBP. If test@example.com is signed up, it's 567159 (those first 6 characters of the hash) that gets stored in HIBP. When I then load a data breach, every email address in the source data is hashed and if any begin with 567159 then a callback needs to be sent to an API on the subscriber's end. Because there may be multiple email addresses beginning with that hash prefix, a collection of suffixes is sent in the callback and the subscriber immediately discards any that don't match the ones they're monitoring. For example, a callback would look like this:

{
  "hashPrefix": "567159",
  "breachName": "Apollo",
  "hashSuffixes": [
    "D622FFBB50B11B0EFD307BE358624A26EE",
    "..."
  ]
}
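
To make that concrete, here's a minimal sketch in C# of the hashing described above (the names are my own for illustration, not HIBP's actual code): SHA-1 the address, keep the first 6 hex characters as the prefix and the remainder as the suffix a subscriber would compare against.

using System;
using System.Security.Cryptography;
using System.Text;

class HashPrefixDemo
{
    static void Main()
    {
        // SHA-1 the email address and hex-encode the digest.
        string email = "test@example.com";
        byte[] hash;
        using (var sha1 = SHA1.Create())
            hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(email));
        string digest = BitConverter.ToString(hash).Replace("-", "");

        // HIBP only ever stores the first 6 characters; the subscriber
        // keeps the rest and compares it against the callback's suffixes.
        string prefix = digest.Substring(0, 6);  // "567159"
        string suffix = digest.Substring(6);     // "D622FFBB50B11B0EFD307BE358624A26EE"
        Console.WriteLine($"{prefix} / {suffix}");
    }
}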

You can see the breach name here is "Apollo"; that's an incident of some 126M email addresses which you can read about in this WIRED piece. It's a very large breach and I loaded it not long after Mozilla went live with their service, which had attracted a very large number of people signing up for notifications. This left two very large numbers multiplying each other to create the total number of callbacks I needed to send after the data load.

Now, the way the callbacks work is they all go into an Azure storage queue with each new message triggering a function execution. Azure Functions are awesome because they're "serverless" which, in theory at least, means no logical infrastructure boundaries and you simply pay per execution. Except there are boundaries, one of which caused the aforementioned outage.
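
If you haven't seen one before, a queue-triggered function is only a few lines; this is roughly the shape of it (a sketch with illustrative names, not the actual HIBP code):

using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class CallbackFunction
{
    // Each message landing on the storage queue triggers one execution;
    // the platform spins up more instances as the queue depth grows.
    [FunctionName("SendCallback")]
    public static void Run(
        [QueueTrigger("callback-queue")] string message,
        ILogger log)
    {
        log.LogInformation($"Processing callback: {message}");
        // ... deserialise the message and POST it to the subscriber's API
    }
}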

So that's the background, let's get onto what happened when the notification queuing process began lining up those hash range callbacks:

What you're seeing here in green is the number of invocations of the function, and the orange line shows the average execution time. There's absolutely no action in the lead-up to about 07:30 because there are no callbacks to send. However, that's when I started dumping them into a queue which in turn began triggering the function. You can see the steady ramp-up of executions as the Azure Function service spins up more and more instances to process the queue messages. You can also see the execution time flat-lining at 10 seconds all the way until just after 08:00. That's the first problem so let's talk about that:

The pattern above absolutely hammered Mozilla. I was looking at the logs as messages were queuing and noticed a heap of the callbacks being aborted at the 10 second timeout mark. This is the timeout I set on the HttpClient object in the function so that a request isn't left hanging for too long if the receiver can't respond. I noticed it pretty quickly (this was the first load of a really big data set since they came on board so I was watching carefully) and reached out to them. They acknowledged the issue and logged a bug accordingly:

We are receiving a spike of HIBP callback requests which look up our subscribers by sha1 hash. The RDS node is maxing on CPU. :(
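
That 10 second mark maps directly to the Timeout on the HttpClient. As a minimal sketch (the static shared instance is my assumption rather than confirmed HIBP code, but it's the usual pattern for avoiding socket exhaustion in a function app):

using System;
using System.Net.Http;

public static class CallbackSender
{
    // One shared client across executions with a hard 10 second cap;
    // any callback the subscriber can't answer in time gets aborted.
    private static readonly HttpClient Client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(10)
    };
}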

When I originally implemented the callback logic, I built a retry model: if the call fails, the message goes back into the queue with an initial visibility delay set to a random period between 1 and 10 minutes. The rationale was that if there were a bunch of failures like these ones, it would be better to wait a while and ensure there wasn't a sudden influx of new requests at the same time. More than 5 consecutive failures using that model would cause the message to go to a "poison" queue where it would sit until it expired.
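
Sketched out (again with names of my own choosing, on the classic storage SDK), that model looks something like this:

using System;
using Microsoft.WindowsAzure.Storage.Queue;

public static class CallbackRetry
{
    private static readonly Random Rng = new Random();

    public static void Requeue(CloudQueue queue, CloudQueue poisonQueue,
        string payload, int failureCount)
    {
        var message = new CloudQueueMessage(payload);

        // More than 5 consecutive failures: park the message on the
        // poison queue where it sits until it expires.
        if (failureCount > 5)
        {
            poisonQueue.AddMessage(message);
            return;
        }

        // Otherwise hide it for a random 1-10 minutes so a burst of
        // failures doesn't come straight back as a burst of retries.
        var delay = TimeSpan.FromMinutes(Rng.Next(1, 11));
        queue.AddMessage(message, initialVisibilityDelay: delay);
    }
}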

In theory, this is all fine; Mozilla was getting overwhelmed but the messages weren't lost and they'd be resent later on. Just to take a bit of pressure off them, I pushed a change which multiplied that visibility period by 10 so the message would reappear between 10 and 100 minutes later. (The additional period doesn't really matter that much; it would simply mean that subscribers to Mozilla's service would receive their notification up to about an hour and a half later.) In pushing that change, I moved the definition of the minimum and maximum visibility periods to config settings so I could just tweak them in the Azure portal later on without redeploying.
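
The change itself is trivial; something like this, with setting names I've made up for illustration:

using System;

public static class VisibilityConfig
{
    // The bounds now live in app settings, so they can be changed in the
    // Azure portal without a redeploy (setting names are illustrative).
    public static TimeSpan RandomDelay(Random rng)
    {
        var min = int.Parse(Environment.GetEnvironmentVariable("VisibilityDelayMinMinutes"));
        var max = int.Parse(Environment.GetEnvironmentVariable("VisibilityDelayMaxMinutes"));
        return TimeSpan.FromMinutes(rng.Next(min, max + 1));
    }
}

And then stuff broke.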

I didn't see it initially because it was Saturday morning and I was trying to have brekkie with the family (so yeah, it's really the kids' fault).

