Serverless throttling: denial of wallet

When designing a product or subsystem, you need to look at the cost model. You ask and answer questions like:

Where will this be deployed?
What are the fixed costs of deploying there?
What controls do we have, especially for max costs?
What are the predicted, marginal costs as usage increases?
Does the value we produce (money or other types of value) scale with that usage enough to justify those future costs?

For a recent project, I evaluated serverless options for hosting customer-facing traffic (web requests and formal APIs). The main attractions are usage-based pricing and a lower operational burden.

This is for a service that is called regularly: it ingests data and is queried repeatedly. It's not something that is called occasionally to adjust a configuration here and there.

There are many things to consider beyond the cost model when making a decision, but the cost model is a pretty important one. For the way I want to run this type of service, there didn't appear to be a good pure-serverless option.

There are many reasonable options out there. I most heavily researched Cloudflare Workers, Google API Gateway/Cloud Functions, and many AWS offerings (API Gateway->Lambda, ALB->Lambda, and Lambda function URLs). I also saw similar issues in services I looked at more briefly. I'd be happy to know about viable options.

To get this out of the way: denial-of-service attacks are a risk, but they are at least rare. A malicious actor wants to do damage, stopping legitimate customers from using your service and causing you much grief and runaway costs in the process. It could happen at the packet level (traditional DDoS) or from thousands of coordinated clients using bogus API credentials.

If you have auto-scaling configured or your bandwidth is metered, your costs could spike considerably. You do get to decide where to stop scaling capacity on most products, whether you're running serverless functions or auto-scaling IaaS/container instances. And vendors will sometimes forgive bills after DDoS incidents (not that you should count on it).

But I'm not focused on those kinds of rare occurrences. Yes, they are a risk, and you need a plan. But they are temporary problems (we hope!).

What I've seen in my career are services with a constant hum of throttles and authorization errors.

Good services refuse traffic if they are overloaded or if a client has used up its quota. This protects the service's other customers, and it protects the service itself.

Every client that gets a 429 error dutifully backs off, right? They also have adaptive algorithms that look at the success vs. refusal rates and zero in on the correct call rate to match the current conditions. Mmm-hmm, that's exactly what happens 👌

What actually happens is that most clients call your service in a loop and keep trying. Or someone accidentally forgot to add a zero on the parameter to a sleep() call. Or only one in five calls succeeds, but no one ever notices/cares because their higher-level objectives are met, and there are no outbound call metrics/tracing. Or an SDK/HTTP library is retrying behind the scenes and only bubbles up an error past a specific retry count. Or people are just seeing what happens when they call something a lot (it happens).

Clients also keep calling with expired/deactivated keys. They may have stopped being your customer and forgotten about something calling the API regularly. Maybe people are aware of that failing call in the cronjob, but it's just not important to fix right now. Perhaps it's a bad actor, and the API key never worked in the first place.

On top of it all, many of your attempts to contact people will go nowhere. In some cases, you may decide to point out what is happening and see if it would be possible for someone to tone down their errant client. This may even be helpful to them, as it might be the first time they're hearing about the problem. We saw some pretty high client call rates at AWS, far past the throttling thresholds, and while they were likely misconfigured setups, it was tough to get in touch to ask what was happening.

These are things that happen. And it's all perfectly OK. We're better off accepting reality and planning for it.

At the very least, we want per-client throttling. Actions from one customer should not be able to affect other customers. This would require a serverless product to analyze your traffic and look for identifying information, most likely one contained in an HTTP header.
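As a minimal sketch of what per-client throttling means in practice, here is a token bucket kept per API key. Everything in it is an assumption for illustration: the `x-api-key` header name, the class names, and the rate and burst numbers are hypothetical, and a real border service would also need shared state across nodes.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Mapping


@dataclass
class TokenBucket:
    rate: float      # tokens added per second
    capacity: float  # maximum burst size
    tokens: float    # tokens currently available
    updated: float   # timestamp of the last refill

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientThrottle:
    """One bucket per API key, so a noisy client only drains its own bucket."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def check(self, headers: Mapping[str, str]) -> int:
        """Return an HTTP status: 401 (no key), 429 (throttled), 200 (allowed)."""
        key = headers.get("x-api-key")  # hypothetical header name
        if not key:
            return 401
        now = self.clock()
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.capacity, now)
            self.buckets[key] = bucket
        return 200 if bucket.allow(now) else 429
```

With a rate of 1 call per second and a burst of 2, a client that fires three calls at once gets two successes and a 429, while a second client with a different key is unaffected.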

Out of all the serverless options I looked at, Amazon API Gateway was the only one I found with this capability. They have a "usage plan" concept that you can tie to a specific API key and then limit the total calls made by that one key. There's a soft limit of 300 usage plans per region per account (there are techniques to spread across multiple accounts if you have to, but you'd be moving into do-it-yourself territory).

They send throttled responses for you at no extra charge, which was a welcome surprise. Unfortunately, the base prices for API calls are too high for the services I was designing. Your budget may vary (YBMV). There are even-pricier WAF options from vendors that handle some of these needs.

Amazon API Gateway was the only thing close to what I was looking for. Some of the other solutions give you a way to cap the total traffic heading to your service. It is definitely important to have that type of backstop, but particular clients getting out of control is a much bigger, daily issue.

If you have 1000 callers, and 50 of them are going nuts or doing malicious things, you should have 950 happy customers. There is a lot more you can do to reduce the blast radius from issues with one customer (one of my favorite subjects, and there's a lot to it). But per-client throttling at the border is table stakes.

This is where the biggest problem starts. If the vendor needs to invoke your serverless function for you to decide to return a throttled error, your cost of mitigation is now similar or equal to the cost of a normal API call.

Plan accordingly.

You're charged by call rate, and then on top of this, you're either charged by fine-grained usage (e.g., in Lambda, you are metered by the ms) or there is a limit on the duration (e.g., in one of the paid Cloudflare Workers plans, you get up to 50ms).

That latter model can be insidious: your sub-millisecond throttling decision has used one "unit" in the vendor's pricing model. And they come up with their pricing by ensuring that calls that are potentially hundreds of times more expensive are also covered at their maximum allowable call rate. Their cost models need to work, too, and that's OK. But in this scheme, you'll get less value from their offering as the number of calls that need to be rejected or throttled increases. Note that Cloudflare Workers are attractively priced (in my opinion), and they have a different option that is metered in a more fine-grained way; I'm just pointing out a dynamic of such a model. At a certain scale, it starts to have an effect.

You also need access to global state for your throttling decisions to be remotely accurate. On Cloudflare, you might choose Durable Objects. On AWS Lambda, you might choose ElastiCache (and now you're moving away from serverless again). Nothing will be free (nor should it be).

As you consider concrete scenarios, the cost model may appear to work because each new customer pays you more, and each one has a quota. But you should consider this whole other class of calls that don't fall into that category. They are a potentially unbounded input into the system, and they are hard to predict ahead of time. If you pay the vendor by call count, the vendor sees no difference between a normal function call and one that returns a throttled/rejected response.
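To make that dynamic concrete, here is a small back-of-the-envelope calculation. The price and traffic volumes are made up for illustration and are not any vendor's actual rates; the only assumption carried over from the text is that a rejected call is billed like any other call.

```python
def monthly_call_charge(accepted_calls: int, rejected_calls: int,
                        price_per_million: float) -> float:
    """Per-call charge when every invocation is billed, accepted or rejected."""
    return (accepted_calls + rejected_calls) / 1_000_000 * price_per_million


def junk_share(accepted_calls: int, rejected_calls: int) -> float:
    """Fraction of the per-call bill spent on calls that were rejected."""
    return rejected_calls / (accepted_calls + rejected_calls)


PRICE = 1.00  # hypothetical dollars per million calls, not a real price

# Same paying traffic in both months; only the junk traffic differs.
quiet_month = monthly_call_charge(100_000_000, 0, PRICE)
noisy_month = monthly_call_charge(100_000_000, 300_000_000, PRICE)
share = junk_share(100_000_000, 300_000_000)

print(f"quiet: ${quiet_month:.2f}, noisy: ${noisy_month:.2f}, junk: {share:.0%}")
```

With these invented numbers, the bill quadruples while the traffic that customers actually pay for stays flat, and three quarters of the per-call spend goes to returning errors.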

In a more "serverful" solution (running on instances or a managed container service), the price of returning throttled errors can be quite cheap. I prefer using a combination of Redis (for cross-instance correctness) and short-lived in-memory tables (for dealing with way-over-the-top call rates and short-term caching of authorization decisions).
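One way such a two-tier arrangement could look is sketched below. It is not a description of any particular production setup: the class names are hypothetical, the algorithm is a simple fixed-window counter, and `InMemorySharedCounter` is a stand-in for a shared store so the example runs on its own. With Redis, the `incr` method would typically map to an INCR followed by an EXPIRE on the same key.

```python
import time
from typing import Callable, Dict, Tuple


class InMemorySharedCounter:
    """Stand-in for a shared store such as Redis (INCR plus EXPIRE)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock
        self.data: Dict[str, Tuple[int, float]] = {}
        self.calls = 0  # number of (simulated) round trips to the store

    def incr(self, key: str, ttl_seconds: float) -> int:
        self.calls += 1
        now = self.clock()
        count, expires = self.data.get(key, (0, now + ttl_seconds))
        if now >= expires:  # expired entries start over
            count, expires = 0, now + ttl_seconds
        self.data[key] = (count + 1, expires)
        return count + 1


class TwoTierThrottle:
    """Shared counter for correctness, local table for cheap repeat rejections."""

    def __init__(self, shared: InMemorySharedCounter, limit_per_window: int,
                 window_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.shared = shared
        self.limit = limit_per_window
        self.window_seconds = window_seconds
        self.clock = clock
        self.denied_until: Dict[str, float] = {}  # short-lived local table

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        until = self.denied_until.get(client_id)
        if until is not None:
            if now < until:
                return False  # rejected locally, no shared-store round trip
            del self.denied_until[client_id]
        window = int(now // self.window_seconds)
        key = f"throttle:{client_id}:{window}"
        count = self.shared.incr(key, ttl_seconds=self.window_seconds * 2)
        if count > self.limit:
            # Remember the denial locally until the current window ends.
            self.denied_until[client_id] = (window + 1) * self.window_seconds
            return False
        return True
```

The point of the local table is that a client hammering the service far past its limit costs one dictionary lookup per rejected call on that instance, rather than one network round trip.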

You can address repeated, nefarious behavior or astronomical call rates by temporarily cutting off source IP addresses. This can happen in your intake logic, but even better, you can dynamically adjust your initial HTTP proxy or dynamically add rules to your cloud provider's border firewall (which often has limits on the number of rules, but they at least won't cost you any money to enforce on a per-call basis).
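The intake-logic variant of this could be as simple as a strike counter with expiring bans, sketched below. The class name, the thresholds, and the choice to never decay strikes are all simplifying assumptions for illustration; pushing the same decision into a proxy or a provider firewall would go through that product's own configuration API, which is not shown here.

```python
import time
from typing import Callable, Dict


class TemporaryBlocklist:
    """Ban a source IP for a fixed period after repeated violations."""

    def __init__(self, max_strikes: int, ban_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_strikes = max_strikes
        self.ban_seconds = ban_seconds
        self.clock = clock
        self.strikes: Dict[str, int] = {}
        self.banned_until: Dict[str, float] = {}

    def record_violation(self, ip: str) -> bool:
        """Count one violation; return True if this call triggered a ban."""
        count = self.strikes.get(ip, 0) + 1
        if count >= self.max_strikes:
            self.strikes.pop(ip, None)
            self.banned_until[ip] = self.clock() + self.ban_seconds
            return True
        self.strikes[ip] = count
        return False

    def is_blocked(self, ip: str) -> bool:
        """Check before doing any other work for the request."""
        until = self.banned_until.get(ip)
        if until is None:
            return False
        if self.clock() >= until:
            del self.banned_until[ip]  # ban expired, forget the entry
            return False
        return True
```

A handler would call `is_blocked` first and `record_violation` whenever a request is rejected for bad credentials or an extreme call rate.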

Another approach to the decision might be "deal with it later." It's often a good one! Serverless API products could get you out the door faster, and so they get MVPs/experiments into customers' hands faster. If something gets traction, you'll look at migrating it later. You're OK with taking a risk that some customers might cause issues for the whole service in the beginning. Your worst-case scenario has you paying a few hundred dollars a month extra in the first year, but the engineering time and operational overhead for a different solution would be far more costly.

A migration like that can be a large amount of work. But if you organize your code/architecture in certain ways, you can at least reduce that work if it ever becomes necessary (without costing you much in the short term). That's different from over-building/over-generalizing the code so that it actually handles multiple deployment scenarios today.

What would an attractive serverless solution look like?

Throttle by client identifier (e.g., API key in a header or a hash of it)
Perform basic border authentication by rejecting unknown API keys
Allow us to adjust the rules via API calls to your service. We will be adding and removing clients daily.
Allow us to add separate quotas for different method & path combinations (e.g., "POST /records" should be treated separately).
Allow us to add different types of quotas on the fly (e.g., we offer different rates to our customers if they get different subscriptions, and they may upgrade, etc.)
Throttle at a global ceiling, but base this on the total successful calls (not total calls, which include all the junk traffic being rejected/throttled already)
Probably asking for too much: allow for millions of unique method+path rules (to allow for quotas based on unique resource IDs in the path)
Charge us what's necessary to make it work. This is a rules engine that you could 100% control, and it should in practice be far less expensive than invoking a customer-provided, arbitrary serverless function. But it doesn't need to be free.
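To show how small the core of such a rules engine could be, here is a rough sketch covering several items from the list above: unknown keys are rejected, quotas are per plan and per method+path, keys can be added, moved, or removed at runtime, and the global ceiling counts only accepted calls. All names, the status codes chosen, and the single fixed window are hypothetical choices for illustration, not any vendor's design.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Route = Tuple[str, str]  # (method, path), e.g. ("POST", "/records")
DEFAULT_ROUTE: Route = ("*", "*")


@dataclass
class Plan:
    default_limit: int  # calls per window for routes without their own limit
    route_limits: Dict[Route, int] = field(default_factory=dict)


@dataclass
class BorderRules:
    plans: Dict[str, Plan]
    global_ceiling: int  # maximum accepted calls per window, all clients
    key_to_plan: Dict[str, str] = field(default_factory=dict)
    _used: Dict[Tuple[str, Route], int] = field(default_factory=dict)
    _accepted: int = 0

    def set_key(self, api_key: str, plan_name: str) -> None:
        """Add a client or move it to another plan (e.g., after an upgrade)."""
        if plan_name not in self.plans:
            raise KeyError(plan_name)
        self.key_to_plan[api_key] = plan_name

    def remove_key(self, api_key: str) -> None:
        self.key_to_plan.pop(api_key, None)

    def evaluate(self, api_key: str, method: str, path: str) -> int:
        plan_name = self.key_to_plan.get(api_key)
        if plan_name is None:
            return 401  # unknown key: rejected at the border
        plan = self.plans[plan_name]
        route = (method, path)
        if route in plan.route_limits:
            bucket, limit = route, plan.route_limits[route]
        else:
            bucket, limit = DEFAULT_ROUTE, plan.default_limit
        used = self._used.get((api_key, bucket), 0)
        if used >= limit:
            return 429  # per-client quota exhausted
        if self._accepted >= self.global_ceiling:
            return 503  # global backstop, based on accepted calls only
        self._used[(api_key, bucket)] = used + 1
        self._accepted += 1
        return 200

    def reset_window(self) -> None:
        """Call at each window boundary to start counting again."""
        self._used.clear()
        self._accepted = 0
```

Note that 401 and 429 responses never touch `_accepted`, so junk traffic cannot push legitimate callers into the global ceiling.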