Failure injection for AWS Lambda, Azure Functions and Cloud Functions

Chaos engineering is the practice of performing controlled experiments on a system in order to learn new things about how it behaves and to build confidence in both the system and our organization. It helps us build reliable and robust distributed systems.

Creating modern applications using serverless technology and managed services means that anyone can build distributed and highly available systems without worrying about the underlying infrastructure. But it’s not all fun and games. My talk at AWS re:Invent 2019, “Performing chaos engineering in a serverless world”, explains some of the challenges with serverless, common weaknesses in serverless applications, and the challenges of applying chaos engineering to serverless.

The slides for the talk (https://speakerdeck.com/gunnargrosch/performing-chaos-engineering-in-a-serverless-world-cmy301-aws-re-invent-las-vegas-december-2-2019)

Inject failure in our functions

AWS Serverless Hero Yan Cui has written several articles about latency injection for AWS Lambda, see “How can we apply the principles of chaos engineering to AWS Lambda?” (https://theburningmonk.com/2017/10/how-can-we-apply-the-principles-of-chaos-engineering-to-aws-lambda/) and “Applying principles of chaos engineering to AWS Lambda with latency injection” (https://hackernoon.com/chaos-engineering-and-aws-lambda-latency-injection-ddeb4ff8d983) from October 2017. These articles explain why we could and should use chaos engineering in our serverless applications and show examples of how to do it.

AWS Principal Developer Advocate Architecture Adrian Hornsby expanded on this by first creating a Lambda layer and later a Python library for failure injection, chaos_lambda (https://github.com/adhorn/aws-lambda-chaos-injection), giving developers an easier way to get started with chaos experiments for AWS Lambda. Just install the library, wrap your functions with the appropriate failure mode and you are ready to start injecting failure!

To reach the same level of simplicity for Node.js developers, late last year I created an npm package called failure-lambda (https://github.com/gunnargrosch/failure-lambda). In short, the goal of failure-lambda is to provide an easy way to do failure injection in AWS Lambda using several different failure modes. To make it even easier, I decided to use a single wrapper and make the failure mode selectable. That way you don’t have to make code changes if you want to switch between, for example, latency and exception injection; you just change a setting.
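The single-wrapper idea can be sketched roughly as follows. This is a minimal illustration, not the code of failure-lambda: the names withFailureInjection, failureModes and getConfig, and the config fields isEnabled, failureMode, latencyMs and exceptionMsg, are all hypothetical and chosen only for this example.

```javascript
// Hypothetical sketch: names and config fields are illustrative only.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// One entry per failure mode; adding a mode does not change the wrapper.
const failureModes = new Map([
  ["latency", async (config) => {
    await sleep(config.latencyMs);
  }],
  ["exception", async (config) => {
    throw new Error(config.exceptionMsg);
  }],
]);

// Wraps a handler once. Which failure is injected depends only on the
// configuration returned by getConfig at invocation time.
function withFailureInjection(handler, getConfig) {
  return async (event, context) => {
    const config = await getConfig();
    const inject = config.isEnabled
      ? failureModes.get(config.failureMode)
      : undefined;
    if (inject) {
      await inject(config);
    }
    return handler(event, context);
  };
}

// Usage: the handler code stays the same when the setting changes.
const settings = { isEnabled: true, failureMode: "latency", latencyMs: 50 };
const handler = withFailureInjection(
  async (event) => ({ statusCode: 200, body: "Hello" }),
  async () => settings
);
```

Switching from latency to exception injection here means changing settings.failureMode, not the handler. In a real deployment the configuration would presumably live outside the function code so it can be changed without redeploying.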

As we know, serverless isn’t just about AWS, and chaos engineering for serverless isn’t only about AWS Lambda. For that reason, the same failure injection options are now also available to Node.js developers building serverless applications on Azure Functions and Google Cloud Functions, through the npm packages failure-azurefunctions (https://github.com/gunnargrosch/failure-azurefunctions) and failure-cloudfunctions (https://github.com/gunnargrosch/failure-cloudfunctions).

Failure modes and rate of failure

Even though this all started with latency injection as in Yan Cui’s articles, latency is far from the only possible failure we can have in our serverless applications. In failure-lambda, failure-azurefunctions and failure-cloudfunctions there are now five different failure modes to choose from:

  • Latency
    Injects latency into the executed function, with the delay set by a minimum and a maximum value in milliseconds. This can, for example, be used to simulate service latency or to test and help set your timeout values.
  • Exception
    Throws an exception in the function. Helps you test how your application and code handle exceptions.
  • Status code
    Your function returns a status code of your choice, for instance 502 or 404, instead of the normal 200. This lets you test what happens when there are errors.
  • Disk space
    Fills your temporary disk with files to create a failure. If you’re using disk to store temporary files, you can test how your application behaves if that disk gets full or you are unable to write to it.
  • Blacklist (courtesy of Jason Barto)
    Blocks connections to specified hosts. Use it to simulate services or third parties being unavailable.
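To make three of the modes above more concrete, here is a rough sketch of the decision each one has to make. The helper names pickLatency, statusCodeResponse and isBlocked are hypothetical and are not taken from the packages. The sketch also assumes that blocked hosts are given as regular expression strings, which the list above does not specify.

```javascript
// Hypothetical helpers, not the packages' actual implementation.

// Latency: pick a delay between the configured minimum and maximum
// (both in milliseconds). `random` is injectable to make testing easy.
function pickLatency(minLatency, maxLatency, random = Math.random) {
  return Math.floor(minLatency + random() * (maxLatency - minLatency));
}

// Status code: the response returned instead of the normal 200.
function statusCodeResponse(statusCode) {
  return { statusCode };
}

// Blacklist: decide whether a connection to `host` should be blocked.
// Assumption: blocked hosts are written as regular expression strings.
function isBlocked(host, patterns) {
  return patterns.some((pattern) => new RegExp(pattern).test(host));
}

// Usage
const delayMs = pickLatency(100, 400); // somewhere from 100 up to 399
const response = statusCodeResponse(502);
const blocked = isBlocked("s3.eu-west-1.amazonaws.com", [
  "s3\\..*\\.amazonaws\\.com",
]);
```

Actually refusing the connection, or filling the temporary disk, needs platform-specific code and is left out of this sketch.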

All these failure modes can be used together with a rate of failure that you set. The default is to inject failure on every invocation, but real failures are often intermittent: a third party might be unavailable on 50% of the calls made to that host, or an exception might be thrown on a quarter of the invocations. Setting the rate lets you simulate this.
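A rate of failure can be implemented with a single random draw per invocation. The sketch below assumes the rate is expressed as a fraction between 0 and 1, where 1 means every invocation. The function name shouldInject is hypothetical.

```javascript
// Hypothetical sketch of a rate check, assuming rate is a fraction in [0, 1].
// Math.random() returns a value in [0, 1), so a rate of 1 always injects
// and a rate of 0 never does.
function shouldInject(rate, random = Math.random) {
  return random() < rate;
}

// Usage: count how often failure would be injected at a rate of 0.25.
let injected = 0;
const invocations = 10000;
for (let i = 0; i < invocations; i++) {
  if (shouldInject(0.25)) {
    injected++;
  }
}
console.log(`Injected on ${injected} of ${invocations} invocations`);
```

Over many invocations the injected share approaches the configured rate, but any single invocation is still decided at random.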

Getting started

All three npm packages contain step-by-step instructions on how to get started with them and begin injecting failure. There are also example applications that you can use to try them out.

https://github.com/gunnargrosch/failure-lambda#how-to-install
https://github.com/gunnargrosch/failure-azurefunctions#how-to-install
https://github.com/gunnargrosch/failure-cloudfunctions#how-to-install

If you want an even more in-depth explanation, I will show you how to install and use them all in the next article in this series.

About Gunnar

Gunnar is an evangelist at Opsio and an AWS Serverless Hero. He has previously worked as both a frontend and backend developer, as an operations engineer within cloud infrastructure, and as a technical trainer, in addition to several different management roles.

With a focus on building reliable and robust serverless applications, Gunnar has been one of the driving forces in creating techniques and tools for using chaos engineering in serverless. He regularly and passionately speaks at events on these and other serverless topics around the world.

Gunnar is also deeply involved in the community by organizing AWS User Groups and Serverless Meetups in the Nordics, as well as being an organizer of ServerlessDays Stockholm and AWS Community Day Nordics.