<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Chris Dobson on Medium]]></title>
        <description><![CDATA[Stories by Chris Dobson on Medium]]></description>
        <link>https://medium.com/@chrd?source=rss-b0df41a007b6------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*p4ITji0Yl2CBqCM6</url>
            <title>Stories by Chris Dobson on Medium</title>
            <link>https://medium.com/@chrd?source=rss-b0df41a007b6------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 08 Apr 2026 16:57:56 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@chrd/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Workflow orchestration with Lambda Durable Functions — part 2]]></title>
            <link>https://awstip.com/workflow-orchestration-with-lambda-durable-functions-part-2-6b70f9a9e6ba?source=rss-b0df41a007b6------2</link>
            <guid isPermaLink="false">https://medium.com/p/6b70f9a9e6ba</guid>
            <category><![CDATA[workflow]]></category>
            <category><![CDATA[aws-lambda]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[serverless]]></category>
            <dc:creator><![CDATA[Chris Dobson]]></dc:creator>
            <pubDate>Sun, 14 Dec 2025 18:32:18 GMT</pubDate>
            <atom:updated>2025-12-15T08:37:50.325Z</atom:updated>
            <content:encoded><![CDATA[<h3>Workflow orchestration with Lambda Durable Functions — part 2</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*URb_OOgRbYKMhH6BTFdUJA.png" /></figure><p><em>Having previously </em><a href="https://awstip.com/workflow-orchestration-with-lambda-durable-functions-part-1-3892a5a5a7aa"><em>created a Durable Function workflow</em></a><em>, it’s now time to add some resiliency and have a look at testing.</em></p><h3>Replay</h3><p>While the workflow I created in the first article worked nicely, it did have one issue.</p><p>I noticed when testing it out that the functions passed to the parallel step appeared to be consistently executing twice: once when I would have expected, and then once more when a callback was received. This explained something that had been confusing me when I looked at the execution log:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dUPdq1_pY_NAZ9wiGwBGpg.png" /></figure><p>The log above shows that the Callback has Succeeded but the Step had only started, which made no sense to me. Prior to the callback being received, the Step showed as Succeeded and the Callback as started, which did make sense.</p><p>Looking in the CloudWatch logs, this platform.start line appeared just before the function was executed a second time:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/920/1*SLpYGqVVqDc-TP1WvrLkzg.png" /></figure><p>After a bit of reading I realised this was due to the way that durable functions are executed, specifically the checkpoint and replay mechanism that is used. 
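<p>Conceptually, the mechanism can be pictured like this (a simplified sketch for illustration only; the Journal type and step helper below are my own invention, not the SDK’s actual API): the runtime records the result of every completed step, and on resume the handler re-runs from the top, with recorded steps returning their saved results instead of executing again.</p>

```typescript
// Simplified sketch of checkpoint-and-replay. This is NOT the SDK's
// real implementation; the Journal type and step() helper are invented
// here purely to illustrate the behaviour described in the article.
type Journal = Map<string, unknown>;

let sideEffects = 0;

function step<T>(journal: Journal, name: string, fn: () => T): T {
  if (journal.has(name)) {
    // Replay: this step already completed, so return the checkpointed
    // result without running the side effect again.
    return journal.get(name) as T;
  }
  const result = fn();
  journal.set(name, result); // checkpoint before moving on
  return result;
}

function workflow(journal: Journal): string {
  return step(journal, "create process", () => {
    sideEffects += 1; // anything here re-runs unless it was journaled
    return "process-1";
  });
}

const journal: Journal = new Map();
workflow(journal); // first run: the step really executes
workflow(journal); // resume/replay: the step is skipped
console.log(sideEffects); // 1, because journaled steps are not re-executed
```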
It looks like this is what was happening:</p><ul><li>The function hits a point where it needs to wait — waiting for callback in this case — the execution gets suspended.</li><li>The callback is received so the function needs to resume.</li><li>When resuming the function starts from the beginning and skips the first four steps as they have been completed.</li><li>As the ‘execute commands’ step hasn’t been completed it is executed again thereby executing all of the functions again.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/121/1*p4ihl-n5ibJJ5cosEx2jTg.png" /></figure><p>The way around this is to add some idempotency into the execution of those functions — this workflow is simulating a distributed system so hopefully there’d be some idempotency downstream but it seems a good idea to add some to the workflow as well.</p><p>The first step creates a unique processId and as this won’t be executed again this id can be used as an idempotency key. Using that I’ve added a check to each function using a very basic idempotency implementation in the executeOnce higher order function:</p><pre>const commandResults = await context.parallel(&quot;run commands&quot;, [<br>  async (parallelContext) =&gt; {<br>    await parallelContext.waitForCallback(<br>      &quot;command one&quot;,<br>      async (callbackId) =&gt; {<br>        await executeOnce(processId, &quot;commandOne&quot;, async () =&gt; commandOne(callbackId));<br>      },<br>      {<br>        timeout: { hours: 1 },<br>      }<br>    );<br>  },<br>  async (parallelContext) =&gt; {<br>    await parallelContext.waitForCallback(<br>      &quot;command two&quot;,<br>      async (callbackId) =&gt; {<br>        await executeOnce(processId, &quot;commandTwo&quot;, async () =&gt; commandTwo(callbackId));<br>      },<br>      {<br>        timeout: { hours: 1 },<br>      }<br>    );<br>  },<br>]);</pre><p>Once this was added the functions were no longer executed twice.</p><h3>Retries</h3><p>One thing I 
didn’t implement in the first article, but would usually want when orchestrating a workflow, is retries. In a durable execution the step and waitForCallback methods allow a retry strategy to be declared, which will be used whenever an exception is thrown.</p><p>A retry strategy is a function that is passed two parameters — the error that was thrown and the attempt number — and returns an object containing a shouldRetry property, indicating whether or not to retry, and a delay property, indicating how long to wait before retrying.</p><p>All but one of the steps in this workflow use AWS services and I’ve given them a very simple strategy: try 3 times regardless of error, waiting 2 minutes longer on every retry as a simple backoff. The strategy function looks like this:</p><pre>const strategy = (_: unknown, attempt: number) =&gt; ({<br>  shouldRetry: attempt &lt;= 3,<br>  delay: { minutes: attempt * 2 },<br>});</pre><p>The commandTwo function makes an HTTP request so I’ve made the retry strategy for that a little more complex: it declares a list of status codes that may be transient errors, retries only if the status is one of those, and gives the service a bit more time to recover. So, for instance, if I receive a 400 or 404, there’s no point retrying as I will get exactly the same response. The code for that strategy function is here:</p><pre>const strategy = (error: unknown, attempt: number) =&gt; {<br>  const httpError = error as HttpError;<br>  return {<br>    shouldRetry: httpError.status<br>      ? transientErrorStatuses.includes(httpError.status) &amp;&amp; attempt &lt;= 5<br>      : attempt &lt;= 5,<br>    delay: { minutes: 5 * attempt },<br>  };<br>};</pre><p>A note about retries is that they do not apply to a failure being returned in a callback — to be honest, I’m not sure whether I expected them to or not. 
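<p>Since a retry strategy is just a plain function, it can be exercised in isolation. The following standalone sketch (it assumes only the (error, attempt) shape described above; nothing here touches the SDK) walks the simple backoff strategy through its attempts:</p>

```typescript
// Standalone sketch: walk the simple backoff strategy described above
// through its attempts. Only the (error, attempt) => { shouldRetry, delay }
// shape is taken from the article; nothing here depends on the SDK.
const strategy = (_: unknown, attempt: number) => ({
  shouldRetry: attempt <= 3,
  delay: { minutes: attempt * 2 },
});

// Collect the delay a failing step would wait before each retry.
const delays: number[] = [];
for (let attempt = 1; strategy(undefined, attempt).shouldRetry; attempt++) {
  delays.push(strategy(undefined, attempt).delay.minutes);
}

console.log(delays); // [ 2, 4, 6 ]: three retries with a growing wait
```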
If a waitForCallback receives a failure notification, an exception is thrown regardless of any retry strategy.</p><h3>Parallel failures</h3><p>It turns out I missed a couple of things when first using the parallel method. I was testing for success by checking the status of each result like this:</p><pre>const failedCommands = <br>  commandResults.all.filter((result) =&gt; result.status === &quot;FAILED&quot;);<br>if (failedCommands.length) {<br>  throw new Error(`${processId} failed ${JSON.stringify(failedCommands, null, 2)}`);<br>}</pre><p>This worked fine; however, there are better ways of doing it. The return from parallel includes a completionReason property which could be tested instead:</p><pre>if (commandResults.completionReason !== &quot;ALL_COMPLETED&quot;) {<br>  throw new Error(`${processId} failed ${JSON.stringify(commandResults.failed(), null, 2)}`);<br>}</pre><p>There are also failed() and succeeded() methods which return lists of the failed and successful items.</p><p>The completionReason property isn’t just a success or failure flag: it can be one of ALL_COMPLETED, MIN_SUCCESSFUL_REACHED or FAILURE_TOLERANCE_EXCEEDED, because the parallel method allows a failure tolerance to be configured. The completionConfig property can be defined something like this:</p><pre>context.parallel(&quot;name&quot;, [.....], {<br>  minSuccessful: 2,<br>  toleratedFailureCount: 1,<br>  toleratedFailurePercentage: 20,<br>});</pre><p>Here minSuccessful is the minimum number of executions that must succeed for the step to be considered a success, and toleratedFailureCount and toleratedFailurePercentage are the number and percentage of failures tolerated respectively.</p><p>In this workflow I haven’t defined a completionConfig as I would like all of the executions to be successful.</p><h3>Testing</h3><p>One thing that can be difficult with orchestrations such as this is testing them locally, but in the case of durable functions that has been made much easier. 
The <a href="https://www.npmjs.com/package/@aws/durable-execution-sdk-js-testing">@aws/durable-execution-sdk-js-testing</a> library is available and can be used to test the function both locally, and therefore in a CI environment, and test the deployed function in the AWS environment. I will be concentrating on the local testing in this article.</p><p>While I was expecting to be able to use it in <a href="https://www.npmjs.com/package/vitest">vitest</a> I had a problem with running it — I suspect there is some config I’ve missed and the fact that I gave up quickly is due to impatience on my part. So I installed <a href="https://www.npmjs.com/package/jest">jest</a> instead and things ran well — although as I needed support for ESM I had to use the --experimental-vm-modules node flag.</p><p>Firstly I wanted to test a happy path so therefore needed to create a test that could send callback notifications both from a single waitForCallback and two running in parallel. This turned out to be pretty simple.</p><p>Firstly I needed to add code to setup and teardown a test environment and create a local runner to execute the test:</p><pre>import { handler } from &quot;../index&quot;;<br><br>beforeAll(() =&gt; LocalDurableTestRunner.setupTestEnvironment({ skipTime: true }));<br>afterAll(LocalDurableTestRunner.teardownTestEnvironment);<br><br>const runner = new LocalDurableTestRunner({ handlerFunction: handler });</pre><p>To execute the function I could now use runner.run() which returns a promise however I didn’t want to await it at that point as I needed to setup the various callbacks I needed first:</p><pre>const executionPromise = runner.run();<br><br>const approvalCallback = runner.getOperation(&quot;ask for approval&quot;);<br>await approvalCallback.waitForData(WaitingOperationStatus.STARTED);<br>approvalCallback.sendCallbackSuccess(JSON.stringify({ approved: true }));<br><br>const commandOneCallback = runner.getOperation(&quot;command one&quot;);<br>await 
commandOneCallback.waitForData(WaitingOperationStatus.STARTED);<br>commandOneCallback.sendCallbackSuccess();<br><br>const commandTwoCallback = runner.getOperation(&quot;command two&quot;);<br>await commandTwoCallback.waitForData(WaitingOperationStatus.STARTED);<br>commandTwoCallback.sendCallbackSuccess();<br><br>const execution = await executionPromise;</pre><p>Once the run() method has been called — but not awaited — I could then access the different operations that required callbacks, wait for them to start and then send a success notification. Once that was all setup I could await the promise and do some assertions.</p><p>As every operation can be accessed through the runner I could then check that each operation had done what I expected — for instance this code checks the create process operation returned the expected processId :</p><pre>const createProcessOperation = runner.getOperation(&quot;create process&quot;);<br>expect(createProcessOperation.getStepDetails()?.result).toBe(processId);</pre><p>Once the happy path was tested I wanted to test that any callbacks receiving a failure notification would fail the workflow. For instance to test a failure when asking for an approval:</p><pre>const approvalCallback = runner.getOperation(&quot;ask for approval&quot;);<br>await approvalCallback.waitForData(WaitingOperationStatus.STARTED);<br>approvalCallback.sendCallbackFailure();</pre><p>As well as testing for failure notifications I wanted to test that if any running processes didn’t end in time then also fail the workflow. The waitForCondition that checks for running processes makes 10 attempts with 10 minutes between each attempt — unsurprisingly I didn’t really want a test that lasted for 100 minutes! Because I setup the test environment using the skipTime option I didn’t have to — the read from the Dynamo table was mocked so I changed the mock to always return an item in the list and the test executes each attempt without having to wait. 
So the test looked like this:</p><pre>it(&quot;should fail if running processes do not complete&quot;, async () =&gt; {<br>  mockCreateProcess.mockResolvedValue(processId);<br>  mockFindRunningProcesses.mockResolvedValue([{ processId: &quot;1&quot; }]);<br><br>  const executionPromise = runner.run();<br><br>  const approvalCallback = runner.getOperation(&quot;ask for approval&quot;);<br>  await approvalCallback.waitForData(WaitingOperationStatus.STARTED);<br>  approvalCallback.sendCallbackSuccess(JSON.stringify({ approved: true }));<br><br>  const execution = await executionPromise;<br><br>  expect(execution.getStatus()).toBe(&quot;FAILED&quot;);<br>});</pre><p>Finally I wanted to test a retry strategy. This time using a mock to reject the promise first time and then in subsequent attempts resolve it meant that I could ensure that the step was re-tried correctly. This tests the retry strategy for the create process step:</p><pre>it(&quot;should complete if create process fails once&quot;, async () =&gt; {<br>  mockCreateProcess.mockRejectedValueOnce(new Error()).mockResolvedValue(processId);<br>  ...<br>  const executionPromise = runner.run();<br>  ...<br>  const execution = await executionPromise;<br><br>  expect(execution.getStatus()).toBe(&quot;SUCCEEDED&quot;);<br>});</pre><p><a href="https://github.com/ChrisDobby/durable-function-demo/blob/main/functions/workflow/src/__tests__/index.test.ts">This is the test suite I ended up with</a> — while it doesn’t test every single step thoroughly I was able to prove to myself that I would be able to if I had the time and inclination!</p><p>So I’ve ended up with a fairly resilient orchestration using Lambda Durable Functions that uses callbacks, runs processes in parallel and waits for a condition, which has a test suite running locally. 
Mission accomplished ✅</p><p>The final code for the workflow Lambda is <a href="https://github.com/ChrisDobby/durable-function-demo/blob/main/functions/workflow/src/index.ts">here</a> and is deployed using this <a href="https://github.com/ChrisDobby/durable-function-demo/blob/main/lib/durable-function-demo-stack.ts">CDK</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6b70f9a9e6ba" width="1" height="1" alt=""><hr><p><a href="https://awstip.com/workflow-orchestration-with-lambda-durable-functions-part-2-6b70f9a9e6ba">Workflow orchestration with Lambda Durable Functions — part 2</a> was originally published in <a href="https://awstip.com">AWS Tip</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Workflow orchestration with Lambda Durable Functions — part 1]]></title>
            <link>https://awstip.com/workflow-orchestration-with-lambda-durable-functions-part-1-3892a5a5a7aa?source=rss-b0df41a007b6------2</link>
            <guid isPermaLink="false">https://medium.com/p/3892a5a5a7aa</guid>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[workflow]]></category>
            <category><![CDATA[serverless]]></category>
            <category><![CDATA[aws-lambda]]></category>
            <dc:creator><![CDATA[Chris Dobson]]></dc:creator>
            <pubDate>Wed, 10 Dec 2025 09:35:35 GMT</pubDate>
            <atom:updated>2025-12-16T15:39:45.345Z</atom:updated>
            <content:encoded><![CDATA[<h3>Workflow orchestration with Lambda Durable Functions — part 1</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oyhFsCw35CprKgYTTMDzhQ.png" /></figure><p><em>AWS have recently released </em><a href="https://docs.aws.amazon.com/lambda/latest/dg/durable-functions.html"><em>Lambda Durable Functions</em></a><em> which allow us to build multi-step workflows, so I thought I’d give them a go and build one myself. This is how I got on…</em></p><h3>The workflow</h3><p>The workflow I built was based on something I was planning on implementing in my job using Step Functions.</p><p>It’s a fairly simple flow for the time being — I need to ask a user for an approval and issue commands to other domains to execute long-running processes; once they are completed the workflow is complete.</p><p>In addition, as the processes that will be executing cause significant pressure downstream, I only want to run one of the workflows at a time.</p><p>Finally I want some record of the status of the process.</p><p>So the different steps are:</p><figure><img alt="Diagram of the steps described in the text above" src="https://cdn-images-1.medium.com/max/121/1*CqfkaSkhAYQISP6kSkiOeQ.png" /></figure><h3>Creating the function</h3><p>I created the function using Node and TypeScript, so the first thing to do was install the <a href="https://www.npmjs.com/package/@aws/durable-execution-sdk-js">@aws/durable-execution-sdk-js</a> package — this package is available in the Lambda execution environment so there’s no need to bundle it with your deployment if you don’t wish.</p><p>Creating a durable function is a case of wrapping a handler with the withDurableExecution higher order function:</p><pre>export const handler = withDurableExecution(async (event, context) =&gt; {</pre><p>The event parameter is the same as for a regular Lambda — the event that triggered the function — while the context is a DurableContext which 
gives us access to the step-based workflow. In this article I will use the following methods of the context object:</p><ul><li>step</li><li>waitForCallback</li><li>waitForCondition</li><li>parallel</li></ul><p>So the first part of the workflow is to create a record in a Dynamo table for the process. This can be done inside a step:</p><pre>const processId = await context.step(&quot;create process&quot;, async (stepContext) =&gt; {<br>  stepContext.logger.info(&quot;creating new process&quot;);<br>  return await createProcess();<br>});</pre><p>The first parameter is the name of the step — all context methods I’ve used have this parameter — and the second is the function used to execute the step. This function is provided with a StepContext object which provides a logger as I’ve used in the code. The createProcess helper adds a record to the Dynamo process table and returns a process id and that becomes the return value of the step.</p><p>So I’ve now started the workflow by adding a process record to the table…</p><h3>Asking for user approval</h3><p>Next I need to seek approval from a user.</p><p>I’ve created a helper function — sendApproval — that will construct a URL including the detail needed to callback into the function and publishes it to an SNS topic. For the purpose of testing I can then subscribe my email address to the topic, click on the URL and approve the request.</p><p>I need to send that message and wait for the user to approve — to do this I use the waitForCallback method. 
This method accepts a function that gets passed a callback id, which can then be used to call back into the function:</p><pre>const userApproval = await context.waitForCallback(<br>  &quot;ask for approval&quot;,<br>  async (callbackId) =&gt; {<br>    await sendApproval(callbackId);<br>  },<br>  {<br>    timeout: { hours: 1 },<br>  }<br>);</pre><p>In this case I’ve added a timeout of one hour, meaning that my user has an hour to approve the request.</p><p>To call back into the function the Lambda SDK has two new commands — SendDurableExecutionCallbackSuccessCommand and SendDurableExecutionCallbackFailureCommand — which take two parameters: the callback id and an optional result. Sending a success notification back with a result indicating that the user approved the request would look something like this:</p><pre>await lambdaClient.send(new SendDurableExecutionCallbackSuccessCommand({<br>  CallbackId: callbackId,<br>  Result: JSON.stringify({ success: true }),<br>}));</pre><p>The result is then returned from waitForCallback — if a failure notification is sent or the request times out then an exception is thrown which, if not caught, will cause the function to fail.</p><h3>Waiting for other workflows to complete</h3><p>Next I need to ensure that there are no processes currently running before I can start this one. 
To do this I need to scan the process table to check for anything with a status of in-progress: if there are none the process can start; if there are, wait for a length of time and scan again until the process can start.</p><p>This can be done using the waitForCondition method — it takes the step name, a function that checks the table and returns a boolean (true once there are no running processes), and an object specifying the initial state and a wait strategy.</p><p>For this case I’ve set the initialState to false and given a waitStrategy that waits 10 minutes between attempts and tries 10 times before giving up:</p><pre>const haveAllProcessesFinished = await context.waitForCondition(<br>  &quot;check for running processes&quot;,<br>  async () =&gt; {<br>    const runningProcesses = await findRunningProcesses();<br>    return runningProcesses.length === 0;<br>  },<br>  {<br>    initialState: false,<br>    waitStrategy: (state, attempt) =&gt; ({<br>      shouldContinue: !state &amp;&amp; attempt &lt;= 10,<br>      delay: { minutes: 10 },<br>    }),<br>  }<br>);</pre><p>Unlike waitForCallback, no exception is thrown once waitStrategy.shouldContinue returns false, so I test the return value and throw an exception if the result is false (there are still running processes):</p><pre>if (!haveAllProcessesFinished) {<br>  throw new Error(&quot;Processes not finished in a timely manner&quot;);<br>}</pre><p>Once there are no processes running I can use another step to update the process record to have a status of in-progress:</p><pre>await context.step(&quot;start process&quot;, async () =&gt; {<br>  await setProcessStatus(processId, &quot;in-progress&quot;);<br>});</pre><h3>Executing commands</h3><p>The main part of this workflow is the execution of two commands — I need to wait until both of these commands have completed before moving on. 
As previously I’ll be using the waitForCallback method to execute the commands and wait for a notification so the code for both would look like this:</p><pre>await context.waitForCallback(<br>  &quot;command one&quot;,<br>  async (callbackId) =&gt; {<br>    await commandOne(callbackId);<br>  },<br>  {<br>    timeout: { hours: 1 },<br>  }<br>);<br><br>await context.waitForCallback(<br>  &quot;command two&quot;,<br>  async (callbackId) =&gt; {<br>    await commandTwo(callbackId);<br>  },<br>  {<br>    timeout: { hours: 1 },<br>  }<br>);</pre><p>However I don’t want to have to execute these sequentially, ideally I’d like to execute them at the same time and move on once both have completed. Fortunately there is a parallel method that can be used for just that. As well as a name parameter I also pass an array of functions to be executed in parallel:</p><pre>const commandResults = await context.parallel(&quot;run commands&quot;, [<br>  async (parallelContext) =&gt; {<br>    await parallelContext.waitForCallback(<br>      &quot;command one&quot;,<br>      async (callbackId) =&gt; {<br>        await commandOne(callbackId);<br>      },<br>      {<br>        timeout: { hours: 1 },<br>      }<br>    );<br>  },<br>  async (parallelContext) =&gt; {<br>    await parallelContext.waitForCallback(<br>      &quot;command two&quot;,<br>      async (callbackId) =&gt; {<br>        await commandTwo(callbackId);<br>      },<br>      {<br>        timeout: { hours: 1 },<br>      }<br>    );<br>  },<br>]);</pre><p>Rather than use the function context parameter when inside a parallel method the functions get passed a separate DurableContext object.</p><p>When a waitForCallback either times out or receives a failure notification it will throw an exception, however when inside the parallel method the exception will be caught and returned along with other executions. 
So I’m checking the return value from the parallel call and looking for any results with a status of FAILED — if there are any, an exception is thrown so that the function execution fails:</p><pre>const failedCommands = commandResults.all.filter((result) =&gt; result.status === &quot;FAILED&quot;);<br>if (failedCommands.length) {<br>  throw new Error(`${processId} failed ${JSON.stringify(failedCommands, null, 2)}`);<br>}</pre><p>Once the commands have been successfully executed the workflow is complete.</p><p>I’ve mentioned a few times that throwing an exception will fail the execution of the function, which is what I want. However, before completing the execution I need to change the status of the process in the table so that any processes waiting to start will be able to. To do this, as you would expect, I catch the exception, set the status in the process table to failed and rethrow the exception:</p><pre>catch (error) {<br>  context.logger.error(error);<br>  await setProcessStatus(processId, &quot;failed&quot;);<br>  throw error;<br>}</pre><h3>Deployment</h3><p>I’ve deployed this demo using CDK — a new durableConfig property has been added to the Lambda types. 
In this case I’ve set the execution timeout to 2 days (which, given that everything times out after an hour, will be far longer than needed) and the retention period for the execution logs to 14 days:</p><pre>durableConfig: {<br>  executionTimeout: cdk.Duration.days(2),<br>  retentionPeriod: cdk.Duration.days(14),<br>}</pre><h3>Executions</h3><p>When a Lambda is configured as a durable function you get a new tab called Durable Executions, which shows all executions and, when an execution is selected, all the steps inside that execution:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/514/1*xUU2eUlx8q20ZBYrWfMbLg.png" /></figure><p>This shows a list of executions for the function:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KGoGdpo7o_mZYaplRzHwEA.png" /></figure><p>Clicking into an execution gives a list of steps that were executed:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZumU3aWxTE4BI7-KmdmPaQ.png" /></figure><p>Mostly I’ve found this pretty good. Sometimes, especially when executing in parallel, it seems to be incorrect while the execution is in progress, but once it’s complete it is correct.</p><p>The code for this function can be found <a href="https://github.com/ChrisDobby/durable-function-demo/blob/part-1/functions/workflow/src/index.ts">here</a> with the CDK <a href="https://github.com/ChrisDobby/durable-function-demo/blob/part-1/lib/durable-function-demo-stack.ts">here</a> — the deployment also includes components that mock the callbacks and allow it to be executed.</p><p>I have now built a nice workflow — in <a href="https://awstip.com/workflow-orchestration-with-lambda-durable-functions-part-2-6b70f9a9e6ba">part 2</a> I will look at the real power of durable functions and add in some resilience with retries and replays as well as looking at testing the function.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3892a5a5a7aa" width="1" height="1" alt=""><hr><p><a 
href="https://awstip.com/workflow-orchestration-with-lambda-durable-functions-part-1-3892a5a5a7aa">Workflow orchestration with Lambda Durable Functions — part 1</a> was originally published in <a href="https://awstip.com">AWS Tip</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Localstack in VSCode]]></title>
            <link>https://medium.com/@chrd/localstack-in-vscode-a8a9e8d53d10?source=rss-b0df41a007b6------2</link>
            <guid isPermaLink="false">https://medium.com/p/a8a9e8d53d10</guid>
            <category><![CDATA[localstack]]></category>
            <category><![CDATA[serverless]]></category>
            <category><![CDATA[aws-step-functions]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[aws-lambda]]></category>
            <dc:creator><![CDATA[Chris Dobson]]></dc:creator>
            <pubDate>Sun, 14 Sep 2025 14:38:54 GMT</pubDate>
            <atom:updated>2025-09-14T14:38:54.072Z</atom:updated>
            <content:encoded><![CDATA[<h3>LocalStack in VSCode</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/400/1*A0QBlVRnfGG-yMgvMRsHMQ.png" /></figure><p><em>Developing using LocalStack from within VSCode just got a lot easier with the latest update to the AWS Toolkit adding support for LocalStack profiles.</em></p><p>In a <a href="https://medium.com/aws-tip/a-local-lambda-development-environment-b34929eddecb">previous post</a> I detailed an environment I’ve been using for local development using <a href="https://www.localstack.cloud/">LocalStack</a>.</p><p>One of the tools I mentioned was the <a href="https://aws.amazon.com/visualstudiocode/">AWS Toolkit</a> which at the time was tricky to configure to use <a href="https://www.localstack.cloud/">LocalStack</a>. To use the toolkit with <a href="https://www.localstack.cloud/">LocalStack</a> meant editing the settings file to add a line for every service (some of which didn’t work) and, of course, to go back to AWS environment remove those lines. 
I also couldn’t get Lambda remote debugging to work (although this could easily have been me!)</p><p>Well the toolkit has had an update introducing direct support for <a href="https://www.localstack.cloud/">LocalStack</a> meaning that a single operation can switch between AWS and local environments.</p><h3>Setting up LocalStack in the toolkit</h3><p>When the latest version of the toolkit extension has been installed select ‘Walkthrough of Application Builder’</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/652/1*OoRhkaKqHDNBGKqlo1wBmA.png" /></figure><p>there will be an option to install <a href="https://www.localstack.cloud/">LocalStack</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xmwzZ6qVTI2yIRywdNuGiQ.png" /></figure><p>This will open a dialogue</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/906/1*3lRHoJk3q2yqT90FdhXfEg.png" /></figure><p>Setup will authenticate with <a href="https://www.localstack.cloud/">LocalStack</a> and install the <a href="https://aws.amazon.com/visualstudiocode/">AWS Toolkit</a> integration.</p><p>Once that’s installed <a href="https://www.localstack.cloud/">LocalStack</a> will appear in the VSCode status bar.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/330/1*uH-Sn0I65IJeB4SMOnJOUA.png" /></figure><p>Clicking on that will give the option to start <a href="https://www.localstack.cloud/">LocalStack</a> from within VSCode. When starting <a href="https://www.localstack.cloud/">LocalStack</a> your license key is retrieved from your account so that you are running the correct version.</p><p>One thing I did notice is when using <a href="https://rancherdesktop.io/">Rancher</a> instead of <a href="https://www.docker.com/">Docker</a> the extension didn’t update the status of <a href="https://www.localstack.cloud/">LocalStack</a> to started — I had to restart VSCode to be able to use <a href="https://www.localstack.cloud/">LocalStack</a> in the toolkit. 
I believe this has something to do with <a href="https://rancherdesktop.io/">Rancher</a> implementing notifications differently, so on the odd occasion I can’t use <a href="https://www.docker.com/">Docker</a> I simply start LocalStack from the CLI before starting VSCode.</p><p>Once <a href="https://www.localstack.cloud/">LocalStack</a> is running, the status bar updates</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/616/1*_sqOASAEwMCkwQIZPwrvlw.png" /></figure><p>And there is a <a href="https://www.localstack.cloud/">LocalStack</a> profile available to use with the toolkit — click on the current profile in the status bar to get a list of available profiles</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5isp-zCvRAK22A1QbNRXag.png" /></figure><p>Select the profile and the toolkit will be pointing to your <a href="https://www.localstack.cloud/">LocalStack</a> instance.</p><h3>Lambda invocation and debugging</h3><p>There are a few ways to add a Lambda function into your <a href="https://www.localstack.cloud/">LocalStack</a> instance and have it available for debugging:</p><ul><li><strong>Create a new SAM Application</strong></li></ul><p>A new SAM application can be created from within the toolkit. A right-click on ‘Lambda’ in the toolkit explorer opens a menu containing ‘Create Lambda SAM Application’</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aAttr-bL8gaOPvZtd4PWCw.png" /></figure><p>Selecting this will go through a number of options, eventually creating a new SAM application containing a Lambda function.</p><p>Once this has been created, it can be deployed to <a href="https://www.localstack.cloud/">LocalStack</a> from the toolkit ‘Application Builder’. 
Find the application in the list and right-click to get a menu</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/832/1*QlAQvyrnu4RhD9VDfzhV0A.png" /></figure><p>Select ‘Deploy SAM Application’ to deploy your Lambda to <a href="https://www.localstack.cloud/">LocalStack</a></p><ul><li><strong>Copy a function from your AWS environment</strong></li></ul><p>Something I’ve found useful is the ability to copy a Lambda function from my AWS environment into <a href="https://www.localstack.cloud/">LocalStack</a> relatively quickly.</p><p>To do this, ensure the toolkit is using your AWS profile, find the Lambda you want to copy in the explorer and right-click; you will see this menu</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/832/1*aBkzMAfDXahZgE5ggeYpPg.png" /></figure><p>Select ‘Convert to SAM Application’ to create a new Lambda SAM Application from the function.</p><p>Once complete, switch the toolkit to use your <a href="https://www.localstack.cloud/">LocalStack</a> profile and the application can be deployed in the same way as a newly created one.</p><ul><li><strong>Create a Lambda from CDK</strong></li></ul><p><a href="https://medium.com/aws-tip/a-local-lambda-development-environment-b34929eddecb">This blog post</a> looked at a setup I’ve been using to develop locally — using CDK in that way (or any other way) to deploy a Lambda to <a href="https://www.localstack.cloud/">LocalStack</a> will make it available for remote debugging.</p><p>Once a Lambda function has been added to <a href="https://www.localstack.cloud/">LocalStack</a>, it can be invoked using the option in the toolkit.</p><p>Find the Lambda in the toolkit and select the ‘Invoke Remotely’ option</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/786/1*-0bvGTSkGH3bZR3kE-2BQA.png" /></figure><p>This opens the ‘Remote invoke configuration’ window</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6mJ1WkEQpUtlQm_TshWt-g.png" /></figure><p>Use that to 
invoke and debug the function as required.</p><h3>Step functions</h3><p>This update has also made developing and testing Step Functions locally significantly easier.</p><p>The Step Functions visual designer has been available in the toolkit for a while and is excellent for creating and editing from within VSCode, and now it can be used to deploy and test in <a href="https://www.localstack.cloud/">LocalStack</a>.</p><p>Simply select the <a href="https://www.localstack.cloud/">LocalStack</a> profile for the <a href="https://aws.amazon.com/visualstudiocode/">AWS toolkit</a>, go to ‘Step Functions’ in the explorer and either right-click to create a new state machine or find the state machine you want to work with in the list, right-click and select ‘Open with Workflow Studio’ to open up the designer.</p><p>Once finished designing, the ‘Save &amp; Deploy’ option can be used to either create a new state machine or update an existing one in <a href="https://www.localstack.cloud/">LocalStack</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Mer9XapVOTSMDBK7qVB4cw.png" /></figure><p>One thing I’ve found really useful is the ability to easily copy an existing state machine from my AWS environment and deploy it to <a href="https://www.localstack.cloud/">LocalStack</a> to test and edit locally.</p><p>To do that, ensure that the toolkit is using the AWS environment, find the state machine in the explorer, right-click and select ‘Download Definition’.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/902/1*7QAUaqCucDMuHjHoZCP1Fg.png" /></figure><p>This will download the definition of the state machine into VSCode; select the definition to open it in the designer, switch profiles so that the toolkit is using <a href="https://www.localstack.cloud/">LocalStack</a> and deploy it. 
The state machine can then be edited and deployed through the designer at will and, if required, it can be deployed back to the AWS environment by switching profiles back.</p><p>All in all, so far I’ve found the new <a href="https://www.localstack.cloud/">LocalStack</a> integration a useful addition to my toolkit, especially the ability to quickly switch between environments to copy resources to <a href="https://www.localstack.cloud/">LocalStack</a> from AWS.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a8a9e8d53d10" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A local Lambda development environment]]></title>
            <link>https://awstip.com/a-local-lambda-development-environment-b34929eddecb?source=rss-b0df41a007b6------2</link>
            <guid isPermaLink="false">https://medium.com/p/b34929eddecb</guid>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[serverless]]></category>
            <category><![CDATA[localstack]]></category>
            <category><![CDATA[aws-lambda]]></category>
            <dc:creator><![CDATA[Chris Dobson]]></dc:creator>
            <pubDate>Sun, 20 Jul 2025 10:02:37 GMT</pubDate>
            <atom:updated>2025-07-28T15:54:02.924Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AcH_eBqwmL-s5VQUOIMc9g.png" /></figure><p><em>Sharing some ways in which I streamline my workflow and speed up the feedback loop when developing for AWS Lambda</em></p><p>Two tools I use when developing for AWS serverless in general are <a href="https://www.localstack.cloud/">LocalStack</a> and the <a href="https://aws.amazon.com/visualstudiocode/">AWS Toolkit for VSCode</a>, which has recently been updated to include new features making it more compelling to use with Lambda.</p><p>The main way I speed up my feedback loop when developing in AWS Lambda is by using <a href="https://www.localstack.cloud/">LocalStack</a>. LocalStack is essentially an AWS environment running locally on your machine, and I find it really useful for quickly testing many things, not just Lambda functions.</p><p>It has a free tier which gives access to everything I’ve used in this article, so go and sign up!</p><h3>Setup LocalStack</h3><p>Once <a href="https://app.localstack.cloud/sign-up">signed up with LocalStack</a>, getting going is pretty simple.</p><p>LocalStack runs in a Docker container and once you’ve followed the steps in the <a href="https://app.localstack.cloud/getting-started">Getting Started guide</a> you should be up and running. Once everything’s running, the <a href="https://app.localstack.cloud/inst/default/resources">LocalStack console</a> should look something like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9HjWDjmsyQBBTIuGgg13ww.png" /></figure><p>To demonstrate a typical setup I use when developing, I’m going to go through how I’d create a simple system containing an SQS queue, a Lambda function and an S3 bucket. 
Messages on the SQS queue trigger the Lambda, which writes the message that was received into the bucket.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/559/1*BTWJxqMkkJSWabINgRRBqA.png" /></figure><h3>Automate deployment to LocalStack</h3><p>Generally I would be deploying code using <a href="https://www.npmjs.com/package/aws-cdk">CDK</a> so, given that <a href="https://www.npmjs.com/package/aws-cdk">CDK</a> is installed, I would <a href="https://www.npmjs.com/package/aws-cdk#cdk-init">initialise a new CDK project</a> — in my case I would usually be using TypeScript as the language:</p><pre>cdk init --language typescript</pre><p>and then create the three components and necessary permissions in the CDK stack:</p><pre>const bucket = new Bucket(this, &quot;automated-bucket&quot;, {<br>  bucketName: &quot;automated-bucket&quot;,<br>  versioned: true,<br>  removalPolicy: cdk.RemovalPolicy.DESTROY,<br>})<br><br>const sqs = new Queue(this, &quot;automated-queue&quot;, {<br>  queueName: &quot;automated-queue&quot;,<br>  visibilityTimeout: cdk.Duration.seconds(300),<br>  retentionPeriod: cdk.Duration.days(4),<br>  removalPolicy: cdk.RemovalPolicy.DESTROY,<br>})<br><br>const lambda = new NodejsFunction(this, &quot;automated-lambda&quot;, {<br>  entry: &quot;functions/test/src/index.ts&quot;,<br>  handler: &quot;handler&quot;,<br>  runtime: cdk.aws_lambda.Runtime.NODEJS_22_X,<br>  environment: {<br>    BUCKET_NAME: bucket.bucketName,<br>  },<br>})<br><br>lambda.addEventSource(<br>  new SqsEventSource(sqs, {<br>    batchSize: 10,<br>  })<br>)<br><br>sqs.grantConsumeMessages(lambda)</pre><p>If I wanted to deploy this to an AWS environment I would use the following CDK commands:</p><pre>cdk bootstrap<br>cdk deploy</pre><p>To deploy this to the LocalStack environment I install <a href="https://www.npmjs.com/package/aws-cdk-local">CDK Local</a>, a wrapper around CDK that executes the commands against LocalStack. 
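</p><p><em>An aside of my own: the Lambda handler itself isn’t shown in this post. As a rough sketch (not the actual repository code), the part of the handler that turns an incoming SQS record into the parameters for an S3 PutObject call can be kept as a pure function so it is testable without AWS; the recordToPutParams name and the messages/ key scheme here are illustrative assumptions:</em></p>

```typescript
// Sketch: build the input for an S3 PutObject call from an incoming SQS record.
// Pure function, so it can be unit tested without talking to AWS.
interface SqsRecord {
  messageId: string;
  body: string;
}

export const recordToPutParams = (bucket: string) => (record: SqsRecord) => ({
  Bucket: bucket,
  Key: `messages/${record.messageId}.json`, // illustrative key scheme
  Body: record.body,
});
```

<p>The handler would then map each record in the SQS event through a helper like this and send the results with the AWS SDK S3 client, using the BUCKET_NAME environment variable set in the stack above.</p><p>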
So to bootstrap and deploy:</p><pre>cdklocal bootstrap<br>cdklocal deploy</pre><p>Now everything is deployed to LocalStack, which is great; however, I’d rather not run a cdklocal command manually every time I’ve updated anything. To automate this deployment I use <a href="https://www.npmjs.com/package/nodemon">nodemon</a> to watch for any changes in the stack and run the deploy command. To do that I use the following nodemon config file:</p><pre>{<br>  &quot;exec&quot;: &quot;cdklocal deploy --require-approval never&quot;,<br>  &quot;watch&quot;: [<br>    &quot;lib&quot;,<br>    &quot;functions&quot;<br>  ],<br>  &quot;ext&quot;: &quot;ts&quot;<br>}</pre><p>This configuration watches for changes in the lib directory, which contains the CDK stack, and the functions directory, which contains the Lambda function; if any changes are made to ts files it runs cdklocal deploy --require-approval never . Approvals are turned off as I don’t want to have to approve any security-related changes while I’m developing — I just want to save the file and it gets deployed. This is a local development environment so there shouldn’t be any problems 🤞</p><p>Now, if I execute nodemon in the root of the project, whenever I make any changes to either the stack or the Lambda function they will be deployed to LocalStack for me to test.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2t6p2pRd6c5P6VHmZ_FZdA.png" /></figure><p>This repository setup is <a href="https://github.com/ChrisDobby/local-lambda-dev-setup/tree/cdk-deployment">here</a>.</p><p>This is great and all changes get deployed ready to test; however, as you can see from the screenshot, there is a delay of a few seconds, especially when changing the Lambda. 
Given that that’s the component that would be changed most often when developing, it would be good to get rid of the delay if possible…</p><h3>Lambda hot reload in LocalStack</h3><p>Well it is possible, using LocalStack’s Lambda <a href="https://docs.localstack.cloud/aws/tooling/lambda-tools/hot-reloading/">hot reload</a> 🔥</p><p>To use hot reloading, create a Lambda that retrieves its code from an S3 bucket called hot-reload with a key of the path to the code on your local machine; LocalStack will then update the function every time there’s a change to files in the specified path.</p><p>Note — there’s no need to actually create the hot-reload bucket.</p><p>To use this I would usually add the Lambda into the CDK stack using hot reload:</p><pre>   new Function(this, &quot;hot-reloading-function&quot;, {<br>      runtime: cdk.aws_lambda.Runtime.NODEJS_22_X,<br>      functionName: &quot;hot-reloading-function&quot;,<br>      handler: &quot;index.handler&quot;,<br>      code: Code.fromBucket(<br>        Bucket.fromBucketName(this, &quot;hot-reload&quot;, &quot;hot-reload&quot;),<br>        &quot;path-to-repository/functions/test/dist&quot;<br>      ),<br>    });</pre><p>Note the name of the bucket and the key in the call to Code.fromBucket.</p><p>Then remove the functions directory from the nodemon config so that it only deploys when the stack is changed:</p><pre>{<br>  &quot;exec&quot;: &quot;cdklocal deploy --require-approval never&quot;,<br>  &quot;watch&quot;: [<br>    &quot;lib&quot;<br>  ],<br>  &quot;ext&quot;: &quot;ts&quot;<br>}</pre><p>Add another nodemon config into the Lambda directory that builds it whenever it’s changed:</p><pre>{<br>  &quot;exec&quot;: &quot;npm run build&quot;,<br>  &quot;ext&quot;: &quot;ts&quot;<br>}</pre><p>Then run nodemon in the Lambda directory as well as in the root of the project so that if the function code is changed the Lambda is hot reloaded, while if the stack is changed a full deploy happens.</p><p>This repository 
setup is <a href="https://github.com/ChrisDobby/local-lambda-dev-setup/tree/hot-reload-deployment">here</a>.</p><h3>AWS Toolkit for VSCode</h3><p>As a VSCode user I also often use the <a href="https://aws.amazon.com/visualstudiocode/">AWS Toolkit for VSCode</a> for a number of things — especially the local Workflow Studio for editing Step Functions.</p><p>Recently (17th July 2025) AWS released an <a href="https://aws.amazon.com/blogs/aws/simplify-serverless-development-with-console-to-ide-and-remote-debugging-for-aws-lambda/">update</a> that adds two new features that I would expect to become very useful when developing for Lambda:</p><ul><li>Console to IDE integration, which allows you to open a Lambda function in VSCode directly from the AWS console and then deploy any changes directly from the IDE.</li><li>Remote debugging, which allows you to debug a Lambda function that is running in an AWS environment from within VSCode</li></ul><p>Both features are pretty simple to use…</p><h3>Console to IDE integration</h3><p>There is now a new button on the Lambda console — Open in Visual Studio Code</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vsdsqhbq6pZZbqsKZ2VlHQ.png" /></figure><p>which does exactly what it says and allows you to open and edit the function in the IDE and gives the option to deploy when changes have been made.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/870/1*o0zitTNcTfghOBTSSSIpzA.png" /></figure><p>This can also be initiated from within the IDE — find the function you want to edit in the Explorer and select the ‘Download…’ option; the function is then downloaded and can be edited and deployed.</p><h3>Remote debugging</h3><p>Once the Lambda is available in the IDE, which can be done using either of the two techniques above, it can be debugged remotely. 
To do this, right-click on the Lambda in the explorer and select ‘Invoke Remotely’, or click the arrow button, and the ‘Remote invoke configuration’ dialog opens.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RNKVjdyRU8bIqc9x7jj2vg.png" /></figure><p>This dialog includes a checkbox for ‘Remote debugging’: tick that box, add a breakpoint into the function code, hit the ‘Remote invoke’ button and the Lambda will be invoked in the AWS environment but the breakpoint you’ve added will be hit in the IDE 🎉</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1016/1*ojZi84SXBRyEahHmCjirLQ.png" /></figure><h3>Use AWS Toolkit with LocalStack</h3><p>Wouldn’t it be cool if these two new Lambda features could be used within LocalStack?</p><p>Well, there’s good news 😃 and bad news 😦!</p><p>The good news is that the IDE integration seems to work; however, I’ve not been able to get the remote debugging to work as yet (if I do I’ll update this post!)</p><p>It’s a little bit fiddly to set up as the toolkit doesn’t expose any configuration to use different endpoints, so you need to edit the VSCode settings.json and add the following:</p><pre>&quot;aws.dev.endpoints&quot;: {<br>  &quot;lambda&quot;: &quot;http://localhost.localstack.cloud:4566&quot;<br>}</pre><p>This tells the toolkit to use the LocalStack endpoint for Lambda, meaning that the IDE integration can be used on LocalStack as well.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b34929eddecb" width="1" height="1" alt=""><hr><p><a href="https://awstip.com/a-local-lambda-development-environment-b34929eddecb">A local Lambda development environment</a> was originally published in <a href="https://awstip.com">AWS Tip</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Keeping supporters up to date using AWS Events and Web Push (part 4)]]></title>
            <link>https://medium.com/@chrd/keeping-supporters-up-to-date-using-aws-events-and-web-push-part-4-bac616cfaf35?source=rss-b0df41a007b6------2</link>
            <guid isPermaLink="false">https://medium.com/p/bac616cfaf35</guid>
            <category><![CDATA[event-driven-architecture]]></category>
            <category><![CDATA[serverless-architecture]]></category>
            <category><![CDATA[web-push-notifications]]></category>
            <category><![CDATA[cricket]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Chris Dobson]]></dc:creator>
            <pubDate>Sun, 23 Feb 2025 14:54:35 GMT</pubDate>
            <atom:updated>2025-02-23T14:54:35.503Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JA0ejw7JW_WfC9OaK7RHEw.png" /></figure><p><em>Adding Web Push notifications</em></p><p>This is the final article in a series showing how I built a solution to keep supporters of a sports club up to date using an event-driven architecture in AWS and <a href="https://developer.mozilla.org/en-US/docs/Web/API/Push_API"><strong>Web Push</strong></a> notifications.</p><p>Previous articles:</p><ul><li><a href="https://chrisdobby.dev/keeping-supporters-up-to-date-using-aws-events-and-web-push-part-1-98bdb8b1a29e"><strong>Part 1</strong></a></li><li><a href="https://chrisdobby.dev/keeping-supporters-up-to-date-using-aws-events-and-web-push-part-2-f8d93c356c41"><strong>Part 2</strong></a></li><li><a href="https://chrisdobby.dev/keeping-supporters-up-to-date-using-aws-events-and-web-push-part-3-0846e9abfebf"><strong>Part 3</strong></a></li></ul><h3>Goals</h3><ul><li>Keep supporters up to date as much as possible during the games</li><li>Use an event-driven architecture in AWS</li><li>Require no manual intervention (I’ve better things to do at the weekend)</li><li>Minimal cost as it’s running in my AWS account</li></ul><h3>Part 4</h3><p>Now that the latest score updates are being published to an SNS topic I want to push notifications to any devices that have subscribed to receive them.</p><p>No one wants to receive push notifications twice a minute, so I decided to only send notifications for the following:</p><ul><li>Every 10 overs</li><li>Every wicket</li><li>Every player milestone</li><li>The result</li></ul><p>I also needed a way for users to register/unregister to receive these notifications.</p><h3>Subscription API</h3><p>To create and delete subscriptions I use an API Gateway <a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api.html">HTTP API</a> with a single route with POST, PUT and DELETE methods, which expect a body that is validated 
using this <a href="https://github.com/colinhacks/zod">Zod</a> schema:</p><pre>import { z } from &#39;zod&#39;;<br><br>const SubscriptionSchema = z.object({<br>  endpoint: z.string(),<br>  keys: z.object({<br>    p256dh: z.string(),<br>    auth: z.string(),<br>  }),<br>});</pre><p>I created a new Dynamo table to store the subscriptions with a partition key of the endpoint URL; the POST and PUT methods make a PUT request to the table passing the validated body.</p><p>The DELETE method makes a request to the table to delete the endpoint URL passed as a parameter.</p><p>The methods are all integrated with a single Lambda, which uses the httpMethod property of the event:</p><pre>export const handler = async ({ body, httpMethod }) =&gt; {<br>  const validateResult = validateSubscription(JSON.parse(body));<br>  if (!validateResult.success) {<br>    return { statusCode: 400, body: JSON.stringify(validateResult.error) };<br>  }<br><br>  const { data: subscription } = validateResult;<br>  switch (httpMethod) {<br>    case &#39;POST&#39;:<br>      await subscribe(subscription);<br>      break;<br>    case &#39;DELETE&#39;:<br>      await unsubscribe(subscription.endpoint);<br>      break;<br>    case &#39;PUT&#39;:<br>      await update(subscription);<br>      break;<br>  }<br>  return { statusCode: 200 };<br>};</pre><p>Finally, I wanted some security on this API and decided a simple API key would suffice. 
Each route is secured using the same simple <a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-use-lambda-authorizer.html">authoriser Lambda</a>, which tests the header against the expected key:</p><pre>export const handler = async ({ headers: { authorization } }) =&gt;<br>    ({ isAuthorized: authorization === process.env.API_KEY });</pre><h3>Creating notifications</h3><p>As previously mentioned, I didn’t want to send every single update, so I needed to create a buffer that received the updates from the SNS topic and only sent a web push notification when needed.</p><p>I created a Lambda that subscribes to the SNS topic, compares the data with what was sent for the previous notification and, if a new notification is due, creates that notification and adds it to an SQS queue. The code that checks whether a notification is required is <a href="https://github.com/ChrisDobby/live-scores/blob/2a468c1cffd0f419130c6fc67ff260f4b6b292e0/packages/functions/push-notify/src/updates.ts#L112">here</a>.</p><p>Once I had created the SQS queue for sending notifications, I decided it would be useful to send a confirmation notification whenever a new subscription is received so I updated the Subscription API to put a notification onto the queue. This way a user can confirm everything is working as it should when they subscribe, rather than realising no updates are coming through during a game.</p><h3>Sending notifications</h3><p>To use web push I needed to create a Vapid key-pair. 
To do this I used this <a href="https://vapidkeys.com/">tool</a>.</p><p>To send the notifications I used the <a href="https://www.npmjs.com/package/web-push">web-push library</a>, getting the Vapid details from the environment:</p><pre>const vapidSubject = `${process.env.VAPID_SUBJECT}`;<br>const vapidPublicKey = `${process.env.VAPID_PUBLIC_KEY}`;<br>const vapidPrivateKey = `${process.env.VAPID_PRIVATE_KEY}`;<br><br>webpush.setVapidDetails(vapidSubject, vapidPublicKey, vapidPrivateKey);</pre><p>Then I needed to get the subscriptions from the Dynamo table. I did this using a Scan, which should be fine for the foreseeable future but may not scale very well for a large number of subscriptions; I will cross that bridge if I get there.</p><p>Once I have all of the subscriptions I can send to them using sendNotification:</p><pre>await webpush.sendNotification(<br>    subscription,<br>    JSON.stringify({ title: &#39;Hello&#39;, body: &#39;Hello from live-scores&#39; })<br>);</pre><h3>Expired subscriptions</h3><p>Although there is an API call that can be made to delete a subscription, sometimes it might not be called when a subscription is disabled, or the subscription may expire.</p><p>When sendNotification is called for an expired subscription, an exception is thrown. As these subscriptions cannot become active again (as far as I know) it makes sense to delete them from the table so as not to waste resources trying to send to them every time.</p><p>As I already had the code written to delete a subscription from the API, it made sense to try and re-use it. Rather than call the API for each expired subscription, I decided to create an SQS queue with a new Lambda that reads from it and deletes the subscription from the table. 
Both the API DELETE method and the code that handles expired subscriptions then put a message on the queue:</p><pre>import webpush, { PushSubscription, WebPushError } from &#39;web-push&#39;;<br><br>const send = (removeSubscription: RemoveSubscription, notification: string) =&gt; async (subscription: PushSubscription) =&gt; {<br>  try {<br>    await webpush.sendNotification(subscription, notification);<br>  } catch (e: unknown) {<br>    if (e instanceof WebPushError &amp;&amp;<br>        (e.body.includes(&#39;unsubscribed&#39;) || e.body.includes(&#39;expired&#39;))) {<br>      removeSubscription(subscription.endpoint);<br>    } else {<br>      console.error(e);<br>    }<br>  }<br>};</pre><p>The architecture looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/604/1*AUC3-FuoyMBBzpX7qrK-Gg.png" /></figure><h3>Subscribing/unsubscribing from a web client</h3><p>To use web push from the client I first needed to check that it is supported in the browser, which I did using this expression: &#39;serviceWorker&#39; in navigator &amp;&amp; &#39;PushManager&#39; in window &amp;&amp; &#39;showNotification&#39; in ServiceWorkerRegistration.prototype</p><p>Once I’ve established that it is supported, I use PushManager.subscribe, passing in the public Vapid key, which will create a new subscription object for the website. To retrieve an existing subscription I use PushManager.getSubscription. This object matches the <a href="https://github.com/colinhacks/zod">Zod</a> schema that is used to validate the subscription API calls.</p><p>Then it’s a question of either making a POST or DELETE request to the API depending on whether the subscription is being created or deleted. 
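</p><p><em>One client-side detail worth noting (my addition rather than something from the post): PushManager.subscribe expects the applicationServerKey as a Uint8Array, while the public Vapid key is usually handed around base64url-encoded, so a small conversion helper is typically needed. A sketch, using Node’s Buffer; in the browser atob would be used instead:</em></p>

```typescript
// Sketch: convert a base64url-encoded VAPID public key into the Uint8Array
// that PushManager.subscribe expects as its applicationServerKey.
export const urlBase64ToUint8Array = (base64Url: string): Uint8Array => {
  // Restore standard base64: re-add padding and swap the URL-safe characters.
  const padding = "=".repeat((4 - (base64Url.length % 4)) % 4);
  const base64 = (base64Url + padding).replace(/-/g, "+").replace(/_/g, "/");
  return new Uint8Array(Buffer.from(base64, "base64"));
};
```

<p>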
The code I use to subscribe from a Remix website is <a href="https://github.com/ChrisDobby/cleckheaton-cc/blob/ba9ce60f839ca7b666c101ff064688008acb69a3/web/app/components/subscription.tsx#L93">here</a>.</p><p>This also does some extra checks when running on iOS: if the OS version is 16.4 or over and web push is not supported, then adding the site to the home screen should make it available.</p><p>The code for this article can be found on <a href="https://github.com/ChrisDobby/live-scores/tree/blog-part-4"><strong>this branch</strong></a> and the full service is <a href="https://github.com/ChrisDobby/live-scores"><strong>here</strong></a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bac616cfaf35" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Keeping supporters up to date using AWS Events and Web Push (part 3)]]></title>
            <link>https://medium.com/@chrd/keeping-supporters-up-to-date-using-aws-events-and-web-push-part-3-0846e9abfebf?source=rss-b0df41a007b6------2</link>
            <guid isPermaLink="false">https://medium.com/p/0846e9abfebf</guid>
            <category><![CDATA[aws-lambda]]></category>
            <category><![CDATA[cricket]]></category>
            <category><![CDATA[event-driven-architecture]]></category>
            <category><![CDATA[serverless]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Chris Dobson]]></dc:creator>
            <pubDate>Sun, 09 Feb 2025 14:10:37 GMT</pubDate>
            <atom:updated>2025-02-09T14:10:37.366Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JA0ejw7JW_WfC9OaK7RHEw.png" /></figure><p><em>Adding WebSocket push notifications</em></p><p>This is part 3 of a series of articles showing how I built a solution to keep supporters of a sports club up to date using an event-driven architecture in AWS and <a href="https://developer.mozilla.org/en-US/docs/Web/API/Push_API">Web Push</a> notifications.</p><p>Previous articles:</p><ul><li><a href="https://chrisdobby.dev/keeping-supporters-up-to-date-using-aws-events-and-web-push-part-1-98bdb8b1a29e">Part 1</a></li><li><a href="https://chrisdobby.dev/keeping-supporters-up-to-date-using-aws-events-and-web-push-part-2-f8d93c356c41">Part 2</a></li></ul><h3>Goals</h3><ul><li>Keep supporters up to date as much as possible during the games</li><li>Use an event-driven architecture in AWS</li><li>Require no manual intervention (I’ve better things to do at the weekend)</li><li>Minimal cost as it’s running in my AWS account</li></ul><h3>Part 3</h3><p>Now that the latest score updates are being published to an SNS topic, I needed to update the club website through a <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API">WebSocket</a> to keep the score page up-to-date. Before being able to send the updates, though, there needed to be a way for the page to connect to the <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API">WebSocket</a>.</p><h3>Creating a WebSocket API</h3><p>First, I needed to use API Gateway to create a <a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-websocket-api.html">WebSocket</a> API. 
In this project the infrastructure is created using <a href="https://www.terraform.io/">Terraform</a> and the code to create the API is <a href="https://github.com/ChrisDobby/live-scores/blob/13d76de8bb3ba3f12eec9f2a869bc3bf5d2d0fc8/terraform/api-gateway.tf#L2">here</a>, but, as always, a <a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-websocket-api.html">WebSocket API</a> can also be created through the <a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-websocket-api-create-empty-api.html">AWS console or CLI</a>.</p><p>In this instance, the socket connection was only going to be used to send messages, not receive anything, so I only needed the $connect and $disconnect routes.</p><h3>Connecting</h3><p>I added the $connect route and an integration with a Lambda, socket-connect; when the $connect route is selected in the console the detail looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aFhqGA3jF-lMD-ToCT2HAQ.png" /></figure><p>The socket-connect Lambda receives an event as a parameter that includes a requestContext.connectionId value, which is the unique id that can be used to send a message on the <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API">WebSocket</a>. 
I created a new Dynamo table to store the connection ids as its partition key along with a TTL of 24 hours to keep the table clean.</p><p>The code for this Lambda is fairly straightforward: it gets the connectionId and PUTs an item containing it to the Dynamo table, along with the expiry field that is configured as the TTL:</p><pre>import { DynamoDBClient } from &#39;@aws-sdk/client-dynamodb&#39;;<br>import { DynamoDBDocumentClient, PutCommand } from &#39;@aws-sdk/lib-dynamodb&#39;;<br><br>const client = new DynamoDBClient({});<br>const documentClient = DynamoDBDocumentClient.from(client);<br>const TableName = &#39;cleckheaton-cc-live-score-connections&#39;;<br>export const handler = async event =&gt; {<br>  const { connectionId } = event.requestContext;<br>  await documentClient.send(<br>    new PutCommand({<br>      TableName,<br>      Item: {<br>        connectionId,<br>        expiry: Math.floor(Date.now() / 1000) + 24 * 60 * 60,<br>      },<br>    }),<br>  );<br>  return { statusCode: 200 };<br>};</pre><h3>Disconnecting</h3><p>Similarly, I added a $disconnect route and an integration with a Lambda called socket-disconnect:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yr8fmUlDDi3IcW8rANBJkw.png" /></figure><p>The socket-disconnect Lambda also receives an event with requestContext.connectionId as a parameter. 
This function needs to delete the record from the Dynamo table:</p><pre>import { DynamoDBClient } from &#39;@aws-sdk/client-dynamodb&#39;;<br>import { DynamoDBDocumentClient, DeleteCommand } from &#39;@aws-sdk/lib-dynamodb&#39;;<br><br>const client = new DynamoDBClient({});<br>const documentClient = DynamoDBDocumentClient.from(client);<br>const TableName = &#39;cleckheaton-cc-live-score-connections&#39;;<br>export const handler = async event =&gt; {<br>  const { connectionId } = event.requestContext;<br>  await documentClient.send(new DeleteCommand({ TableName, Key: { connectionId } }));<br>  return { statusCode: 200 };<br>};</pre><h3>Sending</h3><p>When the updated scorecard JSON is received from the SNS topic it needs to be sent to every connection stored in the Dynamo table. For the foreseeable future, a single Scan should be sufficient to get all the data although if traffic increases significantly then this may not scale too well and will have to be looked at again.</p><p>Once all the connections have been retrieved then the JSON is sent to each of them:</p><pre>import { ApiGatewayManagementApiClient, PostToConnectionCommand } from &#39;@aws-sdk/client-apigatewaymanagementapi&#39;;<br>import { Scorecard } from &#39;@cleckheaton-ccc-live-scores/schema&#39;;<br><br>const apiGatewayClient = new ApiGatewayManagementApiClient({ region: &#39;eu-west-2&#39;, endpoint: `${process.env.SOCKET_ENDPOINT}` });<br>const sendScorecard = (scorecard: Scorecard) =&gt; async (connectionId: string) =&gt; {<br>  const command = new PostToConnectionCommand({<br>    ConnectionId: connectionId,<br>    Data: Buffer.from(JSON.stringify(scorecard)),<br>  });<br>  return apiGatewayClient.send(command);<br>};</pre><p>The architecture looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/579/1*XnDZm8lHY83ncr8H33G63A.png" /></figure><h3>Connecting from the web client</h3><p>To connect to the socket from the web client I create a new instance of the <a 
href="https://developer.mozilla.org/en-US/docs/Web/API/WebSocket">WebSocket class</a> passing the URL of the API to the constructor.</p><p>To receive events from the socket use the addEventListener method passing the name of the event and a listener function. The message event is the important one for receiving the latest scorecard and in the handler for that event I <a href="https://developer.mozilla.org/en-US/docs/Web/API/EventTarget/dispatchEvent">dispatch</a> a <a href="https://developer.mozilla.org/en-US/docs/Web/API/CustomEvent">custom event</a> - this event has the name of the team the update refers to and the detail property is the JSON received in the message. These events can then be handled on any page that needs an up-to-date score:</p><pre>const teamEventName = {<br>  firstTeam: &#39;firstTeamScoreUpdate&#39;,<br>  secondTeam: &#39;secondTeamScoreUpdate&#39;,<br>};<br><br>const registerForScorecardUpdates = () =&gt; {<br>  const socket = new WebSocket(`${window.ENV.UPDATES_WEB_SOCKET_URL}`);<br>  socket.addEventListener(&#39;open&#39;, () =&gt; {<br>    console.log(&#39;Connected to updates web socket&#39;);<br>  });<br>  socket.addEventListener(&#39;message&#39;, event =&gt; {<br>    console.log(&#39;received&#39;, event.data);<br>    const scorecard = JSON.parse(event.data);<br>    window.dispatchEvent(new CustomEvent(teamEventName[scorecard.teamName as &#39;firstTeam&#39; | &#39;secondTeam&#39;], { detail: scorecard }));<br>  });<br>};<br>export { registerForScorecardUpdates };</pre><p>The code for this article can be found on <a href="https://github.com/ChrisDobby/live-scores/tree/blog-part-3"><strong>this branch</strong></a> and the full service is <a href="https://github.com/ChrisDobby/live-scores"><strong>here</strong></a>.</p><h3>Coming next</h3><p>Part 4 will show how the updates are pushed to <a href="https://developer.mozilla.org/en-US/docs/Web/API/Push_API"><strong>Web Push</strong></a>.</p><img 
src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=0846e9abfebf" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Keeping supporters up to date using AWS Events and Web Push (part 2)]]></title>
            <link>https://medium.com/@chrd/keeping-supporters-up-to-date-using-aws-events-and-web-push-part-2-f8d93c356c41?source=rss-b0df41a007b6------2</link>
            <guid isPermaLink="false">https://medium.com/p/f8d93c356c41</guid>
            <category><![CDATA[event-driven-architecture]]></category>
            <category><![CDATA[cricket]]></category>
            <category><![CDATA[lambda-function]]></category>
            <category><![CDATA[web-push-notifications]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Chris Dobson]]></dc:creator>
            <pubDate>Sun, 26 Jan 2025 19:27:50 GMT</pubDate>
            <atom:updated>2025-01-26T19:27:50.034Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JA0ejw7JW_WfC9OaK7RHEw.png" /></figure><p><em>Parsing the HTML and using an AWS fan-out architecture to publish the results</em></p><p><em>This is part 2 of a series of articles showing how I built a solution to keep supporters of a sports club up to date using an event-driven architecture in AWS and </em><a href="https://developer.mozilla.org/en-US/docs/Web/API/Push_API"><em>Web Push</em></a><em> notifications.</em></p><p><em>The first part can be found </em><a href="https://chrisdobby.dev/keeping-supporters-up-to-date-using-aws-events-and-web-push-part-1-98bdb8b1a29e"><em>here</em></a><em>.</em></p><h3>Goals</h3><ul><li>Keep supporters up to date as much as possible during the games</li><li>Use an event-driven architecture in AWS</li><li>Require no manual intervention (I’ve better things to do at the weekend)</li><li>Minimal cost as it’s running in my AWS account</li></ul><h3>Part 2</h3><p>Now that I have the raw HTML for the current score on an SQS queue, it needs parsing to create a JSON representation of the scorecard, which will be used to make several updates:</p><ul><li>JSON object in an S3 bucket</li><li>WebSockets</li><li><a href="https://developer.mozilla.org/en-US/docs/Web/API/Push_API">Web Push</a> notifications</li><li>Update the result in the CMS when the game is over</li><li>Tear down the EC2 instance when the game is over</li><li>In the near future create a <a href="https://openai.com/blog/chatgpt">ChatGPT</a> match report when the game is over</li></ul><p>This article will deal with updating the S3 bucket and the game-over updates — WebSockets and <a href="https://developer.mozilla.org/en-US/docs/Web/API/Push_API">Web Push</a> will be covered in parts 3 &amp; 4.</p><h3>Creating the scorecard JSON</h3><p>To create the scorecard I wrote a Lambda, create-scorecard, that reads the HTML from the SQS queue and uses Cheerio to parse it and create an object. 
Once the object has been created, it is validated using <a href="https://github.com/colinhacks/zod">Zod</a> against this schema:</p><pre>import { z } from &#39;zod&#39;;<br><br>const BowlingFiguresSchema = z.object({<br>  name: z.string(),<br>  overs: z.string(),<br>  maidens: z.string(),<br>  runs: z.string(),<br>  wickets: z.string(),<br>  wides: z.string(),<br>  noBalls: z.string(),<br>  economyRate: z.string(),<br>});<br>const PlayerInningsSchema = z.object({<br>  name: z.string(),<br>  runs: z.string(),<br>  balls: z.string(),<br>  minutes: z.string(),<br>  fours: z.string(),<br>  sixes: z.string(),<br>  strikeRate: z.string(),<br>  howout: z.array(z.string()),<br>});<br>const InningsSchema = z.object({<br>  batting: z.object({<br>    innings: z.array(PlayerInningsSchema),<br>    extras: z.string(),<br>    total: z.string(),<br>    team: z.string(),<br>  }),<br>  fallOfWickets: z.string(),<br>  bowling: z.array(BowlingFiguresSchema),<br>});<br>const ScorecardSchema = z.object({<br>  url: z.string(),<br>  teamName: z.string(),<br>  result: z.string().nullable(),<br>  innings: z.array(InningsSchema),<br>});</pre><p>This schema can then be used by any downstream service to validate that the message received is a valid scorecard.</p><h3>Making the updates</h3><p>Once the JSON had been created the various updates needed to be made. This could easily be done from the same Lambda that creates it; however, I like my Lambda functions to adhere to the <a href="https://en.wikipedia.org/wiki/Single-responsibility_principle">Single Responsibility Principle</a> as much as I can and only perform one function. This means creating a single Lambda to perform each update (update S3, tear down the EC2 instance, update the CMS) and passing the JSON to each one.</p><p><a href="https://aws.amazon.com/sns/">AWS SNS</a> allows the JSON to be published to a topic and multiple Lambdas (amongst other services) to subscribe to the topic. 
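</p><p>As a sketch of what each subscriber sees, the scorecard arrives as a JSON string inside the standard Lambda SNS event envelope (the field names below follow that standard shape; the helper itself is hypothetical):</p>

```typescript
// The SNS envelope delivered to a subscribing Lambda: the published JSON
// is a string in Records[n].Sns.Message and needs parsing before use.
type SnsEvent = { Records: { Sns: { Message: string } }[] };

const scorecardsFromEvent = (event: SnsEvent) =>
  event.Records.map(record => JSON.parse(record.Sns.Message));
```

<p>In the real handlers the parsed object would then be validated against the Zod schema before any update is made.</p><p>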
So I created a scorecard-updated topic and added a subscription for each of the Lambda functions, and the create-scorecard Lambda publishes the JSON to the topic.</p><p>Each Lambda validates the incoming message using the <a href="https://github.com/colinhacks/zod">Zod</a> schema defined in the previous section in case rogue messages have been published to the topic.</p><p><strong><em>Update S3</em></strong></p><p>The update-bucket Lambda updates the S3 object: it simply creates a key based on the team and the match date and PUTs the object using that key. The PUT operation will either create a new object or overwrite an existing one.</p><p><strong><em>Update CMS</em></strong></p><p>The update-sanity Lambda checks if the match has a result and, if it does, makes an update to <a href="https://www.sanity.io/">Sanity</a>. The fixtures should already exist in <a href="https://www.sanity.io/">Sanity</a>, so first it makes a query to find the fixture based on team and date and, if one is found, makes a PATCH request to set the fixture’s result.</p><p><strong><em>Teardown EC2 Instance</em></strong></p><p>The update-processors Lambda also checks if the match has a result and, if it does, checks if the EC2 instance needs terminating. When the EC2 instance is created two tags are added: Owner, which is set to cleckheaton-cc, and InProgress, which is a count of the number of matches currently being processed by the instance.</p><p>This Lambda finds the instance with an Owner of cleckheaton-cc and checks the InProgress count; if the count is 1 then, as the match has now been completed, the instance can be terminated. 
If the count is &gt; 1 then the count is decreased by 1 and the tag is updated.</p><p>There is still a Lambda that runs at 9 pm to terminate any instances left running, which will catch anything missed by this process, but terminating earlier may save the odd cent here and there!</p><p>The components added so far look like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/580/1*w2j5p8jCXtU_DYSNlY7A3g.png" /></figure><h3>Refactor</h3><p>Of the three Lambda functions I just added, two only make updates if the match is complete (has a result). This made me think that it might be worth adding a new SNS topic that publishes only matches that are complete, which the update-processors and update-sanity Lambdas subscribe to instead.</p><p>So I created a new Lambda, game-over, that subscribes to the scorecard-updated topic, checks whether the match has a result and, if it does, publishes to the game-over topic. This check can then be removed from update-processors and update-sanity, which now subscribe to the game-over topic.</p><p>The architecture now looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/632/1*D1qJn8n_kntOsxcOwTynmw.png" /></figure><p>The code for this article can be found on <a href="https://github.com/ChrisDobby/live-scores/tree/blog-part-2">this branch</a> and the full service is <a href="https://github.com/ChrisDobby/live-scores">here</a>.</p><h3>Coming next</h3><p>Part 3 will show how the updates are pushed to WebSockets.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f8d93c356c41" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Keeping supporters up to date using AWS Events and Web Push (part 1)]]></title>
            <link>https://medium.com/@chrd/keeping-supporters-up-to-date-using-aws-events-and-web-push-part-1-98bdb8b1a29e?source=rss-b0df41a007b6------2</link>
            <guid isPermaLink="false">https://medium.com/p/98bdb8b1a29e</guid>
            <category><![CDATA[event-driven-architecture]]></category>
            <category><![CDATA[lambda-function]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[cricket]]></category>
            <dc:creator><![CDATA[Chris Dobson]]></dc:creator>
            <pubDate>Wed, 22 Jan 2025 14:13:28 GMT</pubDate>
            <atom:updated>2025-01-22T14:13:28.932Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JA0ejw7JW_WfC9OaK7RHEw.png" /></figure><p><em>How I built a solution to keep supporters of a sports club up to date using an event-driven architecture in AWS and Web Push</em></p><p>When I was asked to create a new website for the cricket club I played for, I figured it would be something I could use to experiment with. Given that my UI skills leave a lot to be desired I didn’t want to do much on the front end, so I decided to include up-to-date live scores for both our teams every weekend and allow users to subscribe to receive updates via <a href="https://developer.mozilla.org/en-US/docs/Web/API/Push_API">Web Push</a> to their device during the games.</p><h3>Goals</h3><ul><li>Keep supporters up to date as much as possible during the games</li><li>Use an event-driven architecture in AWS</li><li>Require no manual intervention (I’ve better things to do at the weekend)</li><li>Minimal cost as it’s running in my AWS account</li></ul><h3>Part 1</h3><p>The first step was to find the raw data and make it available. The live scorecards for each game in the league are available on separate web pages, so I needed to find the pages for games involving my club and scrape the latest HTML from them.</p><h3>Finding the scorecards</h3><p>Some interaction with the page was needed before the links were available, so <a href="https://github.com/puppeteer/puppeteer/tree/main#readme">Puppeteer</a> seemed like the best choice to open it up.</p><p>It made sense to run this process as a Lambda as it should run in under a minute and only once. This sounded fine until trying to deploy to AWS — to cut a long story short, <a href="https://github.com/puppeteer/puppeteer/tree/main#readme">Puppeteer</a> needs Chrome binaries, and once bundled into a zip file the whole thing is far larger than the 50Mb limit for Lambda. 
So I ended up creating a lambda layer from <a href="https://www.npmjs.com/package/chrome-aws-lambda">chrome-aws-lambda</a> (which is just under 50Mb), removing it from the bundle and referencing it through the layer.</p><p>The URLs are stored in Dynamo and the Lambda is invoked by an EventBridge rule every 30 minutes between 12.00 and 14.00 — no games will start before midday so there’s no point running it earlier and it runs more than once in case there’s been a problem with the page. With the URLs being stored in Dynamo it’s easy to check if they already exist at the beginning of the function and return early if they do. Also, these URLs will not be required for any more than around 9 hours so a TTL is added for 24 hours to keep the table clean.</p><h3>Reading the raw data</h3><p>Once I have a URL for a game I then need to navigate to the correct tab for the scorecard, extract the relevant HTML, and do something with it.</p><p>Again <a href="https://github.com/puppeteer/puppeteer/tree/main#readme">Puppeteer</a> turned out to be my friend except this time it made no sense running as a Lambda. 
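</p><p>Stripped of the Puppeteer and SQS details, the core of that long-running process is a poll-and-diff loop. A hypothetical sketch, with scrape and enqueue standing in for the real scraping and queueing:</p>

```typescript
// Returns a function that, each time it is called, scrapes the page and
// enqueues the HTML only when it differs from the previous scrape.
const makePoller = (scrape: () => string, enqueue: (html: string) => void) => {
  let lastHtml = "";
  return () => {
    const html = scrape();
    if (html === lastHtml) {
      return false; // nothing changed, nothing queued
    }
    lastHtml = html;
    enqueue(html);
    return true;
  };
};
```

<p>The real process runs a loop like this on a timer inside the EC2 instance so that unchanged HTML never reaches the queue.</p><p>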
I wanted to get the scores every 5 minutes at the very least, ideally every 20 seconds, to be as up-to-date as possible so it was far more sensible to run the process inside an EC2 instance.</p><p>Because the page updates itself, presumably via a WebSocket, I can open up the webpage, navigate to the scorecard tab, every 20 seconds scrape the HTML, and if it’s changed put it on an SQS for processing downstream.</p><h3>Putting it together</h3><p>Once the URLs have been added to Dynamo I needed to create an EC2 instance that, when started, would clone the git repo, install dependencies and invoke a process for each of the games that are in progress.</p><p>Enabling streams on the Dynamo table and invoking a Lambda function is probably the best choice for this.</p><p>Once the lambda receives an insert event from the stream it creates an EC2 instance using the AWS SDK and sets the userdata of the instance to install git, install node, clone the repo, run npm ci and invoke a process for each game that is being played.</p><pre>const USER_DATA = `#!/bin/bash<br>yum update -y<br>yum install -y git<br>yum install -y pango.x86_64 libXcomposite.x86_64 libXcursor.x86_64 libXdamage.x86_64 libXext.x86_64 libXi.x86_64 libXtst.x86_64 cups-libs.x86_64 libXScrnSaver.x86_64 libXrandr.x86_64 GConf2.x86_64 alsa-lib.x86_64 atk.x86_64 gtk3.x86_64 ipa-gothic-fonts xorg-x11-fonts-100dpi xorg-x11-fonts-75dpi xorg-x11-utils xorg-x11-fonts-cyrillic xorg-x11-fonts-Type1 xorg-x11-fonts-misc<br>curl -sL &lt;https://rpm.nodesource.com/setup_16.x&gt; | sudo bash -<br>yum install -y nodejs<br>git clone &lt;https://github.com/ChrisDobby/cleckheaton-cc.git&gt;<br>cd cleckheaton-cc/live-scores/scorecard-processor<br>npm ci<br>`;<br><br>const getStartCommand = ({ teamName, scorecardUrl }: ScorecardUrl) =&gt; `npm start ${scorecardUrl} ${process.env.PROCESSOR_QUEUE_URL} ${teamName}`;<br>const createInstance = (scorecardUrls: ScorecardUrl[]) =&gt; {<br>  const userData = `${USER_DATA} 
${scorecardUrls.map(getStartCommand).join(&#39; &amp; &#39;)}`;<br>  const command = new RunInstancesCommand({<br>    ImageId: &#39;ami-0d729d2846a86a9e7&#39;,<br>    InstanceType: &#39;t2.micro&#39;,<br>    MaxCount: 1,<br>    MinCount: 1,<br>    KeyName: &#39;test-processor&#39;,<br>    SecurityGroupIds: [process.env.PROCESSOR_SG_ID as string],<br>    IamInstanceProfile: { Arn: process.env.PROCESSOR_PROFILE_ARN },<br>    UserData: Buffer.from(userData).toString(&#39;base64&#39;),<br>    TagSpecifications: [<br>      {<br>        ResourceType: &#39;instance&#39;,<br>        Tags: [<br>          { Key: &#39;Owner&#39;, Value: &#39;cleckheaton-cc&#39; },<br>        ],<br>      },<br>    ],<br>  });<br>  return client.send(command);<br>};</pre><p>Originally I had an EC2 instance per game but, obviously, sharing a single instance across the games will halve the cost!</p><p>On the subject of cost, I now have an EC2 instance running and, currently, the only way of terminating it is manually. Instead of relying on myself to both remember to terminate it and to have access to the AWS console to do it every weekend, I added another Lambda function that would search for any instances with the Owner tag set to cleckheaton-cc and terminate them. 
I can happily invoke this from an EventBridge rule scheduled at 9 pm as it’s very unlikely a game will be going on after that time.</p><pre>export const handler = async () =&gt; {<br>  const command = new DescribeInstancesCommand({<br>    Filters: [{ Name: &#39;tag:Owner&#39;, Values: [&#39;cleckheaton-cc&#39;] }],<br>  });<br><br>  const instances = await ec2Client.send(command);<br>  // Reservations may be missing entirely when no instances match<br>  const instanceIds = (instances.Reservations ?? []).flatMap(({ Instances }) =&gt;<br>    Instances?.map(({ InstanceId }) =&gt; InstanceId)<br>  ).filter(Boolean) as string[];<br>  // TerminateInstances errors if called with an empty id list<br>  if (!instanceIds.length) {<br>    return;<br>  }<br><br>  const terminateCommand = new TerminateInstancesCommand({<br>    InstanceIds: instanceIds,<br>  });<br>  await ec2Client.send(terminateCommand);<br>};</pre><p>So at this point, I’m finding the matches that are being played on a particular day, reading the scorecard HTML for each game every 20 seconds and putting it onto an SQS queue for processing downstream. Also, everything is being cleaned up — the scorecard URLs will be deleted from the table and the EC2 instance will be terminated.</p><p>The architecture looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/530/1*VHmSbtGRySX2Mn4lkMvu9w.png" /></figure><p>The code for this article can be found on <a href="https://github.com/ChrisDobby/live-scores/tree/blog-part-1">this branch</a> and the full service is <a href="https://github.com/ChrisDobby/live-scores">here</a>.</p><h3>Coming next</h3><p>Part 2 will show how the HTML is processed downstream and the fan-out pattern is used to make several updates.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=98bdb8b1a29e" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An adventure in testing Step Functions part 2]]></title>
            <link>https://medium.com/@chrd/an-adventure-in-testing-step-functions-part-2-8f521a68a019?source=rss-b0df41a007b6------2</link>
            <guid isPermaLink="false">https://medium.com/p/8f521a68a019</guid>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[aws-step-functions]]></category>
            <category><![CDATA[serverless]]></category>
            <category><![CDATA[step-functions]]></category>
            <category><![CDATA[testing]]></category>
            <dc:creator><![CDATA[Chris Dobson]]></dc:creator>
            <pubDate>Sun, 08 Dec 2024 20:11:13 GMT</pubDate>
            <atom:updated>2024-12-08T20:11:13.678Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/694/1*mlMZwUSDmxcqMGgq2_WhnA.png" /></figure><p><em>Time travel and mocking.</em></p><p>In <a href="https://medium.com/@chrd/an-adventure-in-testing-step-functions-f26e1246d03d">part 1</a> I looked at testing a Step Function through the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/test-state-isolation.html">TestState API</a> without having to deploy anything. I ended up with a fairly simple library that could test a full state machine that doesn’t include parallel and map states. Next I looked at how I could add features that make testing Step Functions easier…</p><h3><strong>Wait states</strong></h3><p>One of the challenges with testing Step Functions is when the state machine includes a <a href="https://docs.aws.amazon.com/step-functions/latest/dg/state-wait.html">Wait state</a>. Many workflows include a requirement to wait for minutes, hours, days, or weeks, and in order to test the deployed state machine it is often necessary to pass in a flag to indicate that this is a test and that the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/state-wait.html">Wait state</a> should not actually wait. What I really need is a flux capacitor!</p><p>In the absence of <a href="https://en.wikipedia.org/wiki/Emmett_Brown">Doc Brown</a> though, testing through the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/test-state-isolation.html">TestState API</a> allows me to transform any of the states relatively easily. 
So a tiny bit of code that checks if the state is a <a href="https://docs.aws.amazon.com/step-functions/latest/dg/state-wait.html">Wait state</a> and, if it is, sets the number of seconds to 0 proves that time travel is actually possible:</p><pre>  switch (stateDefinition.Type) {<br>    case &quot;Wait&quot;:<br>      return { ...stateDefinition, Seconds: 0 }<br>    default:<br>      return stateDefinition<br>  }</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/173/1*2alFaFHx6bnYPpSM01hmOw.png" /><figcaption>Who needs a flux capacitor when I’ve got a switch statement?</figcaption></figure><h3><strong>Mocking a full state</strong></h3><p>A challenge in testing any code that interacts with another service is that I may not be able to control the other service and, therefore, cannot cover all cases that I may want to test. Also, many workflows will involve reading or writing records to a database — often in this case I can add test records into the database and clean up afterwards, but it’s not unheard of for inserting a record to start off a new workflow. For these and other instances I will often want to mock a result without making calls to the service.</p><p>Using the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/test-state-isolation.html">TestState API</a> to mock single states that are not deployed makes mocking a full state relatively easy. I have implemented an interface similar to the one used by <a href="https://jestjs.io/">jest</a> and <a href="https://vitest.dev/">vitest</a>, where the test code can set up a mock for a state. 
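</p><p>The registration half that feeds the runner can be as simple as a Map from state name to a queue of mocks. A sketch (the shape of each mock mirrors the fields read by the mockedResult code below):</p>

```typescript
// mockState queues a mock result against a state name; the test runner
// looks the queue up by state name when it reaches that state.
type StateMock = { output?: any; error?: string; nextState?: string };
const stateMocks = new Map();

const mockState = (stateName: string, mock: StateMock) => {
  const queued = stateMocks.get(stateName) || [];
  stateMocks.set(stateName, [...queued, mock]);
};
```

<p>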
For instance, this sets up a mock output for the test-state state:</p><pre>mockState(&quot;test-state&quot;, { output: { description: &quot;This output is mocked!&quot; } })</pre><p>The test runner will then check for any registered mocks for a state and, if there are, rather than calling the API to execute the state it will return the result supplied when the mock was set up:</p><pre>export const mockedResult = (stateName: string, nextState?: string) =&gt; {<br>  const mocks = stateMocks.get(stateName)<br><br>  if (!mocks) {<br>    return null<br>  }<br><br>  const [mock] = mocks<br><br>  return {<br>    error: mock.error,<br>    status: mock.error ? (&quot;FAILED&quot; as const) : (&quot;SUCCEEDED&quot; as const),<br>    nextState: mock.nextState || nextState,<br>    output: mock.output,<br>  }<br>}</pre><p>In this way I am able to bypass calling any services that could be a problem to use.</p><h3><strong>Mock service responses</strong></h3><p>Bypassing the execution of a whole state is all well and good, but a state will often include steps to transform responses that are returned from a service. If the full state is being mocked then these transforms cannot be tested. What I needed to do was to execute the state but return a specified response from the task that is called. Neither the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/test-state-isolation.html">TestState API</a> nor the Step Functions service can do this.</p><p>In the absence of a native way of doing this, some creative thinking was required — some may say hacking, but I’m sticking with my description.</p><p>As with the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/state-wait.html">Wait state</a> I have taken advantage of the fact that the state isn’t deployed anywhere and can be transformed in a test. 
In this case I deployed a <a href="https://docs.aws.amazon.com/lambda/latest/dg/welcome.html">Lambda Function</a> that simply returns whatever is passed in:</p><pre>export const handler = async ({ mockResult }) =&gt; {<br>  return mockResult;<br>};</pre><p>In the same way as when mocking a full state the test sets up the service responses for specific states.</p><p>This sets up a response for an HTTP task:</p><pre>mockResponse(&quot;test-state&quot;, { response: {<br>    ResponseBody: {<br>      name: &quot;Chris Dobson&quot;,<br>    },<br>    StatusCode: 200,<br>    StatusText: &quot;OK&quot;,<br>  }<br>})</pre><p>This sets up a response for a <a href="https://docs.aws.amazon.com/step-functions/latest/dg/connect-ddb.html">DynamoDB GetItem</a> task:</p><pre>mockResponse(&quot;test-state&quot;, { response: {<br>    Item: {<br>      id: {<br>        S: &quot;user-1&quot;,<br>      },<br>      name: {<br>        S: &quot;Chris Dobson&quot;,<br>      },<br>    },<br>  }<br>})</pre><p>When the test runner is required to mock a response the state is transformed into a Lambda task with a target of the function created above:</p><pre>export const transformState = (stateName: string, stateDefinition: TestSingleStateInput[&quot;stateDefinition&quot;]) =&gt; {<br>  const mocks = responseMocks.get(stateName)<br>  if (!mocks || !mocks.length) {<br>    return stateDefinition<br>  }<br><br>  const [mock] = mocks<br><br>  return {<br>    ...stateDefinition,<br>    ...updateOutputs(stateDefinition, mock.response),<br>    Resource: &quot;arn:aws:states:::lambda:invoke&quot;,<br>    Parameters: {<br>      FunctionName: mockFunctionName,<br>      Payload: { mockResult: mock.response },<br>    },<br>  }<br>}</pre><p>This then will return the required response when the state is executed through the API and tests the rest of the state.</p><p>The library includes functions that will create and delete the step function that can be called before and after testing.</p><h3><strong>The 
library</strong></h3><p>The library I’ve created is called step-by-step (any suggestions for a better name gratefully received 😀) and can be found on <a href="https://github.com/chrisDobby/step-by-step">GitHub</a> and <a href="https://www.npmjs.com/package/@chrisdobby/step-by-step">npm</a>.</p><h3><strong>Next</strong></h3><p>In the next post in this series I will look at supporting <a href="https://docs.aws.amazon.com/step-functions/latest/dg/state-map.html">Map</a> and <a href="https://docs.aws.amazon.com/step-functions/latest/dg/state-parallel.html">Parallel</a> states and testing failure.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8f521a68a019" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An adventure in testing Step Functions]]></title>
            <link>https://awstip.com/an-adventure-in-testing-step-functions-f26e1246d03d?source=rss-b0df41a007b6------2</link>
            <guid isPermaLink="false">https://medium.com/p/f26e1246d03d</guid>
            <category><![CDATA[testing]]></category>
            <category><![CDATA[step-functions]]></category>
            <category><![CDATA[serverless]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[aws-step-functions]]></category>
            <dc:creator><![CDATA[Chris Dobson]]></dc:creator>
            <pubDate>Sat, 12 Oct 2024 15:02:17 GMT</pubDate>
            <atom:updated>2024-12-09T13:35:57.545Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="A Step Functions state machine" src="https://cdn-images-1.medium.com/max/1004/1*WjqN5QVWTRsAx2sLQbGxzw.png" /></figure><p><em>Is it possible to get good test coverage of a Step Function? I am attempting to create a library using the TestState API that will allow me to test my Step Functions more thoroughly.</em></p><p>Recently I needed to debug a Step Function and found myself:</p><ul><li>editing the state in the AWS console</li><li>hitting the Test state button to open the test window</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1004/1*HrxMvNkhiSaExB6sp-2PSg.png" /></figure><ul><li>setting the inputs, executing the state</li><li>copying the output and next state that was returned</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1Fph0PDP4MWWcaMTsaqimg.png" /></figure><ul><li>going back to the function and finding the next state</li><li>hitting the Test state button</li><li>pasting the output as the input and so on….</li></ul><p>Obviously, as a developer, I figured: why spend 15 minutes manually doing this when I could spend a much more interesting 90 minutes writing some code to automate it? 😀</p><p>So I set about using the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/test-state-isolation.html">TestState API</a> to automate the process and noticed something I didn’t expect — I’d assumed I’d need to pass the state machine ARN and the name of the state to test, but the API accepts the state as JSON instead, meaning that the function doesn’t need to be deployed in order to test a single state.</p><p>💭 Given that the API returns both the output and the next state to be executed, I figured it ought to be possible to write some code that accepts the full function definition and an initial input and executes each state. 
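</p><p>That idea boils down to a small loop. A hypothetical sketch, in which testState stands in for a call to the TestState API so the loop itself is runnable anywhere:</p>

```typescript
// Walk the state machine from StartAt, executing one state at a time and
// feeding each state's output in as the next state's input, until the
// API reports no next state (a terminal state has run).
const runStateMachine = async (
  definition: { StartAt: string },
  initialInput: any,
  testState: (stateName: string, input: any) => any,
) => {
  let stateName: string | undefined = definition.StartAt;
  let input = initialInput;
  while (stateName) {
    const result = await testState(stateName, input);
    input = result.output;
    stateName = result.nextState; // undefined once a terminal state runs
  }
  return input;
};
```

<p>Each state’s output becomes the next state’s input, exactly mirroring the manual copy-and-paste loop above.</p><p>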
Occasionally it can also be desirable to execute a subset of the states in a function, and by writing code using the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/test-state-isolation.html">TestState API</a> this also ought to be possible.</p><p>Currently, testing a full state machine can be tricky, and I find the approach tends to depend on what the function is doing, which isn’t ideal. The two main options seem to be:</p><h3><strong>Deploy to AWS</strong></h3><p>Deploying to AWS ensures that the test runs in the actual environment it will run in in production and is generally good for testing the ‘happy path’. Depending on the AWS environment deployed to, you may need to clean up after yourself if some of the side effects of the function create data; if you are able to test using an <a href="https://www.youtube.com/watch?v=JO0arpbkh5w">ephemeral environment</a> then this shouldn’t be a problem.</p><p>Testing some paths, especially errors, can be difficult — for instance, influencing the response of a third-party API is impossible, and making an AWS service with, say, five 9s uptime fail relies on running the test during the roughly five minutes a year it may be down. If I’m unit testing some code these responses can be mocked but, obviously, that can’t be done in a real AWS environment.</p><h3>Deploy locally</h3><p>I find <a href="https://app.localstack.cloud">Localstack</a> an invaluable tool when developing, and some of its features can also be applied to testing Step Functions. For instance, unlike in the real AWS environment, service responses can be mocked to help get to those hard-to-reach states. 
A disadvantage, though, is that however good it is (and it is very good) it isn’t an actual AWS environment, so it could potentially contain bugs (different bugs to the real AWS environment) and may not keep completely up to date with AWS — as we know, the cadence at which AWS introduces new features can be very high, and even though they do a fantastic job, expecting <a href="https://app.localstack.cloud">Localstack</a> to keep up is a bit much!</p><p>Both of these options have advantages and disadvantages, but a couple of things can’t be done using either.</p><ul><li>Firstly, neither has an option to ‘time travel’ — by which I mean that if a Wait state is used there is no option to skip the time, so if I have a state that waits for a month then I have to wait for a month. This can be mitigated by passing the time to wait as a parameter to the function, but that means including code specifically for testing, which, while it sometimes can’t be avoided, I prefer to avoid if I can.</li><li>Secondly, they both involve deploying somewhere, which may make it difficult, or at the very least time-consuming, to include in a CI pipeline.</li></ul><h3>Could an approach using the TestState API improve things?</h3><p>I think the approach I stumbled across when automating my debugging process could potentially improve a few things:</p><ul><li>Using the <a href="https://docs.aws.amazon.com/step-functions/latest/dg/test-state-isolation.html">TestState API</a> doesn’t require the state machine to be deployed</li><li>Wait states can be handled by the code executing the test setting the delay, rather than by including test-specific code in the function</li><li>Whole states and service responses can be mocked in the code executing the test</li></ul><p>It seemed to me that it might be worth creating a library that uses these ideas and seeing if it is, in fact, of any use. 
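</p><p>As a rough sketch of the time travel idea (illustrative code of my own, not part of any existing library), a Wait state passes its input through unchanged, so the code executing the test could scale or skip the delay itself:</p>

```typescript
// Hypothetical sketch: the test runner intercepts Wait states so that a
// month-long wait doesn't hold up the test run.
type WaitState = { Type: "Wait"; Seconds?: number; Next?: string };

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// waitScale = 0 skips delays entirely; waitScale = 1 waits in real time.
const runWaitState = async (
  state: WaitState,
  input: unknown,
  waitScale = 0
) => {
  await sleep((state.Seconds ?? 0) * 1000 * waitScale);
  // A Wait state's output is its (unchanged) input.
  return { output: input, nextState: state.Next };
};
```

<p>Whole-state mocking could follow a similar pattern: before executing a state, the runner checks a map of state names to canned results and short-circuits if one is found.</p><p>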
So I’m creating a library (TypeScript/JavaScript, as that is what I use most) and writing about it as I go.</p><p>Initially I implemented execution of a single state and of a full state machine without any of the bells and whistles discussed above — they are coming soon!</p><h3>Executing a single state</h3><p>Execution of a single state was just a question of creating a wrapper around the <a href="https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/client/sfn/">AWS SDK</a>:</p><pre>const result = await client.send(<br>  new TestStateCommand({<br>    definition: JSON.stringify(stateDefinition),<br>    roleArn: process.env.AWS_ROLE_ARN!,<br>    input: JSON.stringify(input),<br>  })<br>)<br><br>return {<br>  error: result.error ? { message: result.error!, cause: result.cause! } : undefined,<br>  status: result.status,<br>  nextState: result.nextState,<br>  output: result.output ? JSON.parse(result.output) : undefined,<br>}</pre><h3>Executing a full state machine</h3><p>Each state machine definition includes a StartAt property which, as the name suggests, is the name of the first state to be executed, and states that complete the state machine have a property of End: true. So executing a full state machine involves finding the first state to execute and recursively testing each state until hitting a state that completes the function. I also keep a record of the ‘call stack’ of states:</p><pre>const execute = async ({<br>  functionDefinition,<br>  input,<br>  state,<br>  stack = [],<br>}: TestFunctionInput &amp; {<br>  state: string<br>  stack?: TestFunctionOutput[&quot;stack&quot;]<br>}): Promise&lt;TestFunctionOutput&gt; =&gt; {<br>  const stateDefinition = functionDefinition.States[state]<br>  const result = await testSingleState({ stateDefinition, input })<br>  const updatedStack = [...stack, { ...result, stateName: state }]<br><br>  return stateDefinition.End<br>    ? 
{ ...result, stack: updatedStack }<br>    : execute({ functionDefinition, input: result.output, state: result.nextState!, stack: updatedStack })<br>}</pre><h3>The library</h3><p>The library I’ve created is called step-by-step (any suggestions for a better name gratefully received 😀) and can be found on <a href="https://github.com/chrisDobby/step-by-step">GitHub</a> and <a href="https://www.npmjs.com/package/@chrisdobby/step-by-step">npm</a>.</p><h3>Next</h3><p><a href="https://medium.com/@chrd/an-adventure-in-testing-step-functions-part-2-8f521a68a019">Next</a> I will be adding time travel support for the Wait state and investigating how I might handle mocking…</p><hr><p><a href="https://awstip.com/an-adventure-in-testing-step-functions-f26e1246d03d">An adventure in testing Step Functions</a> was originally published in <a href="https://awstip.com">AWS Tip</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>