Flow by Hamish Weir

I am building a product that runs on AWS. I am currently solo and that means I do it all: development, delivery, and operations. The main reason I built the application using serverless technology was to offload as much of the operational workload to AWS as possible.

An interesting thing has happened on my journey with Serverless:

I integrated DevOps principles into my application delivery workflow like never before. Today I’d like to share some of that story, which should also shed some light on why I’ve been a bit quiet.

The k9 Security application performs a detailed analysis of who has access to what data in AWS. The application does the bulk of this work overnight and delivers its output to customers in time for a morning review.

My daily routine is:

  1. Wake up and check work email for alerts that something needs attention, 30 seconds
  2. Get ready for the day
  3. Review operational dashboard for issues that need attention or charts that need improvement, 5-15 minutes
  4. Add needed improvements to operational work into the backlog and prioritize
  5. Implement product improvements

Every day.

The distance travelled to make an operational change that improves the reliability of the product or incorporates customer feedback feels a heck of a lot shorter, and more fun.

I’ve never felt more DevOps than when I was Serverless.

A couple of examples…

I added a much-needed caching mechanism to the application on top of DynamoDB in about three days. This made the existing process much more reliable and deterministic.
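A minimal sketch of what a read-through cache on top of DynamoDB can look like. It is not the actual k9 Security code. The item layout (cache_key, value, expires_at), the cached helper, and the InMemoryTable stand-in are all assumptions for illustration. The stand-in mimics only the get_item and put_item calls of a boto3 Table resource, so the example runs without AWS.

```python
import time
from typing import Any, Callable, Dict


class InMemoryTable:
    """Hypothetical stand-in for a boto3 DynamoDB Table resource.

    It mimics only the two calls the sketch needs: get_item and put_item.
    """

    def __init__(self) -> None:
        self._items: Dict[str, Dict[str, Any]] = {}

    def get_item(self, Key: Dict[str, Any]) -> Dict[str, Any]:
        item = self._items.get(Key["cache_key"])
        # Like DynamoDB, omit "Item" from the response when nothing is found.
        return {"Item": dict(item)} if item is not None else {}

    def put_item(self, Item: Dict[str, Any]) -> Dict[str, Any]:
        self._items[Item["cache_key"]] = dict(Item)
        return {}


def cached(
    table: Any,
    cache_key: str,
    compute: Callable[[], Any],
    ttl_seconds: int = 24 * 60 * 60,
    now: Callable[[], float] = time.time,
) -> Any:
    """Return the cached value for cache_key, computing and storing it on a miss."""
    current = int(now())
    item = table.get_item(Key={"cache_key": cache_key}).get("Item")
    # Check expiry on read: even with DynamoDB TTL enabled on expires_at,
    # expired items are removed eventually, not at the moment they expire.
    if item is not None and item["expires_at"] > current:
        return item["value"]
    value = compute()
    table.put_item(
        Item={
            "cache_key": cache_key,
            "value": value,
            "expires_at": current + ttl_seconds,
        }
    )
    return value
```

With a real table, the same function could be handed boto3.resource("dynamodb").Table(...) instead of the stand-in, assuming the table’s partition key is named cache_key.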

I scaled out the application’s processing model to handle 25x more work using a couple of Step Functions and even more caching in less than a week. Details follow if you’re interested; otherwise, skip to the end.

2 Steps to 10x

The application needs to partition and manage work that builds a cache across two main dimensions: customers and each customer’s AWS accounts. There are a few interesting operational properties to this problem:

  • Working set: work to be done for each account regularly varies by more than 10x across customer AWS accounts, and even from day to day
  • Performance: while the problem is embarrassingly parallel, it’s subject to a strict rate limit that caps concurrency at 1 analysis process per customer AWS account
  • Security: need to partition and encrypt data flows by customer

Design v1

The initial design used SNS to distribute work to an SQS queue for each account, which would trigger the application’s analysis function. I knew there would need to be a bunch of failure handling and retry logic. My biggest concern was writing the code to manage all the customer- and account-specific resources, particularly deprovisioning them. I guesstimated 2 weeks for my initial design, meaning I didn’t really know how long it would take (I’m now pretty sure it would have taken longer).

I came up with the Step Function-based approach when challenging my design with “what’s the simplest thing that could possibly work?”

With Step Functions, I expressed the process in 130 lines of YAML using two Map states. The first Step Function (aka state machine) identifies the accounts to analyze, maps over those items, and triggers an instance of a second Step Function that actually does the work. The second state machine identifies all the work to be done for an account and maps through the analysis step, work item by item. (Note: I realize AWS Step Functions is an instance of a Workflow Orchestration tool, not a whiz-bang innovation. I like boring.)
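To make the two-level shape concrete, here is a rough sketch of the two state machine definitions, built as Python dicts and printed as Amazon States Language JSON (YAML expresses the same structure). It is not the real 130-line definition. Every ARN, every state name, the $.accounts and $.workItems paths, and the outer MaxConcurrency of 10 are hypothetical placeholders. The inner MaxConcurrency of 1 reflects the one-analysis-per-account rate limit described above. The sketch uses the Map state’s older "Iterator" field; newer definitions name it "ItemProcessor".

```python
import json

# Hypothetical ARNs, for illustration only.
LIST_ACCOUNTS_FN = "arn:aws:lambda:us-east-1:111122223333:function:list-accounts"
LIST_WORK_FN = "arn:aws:lambda:us-east-1:111122223333:function:list-work-items"
ANALYZE_FN = "arn:aws:lambda:us-east-1:111122223333:function:analyze-item"
ACCOUNT_SM = "arn:aws:states:us-east-1:111122223333:stateMachine:analyze-account"


def per_account_definition() -> dict:
    """Second state machine: list one account's work, then map over it."""
    return {
        "StartAt": "ListWorkItems",
        "States": {
            "ListWorkItems": {
                "Type": "Task",
                "Resource": LIST_WORK_FN,
                "ResultPath": "$.workItems",  # assumes the function returns a list
                "Next": "AnalyzeEachItem",
            },
            "AnalyzeEachItem": {
                "Type": "Map",
                "ItemsPath": "$.workItems",
                "MaxConcurrency": 1,  # one analysis process per account
                "Iterator": {
                    "StartAt": "AnalyzeItem",
                    "States": {
                        "AnalyzeItem": {
                            "Type": "Task",
                            "Resource": ANALYZE_FN,
                            # Retry is declared per item, not per batch.
                            "Retry": [
                                {
                                    "ErrorEquals": ["States.TaskFailed"],
                                    "IntervalSeconds": 5,
                                    "MaxAttempts": 3,
                                    "BackoffRate": 2.0,
                                }
                            ],
                            "End": True,
                        }
                    },
                },
                "End": True,
            },
        },
    }


def fan_out_definition() -> dict:
    """First state machine: list accounts, start one execution per account."""
    return {
        "StartAt": "ListAccounts",
        "States": {
            "ListAccounts": {
                "Type": "Task",
                "Resource": LIST_ACCOUNTS_FN,
                "ResultPath": "$.accounts",  # assumes the function returns a list
                "Next": "AnalyzeEachAccount",
            },
            "AnalyzeEachAccount": {
                "Type": "Map",
                "ItemsPath": "$.accounts",
                "MaxConcurrency": 10,  # illustrative value
                "Iterator": {
                    "StartAt": "RunAccountAnalysis",
                    "States": {
                        "RunAccountAnalysis": {
                            "Type": "Task",
                            # Start the second state machine and wait for it.
                            "Resource": "arn:aws:states:::states:startExecution.sync",
                            "Parameters": {
                                "StateMachineArn": ACCOUNT_SM,
                                "Input": {"account.$": "$"},
                            },
                            "End": True,
                        }
                    },
                },
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(fan_out_definition(), indent=2))
    print(json.dumps(per_account_definition(), indent=2))
```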

While I first had difficulty with the States language, I’m coming around to it and there’s no arguing with the results. I worked out the core of the Step Function-based approach in an afternoon experiment and never looked back.

The Step Function Map state is super-powerful once you get a handle on it. The Map state provides workflow developers a classic map higher-order function to apply a Lambda function to each item in a collection, with a configurable concurrency. The work items are produced by another function or queue. Developers declare failure and retry handling for each application (invocation) of the target function. This means developers and operators get failure isolation and checkpointing for each item of work, as long as the function is idempotent.
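For intuition, the Map state’s behavior can be approximated in plain Python: apply a function to each item with bounded concurrency, retry each item independently, and record an outcome per item. This is a local analogy under stated assumptions, not how Step Functions is implemented. The names map_state, run_with_retry, and ItemResult are made up for the sketch, and it omits the backoff delay between attempts. It also isolates a failed item the way an iteration with a Catch would; without one, a failing item fails the whole Map state.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Optional


@dataclass
class ItemResult:
    """Outcome recorded for a single work item."""

    item: Any
    ok: bool
    attempts: int
    value: Any = None
    error: Optional[str] = None


def run_with_retry(fn: Callable[[Any], Any], item: Any, max_attempts: int) -> ItemResult:
    """Call fn(item) up to max_attempts times; never raise to the caller."""
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return ItemResult(item=item, ok=True, attempts=attempt, value=fn(item))
        except Exception as exc:  # a retry re-runs fn, so fn must be idempotent
            error = repr(exc)
    return ItemResult(item=item, ok=False, attempts=max_attempts, error=error)


def map_state(
    items: Iterable[Any],
    fn: Callable[[Any], Any],
    max_concurrency: int = 1,
    max_attempts: int = 3,
) -> List[ItemResult]:
    """Apply fn to every item, at most max_concurrency at a time.

    Results come back in input order, one per item, so a failure in one
    item does not discard the work already done for the others.
    """
    if max_concurrency < 1 or max_attempts < 1:
        raise ValueError("max_concurrency and max_attempts must be at least 1")
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(lambda it: run_with_retry(fn, it, max_attempts), items))


if __name__ == "__main__":
    def analyze(item: str) -> str:
        # Hypothetical analysis step that rejects one item.
        if item == "broken":
            raise ValueError("cannot analyze")
        return item.upper()

    for result in map_state(["a", "broken", "c"], analyze, max_concurrency=1):
        print(result)
```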

In the end, the work to make additional data cacheable for the analysis took longer than the workflow orchestration piece.

Thanks for making me a 5x-10x developer, AWS Step Functions.

More Importantly

More importantly, serverless has helped me focus on product development and application operations. This has freed me up to deliver many small product, delivery, and operational improvements.

  • Operations is priority #1 and dashboards are reviewed at least once per day, which feeds the backlog
  • The app uses native AWS functionality, particularly ‘serverless’ data sources like DynamoDB and S3 wherever possible; I’m not directly managing clusters of anything
  • Serverless Framework supports defining all kinds of AWS infra right in the serverless.yml using CloudFormation
  • I’m the ‘decision maker’ so no trouble getting permission
  • Delivery time from dev to prod is (still) less than 30 minutes because improving daily work is also a priority and gets regular attention

So I have the capability and autonomy to make changes, and I get fast feedback on how those changes are working.

This is the best product delivery cycle I’ve ever had, even though I’m solo.

I think this is a great setup for ‘full stack’ delivery teams. Serverless technology makes a full stack, fully shared delivery model accessible to more teams, particularly app teams who want to migrate to the Cloud but don’t have a good story around who’s going to handle operations. Serverless approaches provide at least one answer, and it’s getting better each day. If you’re curious about this, feel free to reach out; I’m happy to discuss.