Reading time: 2.5 minutes

We create and review designs primarily in our minds. While our expertise is worth a lot, we’re not perfect and neither are our diagrams. We’re still subject to:

  • known-knowns we think are true, but aren’t
  • known-unknowns we uncovered in the design process
  • unknown-unknowns we didn’t foresee

One of the ways we can deal with all this error is to test our design with a prototype.

Prototyping is a way to explore the problem domain and gather feedback from the real world by building an approximate form of the solution you’ve designed.

The design is our hypothesis and the prototype is an experiment that tests it.

We generally start by ‘connecting the dots’: implementing the most important use case end-to-end to:

  1. prove we can actually do it
  2. gather a bunch of information about what the data exchanged between components actually looks like (or needs to look like)

Prototyping in Action

Here’s how this went for a prototype of the AWS access reporting system we are working on.

We are trying to generate a report summarizing the access each AWS IAM user or role has to:

  • AWS Services like S3, RDS, DynamoDB
  • individual resources such as an S3 bucket or RDS database

We understood early that this would require many AWS API calls to retrieve the necessary information about principals and resources, and then to simulate access. In the design process we wrote out the algorithm in pseudo-code, performed a Big-O complexity analysis, and came up with a general equation to describe the expected analysis runtime:

generate_access_summary runtime = get_principals_and_resources_in_account
    + (principals * services * avg simulate_principal_policy runtime)
    + (subset of principals * resources * avg simulate_principal_and_resource_policy runtime)

The last term dominates, as it is O(num_principals * num_resources) in the worst, but relatively common, case of IAM users and roles having full access to an account. Plugging in some modest numbers, like 100 IAM users and roles and 100 buckets, gets you to 10,000 iam:SimulatePrincipalPolicy API calls, which certainly looks bad.
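To make the equation concrete, here is a minimal Python sketch of it. The function and parameter names are illustrative assumptions rather than code from the actual system, and the model assumes every call is made serially.

```python
def resource_simulation_calls(principals: int, resources: int) -> int:
    """Worst case: every principal is simulated against every resource."""
    return principals * resources


def estimate_runtime_seconds(
    principals: int,
    services: int,
    resource_principals: int,  # the "subset of principals" in the equation
    resources: int,
    service_sim_s: float,      # avg simulate_principal_policy runtime
    resource_sim_s: float,     # avg simulate_principal_and_resource_policy runtime
    discovery_s: float = 0.0,  # get_principals_and_resources_in_account
) -> float:
    """Serial runtime estimate, one term per line of the equation."""
    return (
        discovery_s
        + principals * services * service_sim_s
        + resource_principals * resources * resource_sim_s
    )


# 100 IAM users and roles checked against 100 buckets
print(resource_simulation_calls(100, 100))  # 10000
```

The latencies are left as parameters on purpose: at the design stage they are exactly the numbers we don't know yet.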

But how bad? Will the analysis process complete in a reasonable time (where reasonable is within the 15-minute Lambda function execution timeout)?

The answer depends on at least a few things:

  • average latency of the simulate API
  • throughput and rate limit for the simulate API

If the simulate API responds in 10ms with no rate limit, the answer is around 2 minutes (10,000 calls × 10ms = 100 seconds), even for a serial approach.

Turns out both of the factors I’ve brought up are incredibly important.

In our sandbox account, we consistently observe average performance of:

  • simulate_principal_policy: 600ms / request
  • simulate_principal_and_resource_policy: 800ms / request

This request latency information is not publicly available and there’s no committed SLA, as far as I know. So the only way to get the information was to gather it experimentally.
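Gathering latency numbers like these does not take much code. Below is a minimal sketch of a timing harness; `measure_latency` is a hypothetical helper, not our actual tooling, and it times any zero-argument callable so it can be tried without AWS credentials.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(call: Callable[[], object], samples: int = 20) -> Dict[str, float]:
    """Invoke `call` `samples` times and summarize wall-clock latency in seconds."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(durations),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }


# Stand-in workload so the sketch runs anywhere; replace with a real API call.
stats = measure_latency(lambda: time.sleep(0.01), samples=5)
print(f"mean latency: {stats['mean_s'] * 1000:.1f} ms")
```

With boto3, the callable could wrap something like `iam.simulate_principal_policy(PolicySourceArn=principal_arn, ActionNames=["s3:GetObject"])`, where `iam` is a boto3 IAM client and `principal_arn` is a placeholder for a principal in your own sandbox account.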

Also, IAM rate-limits this API action to about 8 requests / second, and of course we’ll want to avoid running close to the limit.

This all works out to about 2 hours to perform a full analysis of who has access to what: 10,000 requests at 800ms each is about 8,000 seconds when made one at a time.
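The back-of-the-envelope arithmetic is easy to re-run as the numbers change. This sketch uses the latencies and rate limit quoted above; the helper names are made up for illustration, and it assumes the 2-hour figure refers to a serial run of the 100 × 100 example.

```python
def serial_runtime_s(requests: int, latency_s: float) -> float:
    """Runtime when requests are issued one after another."""
    return requests * latency_s


def rate_limited_floor_s(requests: int, max_rps: float) -> float:
    """Best possible runtime under a rate limit, however parallel we go."""
    return requests / max_rps


requests = 100 * 100  # 100 principals x 100 resources

# Hoped-for case: 10ms per call, serial. Roughly 1.7 minutes.
print(f"{serial_runtime_s(requests, 0.010) / 60:.1f} min")

# Measured case: 800ms per call, serial. Roughly 2.2 hours.
print(f"{serial_runtime_s(requests, 0.800) / 3600:.1f} h")

# Parallelism helps only up to the rate limit: roughly 21 minutes at 8 requests / second.
print(f"{rate_limited_floor_s(requests, 8) / 60:.1f} min")
```

Note that even the rate-limited floor is longer than the 15-minute Lambda timeout, so parallelism alone would not make the full analysis fit.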

Based on our prototype implementation, we have:

  • limited our request rate to about 5 requests / second
  • implemented principal and resource analysis constraints to keep the average runtime within 10 minutes
  • started thinking (more) about how to decompose and optimize this work; we’ll need it
  • confirmed customers like and can use the information in the report 🙂
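For the request-rate limit in the first point, a simple client-side throttle is enough to get started. The `Throttle` class below is a minimal sketch under our own assumptions (single-threaded caller, evenly spaced requests), not the implementation we shipped. The clock and sleep functions are injectable so it can be tested without real waiting.

```python
import time
from typing import Callable


class Throttle:
    """Spaces successive calls at least 1 / max_rps seconds apart."""

    def __init__(
        self,
        max_rps: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._interval = 1.0 / max_rps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # first call never waits

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve its slot."""
        remaining = self._next_allowed - self._clock()
        if remaining > 0:
            self._sleep(remaining)
        # Re-read the clock in case the sleep ran long or short.
        self._next_allowed = max(self._clock(), self._next_allowed) + self._interval


throttle = Throttle(max_rps=5)
for _ in range(3):
    throttle.wait()  # call the simulate API here
```

Calling `throttle.wait()` before each simulate request keeps a single worker at or below 5 requests / second; several concurrent workers would need to share one limiter.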

Feedback, y’all!

I love design, so feel free to send questions or comments on this series. We’ll be switching topics next week.