Infrastructure as code via AWS CDK

Decision

We manage our infrastructure with AWS CDK.

Problems

To solve user problems, we need more than software: we also need to deploy it alongside supporting tools such as databases. We need to reproduce our infrastructure configuration across environments so that we can test changes in a test environment instead of making them directly in production. After a catastrophic failure, we need to be able to recreate the production environment.

Additionally, configuring infrastructure manually is often costly, slow, inconsistent and inefficient.

Context

At the moment, we use both manual configuration and CloudFormation inside AWS, and manual configuration for all other tools (mainly MongoDB and Google Cloud Platform, but also tools like Sentry and LaunchDarkly). The CloudFormation templates are not in source control.

There are also some first small projects written in AWS CDK.

Options

Manual configuration
Infrastructure as Configuration via AWS CloudFormation
Infrastructure as Configuration via Terraform
Infrastructure as Code via AWS CDK
Infrastructure as Code via Pulumi

Reasoning

Why not manual configuration

Manual configuration would not let us manage and review infrastructure changes the way we do for code.

Why Infrastructure as Code and not Infrastructure as Configuration

Being able to write configuration in TypeScript allows us to reuse patterns and knowledge from coding, like extracting reusable components and refactoring existing code without affecting the outcome. TypeScript uses types effectively to document code, so Infrastructure as Code can be picked up more easily by developers who know TypeScript, which makes it easier to spread the practice through our teams.
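
The reuse argument can be illustrated without any CDK dependency. The following is a minimal sketch in plain TypeScript, not the AWS CDK API: `QueueProps` and `queueWithDeadLetter` are hypothetical names, and the output only mimics the shape of CloudFormation resources. In CDK, the comparable unit of reuse would be a construct.

```typescript
// Hypothetical, dependency-free sketch: NOT the AWS CDK API.
// It mimics the shape of CloudFormation resources to show how types
// and functions make infrastructure definitions reusable.

interface Resource {
  Type: string;
  Properties: Record<string, unknown>;
}

interface QueueProps {
  /** Logical name used as a prefix for the generated resources. */
  name: string;
  /** Number of receives before a message moves to the dead-letter queue. */
  maxReceiveCount: number;
}

// A reusable "component": every queue is created with its own dead-letter queue.
function queueWithDeadLetter(props: QueueProps): Record<string, Resource> {
  const deadLetterId = `${props.name}DeadLetterQueue`;
  return {
    [deadLetterId]: { Type: "AWS::SQS::Queue", Properties: {} },
    [`${props.name}Queue`]: {
      Type: "AWS::SQS::Queue",
      Properties: {
        RedrivePolicy: {
          deadLetterTargetArn: { "Fn::GetAtt": [deadLetterId, "Arn"] },
          maxReceiveCount: props.maxReceiveCount,
        },
      },
    },
  };
}

// The same component reused twice; the compiler rejects missing or mistyped props.
const resources: Record<string, Resource> = {
  ...queueWithDeadLetter({ name: "Import", maxReceiveCount: 3 }),
  ...queueWithDeadLetter({ name: "Export", maxReceiveCount: 5 }),
};
```

Changing the dead-letter behaviour in one place would then apply to every queue, which is the kind of refactoring that is awkward in hand-written configuration files.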

Since all infrastructure definitions would be under source control, we gain better visibility and a well-defined process for making changes, including code review and approval. Another advantage of Infrastructure as Code is that we can unit test our infrastructure code, especially by snapshot testing our synthesized CloudFormation templates.
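
The idea behind snapshot testing can be sketched as follows. This is a dependency-free illustration, not the CDK API: `synthesize`, `toSnapshot` and `changedResources` are hypothetical helpers, and the template is a hand-built stand-in. In a real CDK project the template would more likely come from `Template.fromStack(stack).toJSON()` in `aws-cdk-lib/assertions` and be compared with a test runner's snapshot matcher such as Jest's `toMatchSnapshot()`.

```typescript
// Hypothetical, dependency-free sketch of snapshot testing; not the CDK API.
interface Template {
  Resources: Record<string, { Type: string; Properties?: Record<string, unknown> }>;
}

// Stand-in for synthesis: in a CDK project the template would come from the stack.
function synthesize(retentionInDays: number): Template {
  return {
    Resources: {
      AppLogs: {
        Type: "AWS::Logs::LogGroup",
        Properties: { RetentionInDays: retentionInDays },
      },
    },
  };
}

// Stable text form that can be committed and diffed in code review.
function toSnapshot(template: Template): string {
  return JSON.stringify(template, null, 2);
}

// Logical IDs of resources that were added, removed, or modified.
function changedResources(previous: Template, current: Template): string[] {
  const ids = Object.keys(previous.Resources).concat(Object.keys(current.Resources));
  return ids
    .filter((id, index) => ids.indexOf(id) === index) // de-duplicate
    .filter((id) => JSON.stringify(previous.Resources[id]) !== JSON.stringify(current.Resources[id]))
    .sort();
}

const storedSnapshot = toSnapshot(synthesize(30)); // what would be committed
const stillMatches = toSnapshot(synthesize(30)) === storedSnapshot;
const drift = changedResources(synthesize(30), synthesize(7));
```

Here `stillMatches` is true for an unchanged definition, while changing the retention period makes `drift` list the affected logical ID, so an unintended infrastructure change surfaces as a failing test before deployment.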

Why AWS CDK and not Pulumi

AWS CDK has more components available than Pulumi that can be reused or used as a sanity check for our own implementations. This includes both community solutions and solutions from AWS teams. These components often expose functionality as methods that would otherwise require complex additional configuration, e.g. for measuring metrics or managing permissions.

AWS CDK is based on CloudFormation, in which some of our configuration is already written. This makes it easier to migrate parts of the existing infrastructure.

Consequences

We will need to teach across teams how to use AWS CDK so that newly created infrastructure can be defined via AWS CDK. Each team will also have an additional overhead of maintaining code related to infrastructure.

We will need to migrate existing infrastructure gradually.

Since not all options available in the AWS Console are available in AWS CDK yet, we will not always be able to use the very latest features.

How do we implement this change?

We do not need to roll out this change everywhere at once and teams can start implementing in their areas on their own timelines.

We can implement changes by package, and even for parts of a package at a time. The next packages to be implemented could be the admin-dashboard with its deployment, hermione, and an improved way to run jobs.

We already have some AWS CDK implementations that can serve as examples for future work:

For the sfmc-custom-activity and the client-dashboard-2, the infrastructure code lives in an infrastructure folder while the application code lives in a src folder.

Both access and powerbi-tools mainly consist of infrastructure code which lives in their src folders.

There is no example yet where one package deploys application code to infrastructure defined in another package, but it could look similar to the existing code for the monorepo application, which deploys to existing infrastructure.

Over time, we might need to create custom resources in AWS CDK to manage non-AWS infrastructure, e.g. GitHub infrastructure.

Who will implement the change?

Some first examples are already part of the codebase. Currently, Daniel (the CTO) is working on moving over the infrastructure for Hermione into AWS CDK. Another starting point is the Collect team that can use AWS CDK to set up infrastructure for improvements on the job runner.

Moving the admin-dashboard into this repository is also a good opportunity for the Deliver team to move its infrastructure into AWS Amplify and AWS CDK with Daniel's help.

How do we teach this change?

Daniel will run a workshop on Learning Friday in September and share learning material like the AWS CDK Reference Documentation so everyone can continue learning on their own time.

Daniel will also set up dedicated examples and learning paths in the codebase to learn from.

New joiners will learn about the need to understand this technology from this ADR when going through the tech radar.

AWS CDK will be mentioned as a technology we use in future job descriptions to encourage people with experience in this technology to apply.

What could go wrong?

Not all infrastructure, not even all AWS infrastructure, can be managed in AWS CDK. For some, writing custom resources is a viable workaround, while for others we still need clickops for setup. If too much of the infrastructure cannot be easily handled via AWS CDK, then the benefits of a central overview of infrastructure might be lower than the additional effort it requires.

AWS CDK specifically, and working with infrastructure in general, are both skills that not everyone in Product has yet. Managing infrastructure in code opens up access to it to a lot more people, which could lead to bigger problems: an error in an app can be rolled back more easily than a mistakenly dropped database.
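
One way to reduce the risk of a mistakenly dropped database is an automated check on the synthesized template. The sketch below is a dependency-free assumption about how such a check could look: `unprotectedStatefulResources` and the `STATEFUL_TYPES` list are hypothetical, and which resource types count as stateful would be our own decision. `DeletionPolicy` is the CloudFormation resource attribute that controls what happens to a resource when it is removed from a stack.

```typescript
// Hypothetical, dependency-free sketch: a check over a CloudFormation-shaped template.
interface TemplateResource {
  Type: string;
  DeletionPolicy?: string;
}

// Assumption: this list is illustrative; the team would decide what counts as stateful.
const STATEFUL_TYPES = ["AWS::RDS::DBInstance", "AWS::DynamoDB::Table", "AWS::S3::Bucket"];

// Logical IDs of stateful resources that would be deleted together with their stack.
function unprotectedStatefulResources(resources: Record<string, TemplateResource>): string[] {
  return Object.keys(resources)
    .filter((id) => {
      const resource = resources[id];
      if (resource === undefined || STATEFUL_TYPES.indexOf(resource.Type) === -1) {
        return false;
      }
      // Treat "Retain" and "Snapshot" as protected; anything else is flagged.
      return resource.DeletionPolicy !== "Retain" && resource.DeletionPolicy !== "Snapshot";
    })
    .sort();
}

const offenders = unprotectedStatefulResources({
  Orders: { Type: "AWS::DynamoDB::Table", DeletionPolicy: "Retain" },
  Uploads: { Type: "AWS::S3::Bucket" },
  Worker: { Type: "AWS::Lambda::Function" },
});
```

In this example only `Uploads` is flagged: `Orders` is retained and `Worker` is not stateful. A unit test could fail the build whenever the returned list is not empty.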

Not everyone might be interested in learning a new skill or in spending the time to learn infrastructure and AWS CDK.

What do we do if something goes wrong?

We can roll out infrastructure via AWS CDK in small steps. If we no longer want to manage a piece of infrastructure via AWS CDK, we can still manage it manually as before. We could also export the infrastructure from CDK as AWS CloudFormation templates, which CDK synthesizes for every stack (for example via cdk synth).

What is still unclear?

No open questions.