Dynamic Configuration Patterns for Platform Engineering Teams

In Platform Engineering with Vending Machines, Contracts, and Pipelines, I state that “building a platform capable of scale means planning for scale in the first place.” This post goes deeper into the idea of vending machines by looking at the creation, management, and operations around the inputs and outputs that feed the work. We’ll start with an overview of what Dynamic Configuration means to me, then walk through various design patterns I’ve built in production, weighing the pros and cons of each.

Overview

Set a course for adventure!

Platforms need to scale across the Sphere of Key Scaling Components (the core, the ring of services, and the applications or tenants across the edge). This creates a 3-dimensional problem that must be solved with a balance of generality and fine-grained detail. If too much of the sphere is built from one-off code, the team grinds to a halt, becomes overloaded trying to build everything for everyone, and no one is happy. On the other hand, requiring that all teams use the exact same templates and blindly maximizing code reusability results in a giant mess oozing technical debt – and unhappy users.

It’s an interesting problem and one that requires combining a few different ideas into a vision for the future:

  • Gall’s Law: “All complex systems that work evolved from simpler systems that worked. If you want to build a complex system that works, build a simpler system first, and then improve it over time.”
  • Jevons Paradox: “When technological progress increases the efficiency with which a resource is used (reducing the amount necessary for any one use), but the falling cost of use induces increases in demand enough that resource use is increased, rather than reduced.” [Basically, when you make something easier and cheaper to acquire, people will want more of it.]
  • The Cone of Uncertainty

Discrete Units

What we really need is a way to split apart the complexity into two discrete units:

  1. Generality – increased efficiency and output!
  2. Specificity – increased features and user experience!

This is where the idea of Dynamic Configuration patterns comes in handy:

  1. Generality (scale) – handled by a system of well-architected inputs
  2. Specificity (fine-grained detail) – handled by a single abstraction layer focused on use cases (not raw code)

Dynamic Configuration is the contract that exists between the two tension points, scale and detail. If we build a pattern that allows the scale of a system to access and make ephemeral couplings to invoke the detail of a system, then we have built a factory! Or, more correctly, a factory made up of many different factories.

A gigafactory, perhaps? 🤪

Yo dawg, I heard you liked factories …

Considerations

Relevant questions to consider:

  • How do you manage for scale? If you wanted to deploy a change globally, how would you do it?
  • How do you manage for detail? If you wanted to deploy a change specifically, what is possible?
  • Are there use cases where you’d need to manage somewhere in between?
  • How simple is it to make the changes necessary to operate your environment?
  • Can a new engineer learn your platform in an hour, a day, a week, or does it take longer?

Well-Architected Inputs

I think of inputs in terms of a 3-dimensional matrix, like a Rubik’s Cube. This section will focus on the role that inputs play in building a platform.

The Input Matrix

Think about this: everything in a system can be understood by knowing what it is, who it is, and where it is. Put another way, if I have your service, environment, and account, I really don’t need to know anything else about you from a scale perspective, do I? I know enough to understand what you are and “route” you to runners and vending machines that do the grunt work.

From the scale side of the system, we can answer the key questions needed to logically determine which pattern to select and execute:

  • What are you?
    • Oh, you are Service X!
  • Who are you?
    • Oh, you are the DEV environment version of Service X!
  • Where are you?
    • Oh, you are the DEV environment version of Service X running in Account 123!

These inputs are the mirepoix of platforms. 👨‍🍳👌

Scaling, Meet the Matrix

Well-architected inputs heading out for an adventure

I firmly believe that simplicity is vital for scaling a system.

I’ve stored this combination of inputs in GitLab CI variables (both system managed and custom) and AWS Parameter Store. When a vending machine was invoked by a pipeline, the GitLab CI job would “initialize” itself by pulling in the well-architected inputs for the service, environment, and account. This allowed our lightweight Platform SDK to do anything from vending accounts to building and testing application code to deploying to the environments.
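
Here’s a minimal sketch of what that initialization can look like as a GitLab CI job – SERVICE, ENVIRONMENT, and ACCOUNT_ID are assumed to arrive as CI variables, and the image name and platform-sdk command are hypothetical stand-ins for your own tooling:

```yaml
vend-service:
  image: registry.example.com/platform/tools:latest   # platform-owned image with the Platform SDK baked in
  script:
    # Initialize: confirm the three well-architected inputs arrived from the CI variables.
    - test -n "${SERVICE}" && test -n "${ENVIRONMENT}" && test -n "${ACCOUNT_ID}"
    # Hand the inputs to the (hypothetical) Platform SDK to do the grunt work.
    - platform-sdk deploy --service "${SERVICE}" --environment "${ENVIRONMENT}" --account "${ACCOUNT_ID}"
```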

This also gives us the advantage of dynamic control over what is deployed to each service, environment, or account. Globally deploying something means telling the inputs that you want all accounts, services, or environments.

Just want the dev environment for a few services? Supply those inputs.

Want all test environments globally, except for services that end with “G”? Also easy with just a little bit of text-fu (or ChatGPT).

This forms the foundation of platform control.
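
As a hedged sketch of that targeting, a GitLab parallel matrix lets the inputs define the scope of a run – the service names and platform-sdk command below are purely illustrative:

```yaml
deploy-dev-subset:
  image: registry.example.com/platform/tools:latest
  parallel:
    matrix:
      - SERVICE: [orders, payments]   # just these two services
        ENVIRONMENT: [dev]            # just the dev environment
  script:
    - platform-sdk deploy --service "${SERVICE}" --environment "${ENVIRONMENT}"
```

Widen the matrix and you deploy globally; narrow it and you deploy to a single service in a single environment.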

Input Sources

Input management is key

But where do we actually get the information from? Where do the inputs actually live?

Inputs can come from two types of sources – direct or indirect.

Direct inputs come from the source of truth itself. For example, if I’m a CI runner executing in an AWS environment, I know how to look up my own account ID and get the answer I ultimately need: AWS CLI, one quick command, done. We’d likely want that input to come from this source because there’s no abstraction layer required (yay!). We want to declare this type of work using common functions whenever possible.
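
For instance, that one quick command from a runner with the AWS CLI available might look like this (the job name is illustrative):

```yaml
whoami:
  script:
    # Direct input: ask the source of truth (AWS itself) which account we're running in.
    - aws sts get-caller-identity --query Account --output text
```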

Indirect inputs come from an abstraction layer. Take the example of one service needing to look up the execution address for an API gateway hosted by another service. The source service can’t simply ask the other service for its URL (it would need that URL to ask in the first place), but it can use a well-known name built from basic information like the service and environment. I should be able to ask for the URL based on the service and environment and get back the URL I need. DNS is another great example of this.
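
A sketch of one such lookup against AWS Parameter Store – the path convention (/platform/&lt;service&gt;/&lt;environment&gt;/api-url) and the billing service are assumptions; the point is that knowing the service and environment is enough to find the answer:

```yaml
lookup-peer-api:
  script:
    # Indirect input: I know the service and environment; the metadata layer knows the URL.
    - >
      PEER_API_URL=$(aws ssm get-parameter
      --name "/platform/billing/${ENVIRONMENT}/api-url"
      --query Parameter.Value --output text)
    - echo "The billing API for ${ENVIRONMENT} lives at ${PEER_API_URL}"
```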

Metadata Layer

Indirect inputs require a metadata layer that your system can access to find things. I’ve often used GitLab CI variables for this, as they allow me to define inputs at the global, parent, or project level. I can store values here for others to retrieve without having to worry about cross-account permissions, so long as everyone is part of my organization.

Metadata information must follow these rules:

  • Globally unique entries
  • Globally readable index (it’s OK to secure entries using RBAC)
  • Entry mutations come from the source of truth via automation (self managing)

The advantage is that, given a well-known name, a helper function from the Platform SDK makes it easy for jobs to ask questions. When the metadata is structured well, I can use the pieces of information I know to look up the piece(s) I don’t.

Metadata Core in CI/CD

Inputs will ultimately need to make it into the continuous delivery pipeline. I like using a hierarchy of inheritance to control information based on intent: global, parent, and project (application) defined. When the inputs are stored in the CI/CD tool, such as GitLab CI, global uniqueness and global readability are already solved for.

The figure below shows the input hierarchy in action, along with the priority in which inputs are passed to the pipeline:

At a global level, I find it helpful to store things like image names, domain names, and shared services data or URLs – the types of things just about anyone would need regardless of the type of application or which environment it lives in. This is also why I like making a root group in GitLab.

At the parent level, we start to see domain inputs that impact the application holistically and are applied to the “parent” logical project folder for an application: information on internal services, addresses, common environment variables, baseline feature flags, and more. If metadata applies broadly to an application or application set, you’ll find it at the parent group level.

Finally, at the project (application) level, we see information specific to each environment for an application. This information is stored in the account itself so that lookups remain constant across all queries.

If any higher layer conflicts with a lower layer, we take the more granular answer. This provides a mechanism to override the system, such as defining a granular input during a merge request to examine behavior, or passing an overriding input directly into a child pipeline to alter its behavior (usually for testing).
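
For example, an override can ride along on the trigger job for a child pipeline – variables set there arrive downstream as pipeline variables and take priority over group and project defaults (the variable and file names are illustrative):

```yaml
trigger-deploy:
  variables:
    LOG_LEVEL: "debug"        # override the inherited default for this run only
  trigger:
    include: .gitlab/deploy.yml
    strategy: depend
```

Anything not overridden simply falls through to the inherited value.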

Inputs are managed by the Platform SDK, not by hand. Account vending machines build or update the inputs. Application deployments consume (and also update) the inputs, specifically the ones they are responsible for updating as owners of that value. In short, we let the system maintain the inputs and merely act as auditors (observability, growth, maintenance) and developers (overrides, troubleshooting, feature flags).

Metadata Edges in Accounts

Given a dev, test, stage, and prod environment (plus testbeds for short-lived experiments), this gives us a repeatable pattern for storing information inside an account for later retrieval. This extends the metadata layer’s edges into the account and provides a standardized method to store whatever is necessary to deploy and operate the services. The application can also access (and update) these values, which is a lot like being given a virtual environment variables file.

For example, the flowchart above shows that, given a specific domain application (service), each account pulls from the same key name (with the exception of testbeds). Work can be performed within the account independent of the environment, with a slight tweak to the control logic to account for testbed resources. I like altering the string for testbed items to make them obvious and visible to the operations team.
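
Here’s a rough sketch of that control-logic tweak – the path convention and testbed marker are my own assumptions, not a standard:

```yaml
read-account-metadata:
  script:
    # Same key name for dev/test/stage/prod; testbeds get an obvious marker.
    - |
      case "${ENVIRONMENT}" in
        testbed-*) PREFIX="/platform/${SERVICE}/TESTBED-${ENVIRONMENT}" ;;
        *)         PREFIX="/platform/${SERVICE}/${ENVIRONMENT}" ;;
      esac
      aws ssm get-parameters-by-path --path "${PREFIX}" --query "Parameters[].Name"
```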

Tooling Experience

Tools have a huge impact on velocity

What I’ve shown you above can be abstracted away or not, depending on whether your team wants to incorporate an opinionated tool into its release process. I personally prefer to build these input control components into my platform in a way that is both opinionated and boringly simple to iterate and maintain. But it’s also popular to use a PaaS type of experience.

I hold no judgement! 🙂

With that said, GitLab is my tool of choice for a very long list of reasons and based on experiences over the past 9 years. It does 90% of what you need to build software straight out of the box. My clients rave about GitLab once they see it in action. Amazing user experience, really kick-ass dynamic pipelines, a fantastic upgrade path tool, and you can watch (and emulate) the GitLab workflow for building software because they drink their own champagne (and do it publicly). I constantly look at their software delivery workflows for inspiration. If you’re not at least giving them a shot, you’re doing yourself a disservice.

Continuous Delivery

Once we have all the input logic sorted out, we can make the assumption that we’ll have:

  • Service
  • Environment
  • Account

The contract now shifts over to the continuous delivery side of the house. Whenever a pipeline is triggered by a code change, parent pipeline, or other event, we can just assume that the runner code we write will have those pieces of information. This pattern takes an infinite number of permutations and groups them into knowing those three inputs. It greatly simplifies the logic required to DO something.

We’re now in the pipeline’s world and need to think about what it knows.

  • Service – The pipeline was triggered by a specific event, so it knows why it is being run and who is running it; thus, it knows the service.
  • Environment – The pipeline file has a list of steps to follow for getting code to dev, test, stage, and prod, so the pipeline knows the environments and deployment order.
  • Account – Account metadata is available from the parent input, so the pipeline can feed in account IDs with quick lookups.

The pipeline is capable of making decisions and routing work to the right jobs.
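
As an illustration, a single job definition can route itself differently based on the inputs it was handed (the gating choices below are examples, not a prescription):

```yaml
deploy:
  script:
    - platform-sdk deploy --service "${SERVICE}" --environment "${ENVIRONMENT}" --account "${ACCOUNT_ID}"
  rules:
    - if: '$ENVIRONMENT == "prod"'
      when: manual          # a human gates production
    - when: on_success      # everything else deploys automatically
```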

Platform Contract

The first stop is to consult our platform contracts.

Infrastructure as Code

Contracts are bridges between teams

As a reminder, a platform contract usually contains information on these types of outcomes:

  • Network connectivity and security to core services
  • Infrastructure for core services to leverage, such as state buckets, log buckets, cross-account roles, and metadata placement
  • Observability and monitoring deployments
  • CI/CD workers or connections to worker pools (such as container clusters)

For me, this is typically Terraform code being run using a platform-owned image with all of our tools baked in, including the Platform SDK. This contract should be invisible to your application developers. It’s just infrastructure plumbing and should be boringly simple.
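
A minimal sketch of that plumbing as a CI job, assuming a remote state backend keyed by service and environment (the image name, backend key, and directory layout are assumptions):

```yaml
platform-contract:
  image: registry.example.com/platform/terraform-tools:latest   # platform-owned image, tools baked in
  script:
    - cd platform-contract
    - terraform init -backend-config="key=${SERVICE}/${ENVIRONMENT}/platform.tfstate"
    - terraform plan -out=plan.tfplan
    - terraform apply plan.tfplan
```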

The power of the platform contract is that it’s global. Everyone gets the same contract with the platform. Think of it like the USB-C port for your system. We need everyone to plug into the same adapter interface but can offer a wide variety of experiences through that USB-C port!

Configuration Layers

While we want every environment for an application to use infrastructure and configuration that is as identical as possible, that’s not reality. Cost constraints will almost always force tough choices on where to couple things more tightly than you might want in an ideal world. The way we account for that is through dynamic configuration.

The platform contract is the combination of two things:

  • Global platform contract – the things everyone gets
  • Environment specific platform contract – deviation in resources or resource configuration

For example, I worked with a development team that wanted to make sure every domain service had its own private database. Turns out that’s a super expensive way to build and, although conceptually ideal, wouldn’t really give us much in return for the much heavier AWS bill.

So, we built something a bit more shared and coupled but put a lot of thought into how it was done. We treated everyone as a tenant of a shared database service that we offered, and cleanly defined the demarcation points between teams. Plus, we made sure the pipeline created and tested migration scripts to help safeguard each team’s logical database.

With Terraform, this is simple – just add Terraform files that reflect the aggregation of your base contract and any one-off needs or configuration differences between environments. I commonly see this used to control the size of things, such as a small instance type for dev instances versus something larger and more tuned in production.
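
Here’s roughly how that layering can look from the pipeline side – a base var file plus an environment-specific one (file names and layout are assumptions):

```yaml
plan-environment:
  image: registry.example.com/platform/terraform-tools:latest
  script:
    # base.tfvars carries the global contract; the environment file carries the deviations
    # (e.g., dev.tfvars might set a small instance type while prod.tfvars sets a larger one).
    - terraform init
    - terraform plan -var-file="base.tfvars" -var-file="environments/${ENVIRONMENT}.tfvars" -out=plan.tfplan
```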

We put logic like what’s shown above into the Platform SDK (see a pattern here?). This focused the team on improving the value of our SDK tool and then getting that power into the hands of everyone who wanted or needed it.

Service Contract

If platform contracts form the connective tissue between the core layer and services layer, then the service contract picks up at the services layer and builds out all of the infrastructure needed to empower the specific application being published.

A service contract usually contains information on these types of outcomes:

  • Service specific infrastructure, such as a Cognito user pool or a microservice database.
  • Environment specific configuration nuance, such as the type and size of a database for dev versus prod.
  • A bucket used to build artifacts for the application, such as packages or dependencies, with lifecycle rules established.

The beauty here is that we use the exact same input process and pipeline flow to handle the service contract. The only real difference is where the trigger event is coming from. Service contracts are interrogated when source code for an application changes as part of the release process.
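
A sketch of that trigger difference in GitLab CI – the service contract pipeline only fires when relevant paths change (the paths and file names are illustrative):

```yaml
service-contract:
  rules:
    - changes:
        - src/**/*                      # application source changed
        - service-contract/**/*         # or the contract itself changed
  trigger:
    include: .gitlab/service-contract.yml
    strategy: depend
```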

Summary

The adventure continues …

In this post, we went deeper into the idea of Dynamic Configuration by looking at the creation, management, and operations around inputs and outputs that feed the continuous delivery work for applications running on a modern platform.

I think that one of the biggest advantages of the Dynamic Configuration model is the fine-grained control. The platform contract handles the broad strokes, while the service contract fills in the little details. You can target the test and deployment of just about any combination of services, environments, and accounts. Adding new Day 0 domains and deploying Day 1 applications uses the same workflow as Day 2 operations.

It’s all the same day, every day.

✌🤟