Building a platform capable of scale means planning for scale in the first place. For me, this comes down to the formation of contracts – specifically platform and service contracts – and then using vending machines and pipelines to ensure the contracts are enforced and consistent.
This post provides some thoughts on each of these constructs, how I use them, and where they fit together. This pattern has served me well in designing and building platforms used by developers in a domain-driven model with multiple daily deployments. With that said, your company’s org chart (silos) will have a huge impact on the viability of using these patterns; contracts are worthless if no one abides by them.
Vending machines are the worker bees of a platform. They provide a consistently repeatable process for building and maintaining portions of the platform, with consistency being the key characteristic. A vending machine works by having a set of source code, such as Terraform plans, a script invoking an SDK, or JSON policy statements, stored in a project that accepts inputs to control what is delivered.
Vending machines are triggered by events, such as a change to the code or a human invoking a request via API, which causes a CI/CD pipeline to run through the rules and steps required to perform a deployment. At the end of the pipeline run, a chunk of infrastructure and platform components is created, mutated, or destroyed based on the inputs provided.
Hence the name vending machine – you insert currency and select a product by pressing on the number pad (the inputs) and out pops the drink or food you were looking for (the deployment). The next person to perform the exact same steps gets the exact same results. However, your vending machine is written in code, so you can offer as many different products (and permutations on products) as you need.
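The vending machine property – identical inputs always produce identical products – can be sketched in a few lines. This is illustrative only; the names (`VendRequest`, `vend`) and the shape of the deployment spec are hypothetical, not a real API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VendRequest:
    environment: str   # e.g. "dev", "test", "stage", "prod"
    service: str       # the service or domain being provisioned
    account_id: str    # target cloud account

def vend(request: VendRequest) -> dict:
    """Deterministically translate inputs into a deployment spec.

    No hidden state, no randomness: the same request always yields
    the same spec, which is what makes the process repeatable.
    """
    return {
        "account": request.account_id,
        "stack_name": f"{request.service}-{request.environment}",
        "apply": True,
    }

# The next person pressing the same buttons gets the same product.
a = vend(VendRequest("dev", "payments", "123456789012"))
b = vend(VendRequest("dev", "payments", "123456789012"))
assert a == b
```

The key design choice is that `vend` is a pure function of its inputs; any "permutation on products" is just another input, never a side channel.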
Common vending machine use cases:
- Cloud account provisioning, publishing, retirement, often related to landing zones
- CI images (containers) being built and updated for CI/CD jobs to consume, commonly with domain-specific tools and SDKs added to reduce job runtimes
- Configuration management for projects such as templates, branch rules, or approval rules that are custom to your needs (no vendor is going to provide 100% of your controls in their UI)
- Dynamic network isolation, typically for compliance tooling and to keep your environments fully isolated in a shared network landscape
- Sandbox, testbed, and playground environments that have some level of “auto stop” functionality to save on costs and self-clean
Vending machines have some level of self-service to them. At first, self-service means “platform team members” are the consumers as there is often some wizardry required to kick off a vending machine trigger. Later, it’s pretty common to put a CLI/SDK in front of a vending machine so that platform and product teams can consume the service directly. This more mature offering requires coordination and feedback loops from the product teams, but is well worth the investment.
Now that we know more about vending machines, let’s focus on the “Cloud account provisioning, publishing, retirement” use case. For this, we’ll need to construct contracts!
Contracts are patterns used by vending machines to logically understand what is being built and managed. They form the edge between the logical design (what the inputs are asking for) and the physical design (how to build the thing).
Contracts go by many names; you might hear them as components, modules, or boundaries.
Just like with vending machines, contracts have inputs and outputs. This is how people interact with your contract. A cloud provisioning vending machine, for example, should really only need inputs such as the environment being deployed (dev, test, stage, prod), the service or domain being deployed, and an account ID value to target for deployment. These inputs are fairly simple to understand and provide, and that’s the point – the contract abstracts away infrastructure and platform details as a tradeoff for the consistency being provided.
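Because the contract surface is so small, validating it is cheap. Here is a minimal sketch of guarding those three inputs before vending; the allowed environment names and the 12-digit AWS-style account ID check are assumptions for illustration.

```python
ALLOWED_ENVIRONMENTS = {"dev", "test", "stage", "prod"}

def validate_contract_inputs(environment: str, service: str, account_id: str) -> dict:
    """Reject malformed contract inputs before any infrastructure work starts."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    if not service:
        raise ValueError("service must be provided")
    # Illustrative check assuming AWS-style 12-digit account IDs.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"malformed account ID: {account_id!r}")
    return {"environment": environment, "service": service, "account_id": account_id}
```

Failing fast here keeps bad inputs from ever reaching the physical design.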
Having a finite set of contracts is one of the ways to unlock SCALE. Each contract requires maintenance and support, so be picky about how many contracts you make and what they do. Negotiate with your users!
Contracts are versioned (everything should be versioned). This gives you the opportunity to experiment with new versions, test them, and then roll them out using the deployment model of your choice, such as through feature flags, blue/green, or just a simple rollout from dev to prod.
I prefer to use two types of contracts – a platform contract and a service contract – to represent the bifurcation between “things everyone needs” and “things a service needs.”
In Lessons I’ve Learned Leading a Platform Engineering Team, I talk about “A Sphere of Key Scaling Components.” Given that context, platform contracts form the connective tissue between the core layer and services layer by providing infrastructure to make that connection. We want this contract to be as globally leveraged as possible to ensure that each service is given a consistent experience when accessing core services and being a part of the organizational design. Platform contracts aim to be global.
This platform contract usually contains information on these types of outcomes:
- Network connectivity and security to core services
- Infrastructure for core services to leverage, such as state buckets, log buckets, cross-account roles, and metadata placement
- Observability and monitoring deployments
- CI/CD workers or connections to worker pools (such as container clusters)
A platform contract should be independent of the application environment being supported. Meaning, a platform contract will vend a dev environment account the same way as it does a test, stage, UAT, or production environment account. This is the power of a platform contract – it can be globally adopted to abstract away the infrastructure used to scale and operate the platform itself.
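One way to see that environment independence is that a platform contract's outputs derive only from the account, never from the environment. The naming scheme below is hypothetical, just to make the point concrete.

```python
def platform_outputs(account_id: str) -> dict:
    """Derive the globally consistent resources the platform contract
    guarantees for any account. Note there is no `environment` parameter:
    dev, stage, and prod accounts all get the same treatment.
    Names and the role ARN format are illustrative assumptions.
    """
    return {
        "state_bucket": f"platform-state-{account_id}",
        "log_bucket": f"platform-logs-{account_id}",
        "cross_account_role": f"arn:aws:iam::{account_id}:role/platform-admin",
    }
```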
In especially large environments, you may need to create a few different types of platform contracts to reflect business units, cloud providers, or the next version of a large migration. These tend to happen as the business and politics within shift around or when the organizational structure divides teams into silos (yuck). My advice here is to keep the core components of the contract intact and try to share as much of the architecture as possible, either conceptually or logically, to remain consistent.
If platform contracts form the connective tissue between the core layer and services layer, then the service contract picks up at the services layer and builds out all of the infrastructure needed to empower the specific application being published.
A service contract usually contains information on these types of outcomes:
- Service-specific infrastructure, such as a Cognito user pool or a (micro)service database
- Environment-specific configuration nuance, such as the type and size of a database for dev versus prod
- A bucket used to store build artifacts for the application, such as packages or dependencies, with lifecycle rules established
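The environment-specific nuance is easy to express as data keyed by environment. A quick sketch, with instance classes, lifecycle windows, and bucket naming all placeholder assumptions rather than recommendations:

```python
# Per-environment database sizing -- placeholder values, not advice.
DB_SIZING = {
    "dev":  {"instance_class": "db.t3.micro",  "multi_az": False},
    "test": {"instance_class": "db.t3.small",  "multi_az": False},
    "prod": {"instance_class": "db.r6g.large", "multi_az": True},
}

def service_contract(service: str, environment: str) -> dict:
    """Render the service-specific outcomes for one environment."""
    db = DB_SIZING[environment]
    return {
        "service": service,
        "database": db,
        "artifact_bucket": f"{service}-{environment}-artifacts",
        # Keep prod artifacts longer; clean non-prod aggressively.
        "artifact_lifecycle_days": 90 if environment == "prod" else 14,
    }
```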
Notice that we’re not building the application itself in a service contract. We’re just making sure all the “uniqueness” of an account used by an application and its environment has been applied. There is no need for a developer to deal with these things; the platform team can easily own their creation, lifecycle, and retirement “as a service” to the product teams in partnership with one another.
A major benefit to having a service contract decoupled from the application itself is to break the shear line between these two workstreams. When these are tightly coupled, application releases become tied to a specific set of infrastructure and require a lot of oversight and coordination across multiple teams. Keeping the platform infra loosely coupled allows teams to move faster and more independently, while retaining the safety of the contract. When the service contract needs to change, simply huddle up across teams and align on the change.
Outputs from the service contract are very useful. As the infrastructure is deployed, outputs provide the metadata necessary for the application (and peer applications) to learn about what exists. For example, if I create a new API gateway, I’m also going to get a new execution URL. By publishing that URL to a well-known name stored as metadata (such as to a secrets manager or parameter storage), anyone looking at the metadata will see and begin using the new URL value.
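The well-known-name pattern can be sketched with a toy in-memory store standing in for a real parameter store or secrets manager. The `/platform/{service}/{environment}/{key}` naming convention is an assumption for illustration.

```python
# Toy stand-in for a parameter store (in practice, something like
# AWS SSM Parameter Store or a secrets manager).
PARAM_STORE: dict[str, str] = {}

def publish_output(service: str, environment: str, key: str, value: str) -> str:
    """Publish a contract output under a well-known, predictable name."""
    name = f"/platform/{service}/{environment}/{key}"  # hypothetical convention
    PARAM_STORE[name] = value
    return name

def read_output(service: str, environment: str, key: str) -> str:
    """Peer applications discover outputs by the same convention."""
    return PARAM_STORE[f"/platform/{service}/{environment}/{key}"]

# After vending an API gateway, publish its execution URL:
publish_output("payments", "dev", "api_url", "https://abc123.execute-api.example.com")
```

Consumers never need to know how the gateway was built; they only need the naming convention.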
Whenever we want to feed a set of contracts into a vending machine to build or manage something, we use a pipeline. A pipeline is a set of tasks needed to read the contracts, interpret what needs to be built, and then go build it. Pipelines are the lifeblood of a product. They allow multiple people to work on the design of a product in discrete logical pieces (teams, projects, capabilities) that align to the organization, then receive rapid feedback on the decisions made through batteries of testing. Continuous Integration (CI) Fundamentals has more of my thoughts on pipelines from a logical perspective.
Contracts are built by pipelines; a runtime handling the pipeline tasks will pick up the contracts, read through them, and then determine what needs to be done in order to fulfill the contracts. This could mean invoking a Terraform image to plan or apply infra, or simply updating a set of environment variables for a Lambda function. When the pipeline concludes, the contract should be fully vended and happy, or helpfully pointing out which issue(s) caused a fault. So long as the contract inputs and outputs behave the same, consumers of the contract are not involved with the sausage making that produces the results.
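That read-interpret-dispatch loop can be sketched as a handler table keyed by contract kind. The handlers here are stubs (a real pipeline would shell out to Terraform or call a cloud SDK), and the contract shape is hypothetical.

```python
def apply_terraform(contract: dict) -> str:
    # Stub: in reality, run `terraform plan`/`apply` in a container.
    return f"terraform apply for {contract['name']}"

def update_lambda_env(contract: dict) -> str:
    # Stub: in reality, push environment variables to the function config.
    return f"updated env vars for {contract['name']}"

HANDLERS = {
    "terraform": apply_terraform,
    "lambda_config": update_lambda_env,
}

def run_pipeline(contracts: list[dict]) -> list[str]:
    """Read each contract, pick the right handler, and fulfill it --
    or fail loudly pointing at the contract that caused the fault."""
    results = []
    for contract in contracts:
        handler = HANDLERS.get(contract["kind"])
        if handler is None:
            raise ValueError(f"no handler for contract kind {contract['kind']!r}")
        results.append(handler(contract))
    return results
```

Swapping how a handler does its work never changes the contract's inputs or outputs, which is exactly what keeps consumers out of the sausage making.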
Consistency builds trust, and trust builds products.
Pipelines understand their contracts through inputs, often coming from inherited CI variables and pipeline feature flags. Given that a contract only needs three things for account vending – the account itself, the service being deployed, and the environment associated with the service – these are not too hard to place within a CI/CD hierarchy.
Upon the completion of the pipeline, all artifacts dynamically generated to build and execute the contracts should be saved and archived as part of the SBOM (Software Bill of Materials) and to help with troubleshooting (as needed).
Great platform design is really about having a plan for scale, and scaling a platform comes down to managing inputs and outputs across a self-enforced consistency layer. One of the best ways I’ve found to do this is through the creation and management of contracts for the platform itself and for the services (applications) being built by product teams. Consistency comes from using pipelines and vending machines to give product teams a repeatable, reliable, and performant way to test their ideas and release them to their customers.
And with that …
✌ & 🤟