In Lessons I’ve Learned Leading a Platform Engineering Team, I tease out various learnings that I wanted to capture for future me. And that’s all well and good if you have the context and experience in designing, building, and maintaining platforms. But most folks do not.
This post aims to address how Platform Engineering works in the real world. I’ll do my best to break down the important bits.
Guiding Principles of Platform Engineering
You and your organization want to build something. That something is a piece of software. So, you gather up a bunch of people who know how to build software and say “go build me the software.”
Themes are created to capture the major pieces and parts of the software. Epics are squeezed out of Themes to add scope and timelines to the work. Stories are written to fulfill the Epics. All of this is organized using Jira and divided up into Sprints that last 2 weeks.
If software were a car, you’ve just designed a concept car. It’s a fast car that has a lot of features – power windows, fancy view screen, panoramic sky views, powerful brakes – but it’s still a car. Which means the car must be built, and it must have roads, and there must be constraints placed upon the roads to keep the car safe. The system is based on the rules of a car.
This statement always reminds me of The Matrix 😊
There’s a building. Inside this building there’s a level where no elevator can go, and no stair can reach. This level is filled with doors. These doors lead to many places, hidden places, but one door is special. One door leads to the Source. This building is protected by a very secure system. Every alarm triggers the 💣. But like all systems it has a weakness. The system is based on the rules of a building. One system built on another.
Keymaker, The Matrix Reloaded
The scale part
In fact, not only do we need to build the car but we need to build a LOT of VERSIONS of the car to see if we designed it correctly. Likely we need to build 10s or 100s of thousands of iterations on the car to get it right.
So we don’t just need a car. We need a FACTORY that builds cars. And that factory must be able to build a lot of different VERSIONS of a car, all at the same time, without the designers (the software engineers) all bumping up against each other.
We also need those roads I talked about. Service roads for factory worker bots to traverse, interior roads for the designers to test drive on, highways to speed test the vehicle for performance and drag, toll roads to ensure people pay for the right to drive the car down certain paths, and so on.
This is where lots of memes describe how hard it is to build factories. It is, in fact, not that hard. It is HARD WORK, yes, but the idea of building factories is one built upon generations of knowledge and understanding. If you don’t want to build factories and would prefer to just “buy something from a vendor” that does the work, you’re either in the wrong field or at a scale where there is no complexity to solve.
Back to guiding principles
With all that said, let’s return to the point of this section: the guiding principles of Platform Engineering.
There are three. Platform Engineers are:
- The people who build the factories that build the car
- The people who build the roads for the car to drive upon
- The people who strive to make it really, really simple and fast to build, test, and deploy cars
Platform Engineers are PRODUCT DESIGNERS, and their PRODUCT is one that builds someone else’s product at scale.
If you’ve heard people going back and forth about “Is the platform a product?” then the answer is very obviously and definitively YES. Platforms are PRODUCTS. If you don’t believe platforms are products, you’re either doing it wrong (which happens a lot) or are yourself wrong. I’m looking at YOU, tech vendors that want to sell magic beans and 💩 influencers 💩 that don’t actually build anything beyond a home lab environment and PowerPoint slides.
Great Platform Engineers are Collaborative Generalists
Given the above guidance, it makes sense to select the right people to do this work.
You want Collaborative Generalists.
These are the type of engineers that have experience across a fairly wide breadth of technology and technology disciplines. They intrinsically see a bunch of moving parts not as individual objects but as a workflow to connect, integrate, and orchestrate. The team is capable of swarming on new problems, pivoting based on the data they find, and then translating a bunch of requirements and constraints from software specialists (the folks making the car) to build something that works well for all.
Tim Cain has a great video on generalists and their decline below. The point about generalists being the real force multiplier and the best tool makers hits home because it’s true. And it’s true for any software team dynamic, not just in the gaming industry.
Generalists are the people you want building platforms.
Specialists build for themselves. Generalists build for others. Not always, but typically.
How to Build a Platform Product
Given that platforms are products, how does one build a platform? The same way as with a product!
The first three versions of a product:
- Prototype an idea
- Scale the prototype
- Abstract the scale
Version 1: Prototype
I tend to call this the “bootstrap” phase of product development. Sprint burndowns will look like cliffs with a lot of jagged spikes all over the place, and that’s OK.
Critical components to build:
- Constructing a team contract for design principles, design flows, and an operating manual for how the team (and the platform) work together
- Constructing an Architecture Decision Record (ADR) library and how to propose, review, accept, reject, and supersede design records
- The beginnings of your libraries – pipeline libraries, module libraries, role libraries, and any other “when we need a pattern to use, we go here” libraries
- Landing zone development, network architecture, account hierarchy, and other fundamental building blocks used to provide global services across your platform
- Pipelines to continuously deploy core infrastructure (don’t worry about the applications just yet, focus on the core services needed)
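The ADR lifecycle in that list (propose, review, accept, reject, supersede) can be sketched as a tiny state machine. This is only an illustration of the idea, not a tool I'm prescribing: the `Status` names, the `ADR` class, and the transition table are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PROPOSED = "proposed"
    IN_REVIEW = "in_review"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    SUPERSEDED = "superseded"


# Hypothetical transition table: which statuses may follow the current one.
ALLOWED = {
    Status.PROPOSED: {Status.IN_REVIEW},
    Status.IN_REVIEW: {Status.ACCEPTED, Status.REJECTED},
    Status.ACCEPTED: {Status.SUPERSEDED},
    Status.REJECTED: set(),
    Status.SUPERSEDED: set(),
}


@dataclass
class ADR:
    number: int
    title: str
    status: Status = Status.PROPOSED
    superseded_by: Optional[int] = None
    history: List[Status] = field(default_factory=lambda: [Status.PROPOSED])

    def move_to(self, new_status: Status, superseded_by: Optional[int] = None) -> None:
        """Apply a transition, refusing any move the table does not allow."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"ADR-{self.number}: {self.status.value} -> {new_status.value} is not allowed"
            )
        if new_status is Status.SUPERSEDED and superseded_by is None:
            raise ValueError("a superseded ADR must point at its replacement")
        self.status = new_status
        self.superseded_by = superseded_by
        self.history.append(new_status)
```

The useful part is less the code than the agreement it encodes: a record never silently changes, and a superseded record always points at whatever replaced it.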
Signs that you are finished with version 1:
- We can use this platform when we use it the way it was designed!
- 💎 WE KNOW WE’RE ON THE RIGHT PATH! 💎
- (But it doesn’t really scale well and some parts are still janky)
- Platform core services are built via Continuous Deployment
At this point it’s time to design version 2 of the platform product.
Version 2: Scale
The next version of your platform product should be aimed at scale. That means taking what you’ve learned and built in the first version, largely throwing away the raw code, and refactoring the LEARNINGS from that code into new libraries and new pipelines.
Critical components to build:
- Integration services begin taking shape, being able to connect large pools of core services to specific applications and application teams
- Service blueprints to act as your 📜 CONTRACTS with application services – it is infinitely easier to have a standard contract used by services than to wait for something to break
- Vending machines to manage all aspects of control for the platform, such as account vending, approval rules, protected branches, and so forth (I also think of vending machines as CONSISTENCY MACHINES)
- Tools to enforce PROCESS, such as scans to detect drift, scans to detect when deviation from specifications occurs, and other pieces of feedback that act as quality gates when change is introduced
- Pipeline feature flags to control the behavior of jobs, runners, and dependencies
- Note: this is different from application feature flags, which control the application’s feature availability
- Simplified data collection for everything you build, so that data such as logs is available to anyone who needs it, even if there isn’t a 😁 single glass of pain 😁 for everyone to view just yet!
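To make "blueprint as contract" and "scan for drift" a bit more concrete, here is a minimal sketch. Everything in it is hypothetical: the `BLUEPRINT` fields, the runtime names, and both function names are stand-ins I picked for illustration, not a real schema.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical blueprint: the fields every application service must declare.
BLUEPRINT = {
    "required": ["name", "owner", "runtime", "health_check_path"],
    "allowed_runtimes": ["python3.12", "go1.22"],
}


def contract_violations(manifest: Dict[str, Any]) -> List[str]:
    """Return readable violations of the blueprint; an empty list means compliant."""
    problems = [
        f"missing field: {key}"
        for key in BLUEPRINT["required"]
        if key not in manifest
    ]
    runtime = manifest.get("runtime")
    if runtime is not None and runtime not in BLUEPRINT["allowed_runtimes"]:
        problems.append(f"unsupported runtime: {runtime}")
    return problems


def detect_drift(
    declared: Dict[str, Any], observed: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Map each drifted key to (declared value, observed value)."""
    keys = declared.keys() | observed.keys()
    return {
        key: (declared.get(key), observed.get(key))
        for key in sorted(keys)
        if declared.get(key) != observed.get(key)
    }
```

Run as a pre-merge check, a non-empty result from either function becomes the quality gate: the application team gets the feedback without anyone on the platform team being in the loop.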
I should probably spend an entire blog post just going over pipeline feature flags and how they work. Feature flags are insanely good.
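Until that post exists, here is a rough sketch of the basic idea: a flag decides whether a pipeline job runs at all. The job names, flag names, and the `plan_jobs` function are assumptions for illustration only, and a real pipeline tool would express this in its own configuration language.

```python
from typing import Dict, List, Optional

# Hypothetical pipeline definition: each job names the flag that enables it.
# A job whose flag is None always runs.
JOBS: List[Dict[str, Optional[str]]] = [
    {"name": "lint", "flag": None},
    {"name": "unit_tests", "flag": None},
    {"name": "container_scan", "flag": "enable_container_scan"},
    {"name": "deploy_canary", "flag": "enable_canary"},
]


def plan_jobs(flags: Dict[str, bool]) -> List[str]:
    """Return the names of the jobs to run; flags that are not set default to off."""
    return [
        job["name"]
        for job in JOBS
        if job["flag"] is None or flags.get(job["flag"], False)
    ]
```

Calling `plan_jobs({"enable_container_scan": True})` adds the scan job to the always-on jobs, while the canary job stays dark until its flag is flipped. New pipeline behavior ships in the codebase first and is switched on later.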
Signs that you are finished with version 2:
- Application services in the lowest / first environment (e.g.
dev) are built via Continuous Deployment
- Pre-merge feedback is provided to application teams without day-to-day platform involvement
- Quality gates are enforced before code is promoted without day-to-day platform involvement
At this point it’s time to design version 3 of the platform product.
Version 3: Abstract
This is the really hard part. Abstraction is only possible when state is removed. Removing state requires three-dimensional thinking because there are X services, across Y versions of each service, across Z environments (usually 3 or 4).
Version 3 is all about finding every little bit of state that you can and putting it somewhere well-known.
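A small sketch of what "three-dimensional" means in practice: every piece of state is addressed by service, version, and environment, and it lives in one known place. The `StateStore` class and the example service names are hypothetical, and an in-memory dictionary stands in for whatever storage you actually use.

```python
from itertools import product
from typing import Dict, Tuple

Key = Tuple[str, str, str]  # (service, version, environment)


class StateStore:
    """One well-known place for state, addressed by all three dimensions."""

    def __init__(self) -> None:
        self._cells: Dict[Key, Dict[str, str]] = {}

    def put(self, service: str, version: str, environment: str, **values: str) -> None:
        """Create or update the state held for one (service, version, environment)."""
        self._cells.setdefault((service, version, environment), {}).update(values)

    def get(self, service: str, version: str, environment: str) -> Dict[str, str]:
        """Return a copy of the state for one cell; empty if nothing is stored."""
        return dict(self._cells.get((service, version, environment), {}))


services = ["cart", "search"]
versions = ["1.4.0", "1.5.0"]
environments = ["dev", "test", "prod"]

# Every combination is a cell that may hold state: X * Y * Z of them.
cells = list(product(services, versions, environments))

store = StateStore()
store.put("cart", "1.5.0", "prod", image_digest="sha256:example")
```

Even this toy case has 2 x 2 x 3 = 12 cells. The number grows multiplicatively with real service counts, which is why state scattered through pipeline code stops being manageable.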
Critical components to build:
- Stateful data is stored in sharded matrices
- Stateful data is mutated by the source of truth explicitly
- A mechanism to control pipeline feature flags across a complex ecosystem of services, which allows for one pipeline codebase (a mono-repo) to be used across essentially everything on the platform
- End-to-end pipeline workflows capable of deploying to all environments based on source code mutation triggers
- A well-defined release process using protected branches, naming consistency, tagging consistency, and a publishing pipeline
- A well-defined hotfix contract exists to pull code into the right place given the severity of the risk before the risk happens (don’t try to invent this during a showstopper issue, it will suck)
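One way to picture the flag control mechanism from that list: a single pipeline codebase reads its flags from layered settings, where the more specific layer wins. This is a sketch under my own assumptions. The layer order, the flag names, and the service name are all invented for the example.

```python
from typing import Dict, Tuple

# Hypothetical flag layers, from least to most specific.
DEFAULTS: Dict[str, bool] = {"enable_canary": False, "enable_container_scan": True}
BY_ENVIRONMENT: Dict[str, Dict[str, bool]] = {"prod": {"enable_canary": True}}
BY_SERVICE: Dict[str, Dict[str, bool]] = {"legacy-billing": {"enable_container_scan": False}}
BY_SERVICE_AND_ENV: Dict[Tuple[str, str], Dict[str, bool]] = {
    ("legacy-billing", "prod"): {"enable_canary": False},
}


def resolve_flags(service: str, environment: str) -> Dict[str, bool]:
    """Merge the layers for one service in one environment; later layers win."""
    resolved = dict(DEFAULTS)
    resolved.update(BY_ENVIRONMENT.get(environment, {}))
    resolved.update(BY_SERVICE.get(service, {}))
    resolved.update(BY_SERVICE_AND_ENV.get((service, environment), {}))
    return resolved
```

Because the overrides are data rather than branches in the pipeline code, the same codebase can serve every service, and a rollout becomes a change to one of these tables.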
Signs that you are finished with version 3:
- Application services in all environments (e.g. prod) are built via Continuous Deployment
- 80%+ of work is spent on iterating and growing libraries and then controlling rollouts via pipeline feature flags
- Error budgets and pipeline performance requirements control prioritization and focus of the platform team’s work
- ANYONE can pull up a dashboard showing what is going on in any environment and for any service without really knowing what the heck is going on
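For the error budget item, the arithmetic is simple enough to sketch. The assumption here is that the budget is defined on pipeline run success rate. Your team may define it on something else entirely, and the `error_budget` function name and its fields are mine, not a standard.

```python
from typing import Dict, Union


def error_budget(slo: float, total_runs: int, failed_runs: int) -> Dict[str, Union[float, bool]]:
    """Compute how much of a failure budget a pipeline has used.

    slo is the target success ratio, for example 0.99 for 99%.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be between 0 and 1, exclusive")
    # The budget is the share of runs that are allowed to fail.
    allowed_failures = (1.0 - slo) * total_runs
    remaining = allowed_failures - failed_runs
    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        "exhausted": remaining < 0,
    }
```

A 99% success target over 2,000 pipeline runs allows about 20 failures. While budget remains, the team keeps building libraries and features. Once it is exhausted, reliability work moves to the front of the queue.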
If politics do not allow for Continuous Deployment of the application, the platform should still ENABLE you to use Continuous Deployment later while providing the flexibility to choose a path where someone presses the big, red “deploy to production” button. Continuous Delivery should be the bare minimum here.
Platform Engineering takes the best of a lot of different theories, concepts, patterns, and disciplines and lays them across the product development mindset to come up with an amazing product, built by amazing people, that does amazing things.
I’ve spent over 20 years of my life talking about how toxic and horrible siloes are, and platform engineering offers a unique solution to building without siloes. It was the entire point behind the Datanauts Podcast, for example. No more “well you’re a DevOps engineer and you’re a network engineer and you’re the specialist that does the database thing, and we’re just going to throw meetings at each other to make something.” That’s the wrong way to build. Don’t do that.
Put those who build platforms on a team. Collaborate to build something amazing.
Platforms are the right way to build if you want it done right.
And with that …