I lead a Platform Engineering team in my day job. My specific expertise is in public sector (government) work at enterprise scale. This comes with a very interesting and unique set of requirements and constraints. It’s not an easy job, but I do very much enjoy it and get to work with great people.
It feels really good to directly impact citizens and their communities with well designed and operated technology.
These are some lessons I’ve learned while building highly secure platforms in this environment.
But first, a rant about Platform Engineering
Since vendors love to steal terms and then mutate them into sales tools, let’s level-set on what I consider to be Platform Engineering.
A platform exists to provide consistency across N products or services. Platforms are multi-tenant by design, driven by pipelines, and constructed from units of automation. Thus, there is an engineering team that is accountable and responsible for building this platform and all the underpinnings (landing zones) it needs.
The purpose of the platform engineering team is to make the time between someone (anyone) having an idea, and then seeing if it actually works, as close to zero as possible. Low friction, quick responses, rich feedback. That experience is the engine that powers the development of a product.
Creating and maintaining a platform requires numerous disciplines from what we would traditionally label as cloud engineers, security architects, or DevOps engineers. I normally see these groups being siloed. This is a mistake. Instead of creating silos amongst those roles, a platform engineering team is one team (or collection of pods) that forms the triangle that is development, security, and operations, and expresses that triangle through a consistent, reliable, and performant platform.
Hot take: You don’t need Kubernetes to build a platform. I personally avoid the complexity of such “orchestration Rube Goldberg machines.” Instead, I opt for simple, scalable constructs that require as little operational toil as possible while still allowing for scale and abstraction. Thank you for coming to my TED talk.
If you’re reading beyond this point, I am genuinely surprised.
A Sphere of Key Scaling Components
I think of platforms as spheres with three layers:
- The core stuff sitting at the root that rarely changes (baseline roles, SSO services, etc.);
- A ring of services and accounts used to build and run applications that acts as an integration layer (the CI system, vended accounts, network plumbing);
- The applications or tenants themselves sitting at the edge of the sphere (functions, databases, gateways).
The platform can expand or contract at will, even to the granularity of a single layer of the sphere, with that being one of the major benefits. Plus, the sphere holds its shape, representing the consistency of services across the global footprint.
Starting a New Sphere from Scratch
It can be hard to build this sphere from nothing.
Keep it simple. Great products take 3 versions to make. Form an Idea, Scale it, Abstract it.
The first version is just to see if the idea works. Just remember that the Cone of Uncertainty is never wrong; don’t make one-way door commitments early.
Growing the Sphere
Next, simple needs to become larger and more complex, but should not be complicated.
There are two common friction points in scaling this out from scratch:
- As you build out the first environment for an application, numerous new services will be brought online in rapid order. Plan to solve for scaling the number of services provided to the applications. The core layer shouldn’t require too much attention until …
- … you go to build the next environment. If this is your first time solving the architecture for this, expect to spend a lot of time refactoring how the core and integration layers cooperate with one another. Scaling has shifted from the small things (projects and applications) up to large things (multiple projects across multiple environments). This is a fun and challenging shift for your team to solve; they will be your best data source for pain points.
Once solved, though, these friction points should largely remain off the radar.
Enhancing the Sphere
Enhancements should be regular and impactful as the team learns about unknown unknowns. When vending accounts and environments feels simple and push button, you are in a good state.
Good Code Creates More Good Code
As the consistency layer, a platform team’s code is often the code that checks, validates, and enforces that same consistency for everyone else. It’s hard to build for others to be consistent if you’re not consistent yourself. Build feedback loops to help you understand this journey, like reports, dashboards, and pipeline quality gates.
Have examples of what good code looks like early in the process. Agree on headers, footers, comments, formatters, pre-commit hooks, and just generally “what should our code look like” for everyone to see. Use a folder to store working examples with your code.
Again, the focus is to do this early on in the process. Future refactoring will benefit, as it will be more about consolidation and streamlining of ideas versus having to shuffle around comment blocks.
I’ve had good luck with a versioned specification for things like the infrastructure as code being written. “Go check out Terraform Specification v2.6” is super clear and detailed whereas “go read main.tf from project X on branch Y” is not. It also makes it easier to automate checks against the specification in the pipeline and give the engineer feedback in the report.
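To make that concrete, here is a minimal sketch of what an automated check against a versioned specification could look like. Everything in it is an assumption for illustration: the `Spec` class, the `check_module` function, the required header text, and the list of required files are hypothetical, not the contents of any real specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """A hypothetical, versioned code specification."""
    version: str
    required_header: str              # first line every file must start with
    required_files: tuple[str, ...]   # files every module must contain


def check_module(files: dict[str, str], spec: Spec) -> list[str]:
    """Return findings for the report; an empty list means the module conforms."""
    findings = []
    # Rule 1: every required file must exist in the module.
    for name in spec.required_files:
        if name not in files:
            findings.append(f"[spec v{spec.version}] missing required file: {name}")
    # Rule 2: every file must start with the agreed header line.
    for name, content in sorted(files.items()):
        lines = content.splitlines()
        first_line = lines[0] if lines else ""
        if first_line != spec.required_header:
            findings.append(f"[spec v{spec.version}] {name}: missing header line")
    return findings


# Illustrative values only; a real specification would define its own rules.
SPEC_V2_6 = Spec(
    version="2.6",
    required_header="# Managed by the platform team",
    required_files=("main.tf", "variables.tf", "outputs.tf"),
)
```

A pipeline step could call `check_module` with the files in a merge request and fail the quality gate when the list is not empty. Because each finding names the specification version, the engineer knows exactly which document to read.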
Quality Platform Engineering is Speedy Platform Engineering
Most engineers are distracted by “the shiny things” out there. Admit it, we all are. It’s a hard habit to break. However, I have found a better way in the art of Quality Engineering.
The idea is that for each change you need to make, there are two roles needed to effectively make the change:
- The platform engineer role
  - I make the thing!
- The quality engineer role
  - I provide feedback while you make the thing, and ensure it’s correctly merged into main!
The two work together. They write the story together, scope the work together, define the solution together, and define the tests to validate the solution works together. This isn’t a wall you throw code over, it’s a partnership to be able to go fast and not break things.
When it’s time to merge into main, the quality engineer performs the final checklist of testing and validation, makes sure that there’s a peer review, then ensures the merge is successful and resulted in a passing release.
I’m simplifying things here a bit, but know that a model of this arrangement works well in reducing defects, increasing velocity, and boosting morale. Quality efforts are worth a full blog post.
Another plus point: this model is especially helpful when you’re trying to scale skills. Having senior engineers working every step of the way with junior engineers is a win-win for both parties. It’s rewarding to be part of someone’s growth, and it’s rewarding to grow. This also scales the future state vision across a wider audience.
The Magic Numbers of Scale
There are only a few correct answers to the quantity of something. If you’re ever wondering if you’ve built something for your platform using an optimal pattern, find the right numbers.
There should only ever be ONE source of truth. If you find multiples of the same thing, you have conflict.
One config to reference. One well-defined name for a thing. One path in a tree to traverse. One level of abstraction to peer through. One place for a pipeline to inherit its environment configuration.
If you have more than one of a thing, you cannot know the truth of a thing. You now have an unknown quantity of updates to make, or an unknown number of configurations to keep aligned. Complex logic is required to solve this, to peer into the void. Avoid this.
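One way to spot a violation of ONE is to look for the same setting defined in more than one place. The sketch below assumes configuration sources can be loaded as plain dictionaries; the function name `find_conflicts` and the source names in the usage note are hypothetical.

```python
from collections import defaultdict


def find_conflicts(sources: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each key defined in more than one source to the names of those sources."""
    seen = defaultdict(list)
    for source_name, settings in sources.items():
        for key in settings:
            seen[key].append(source_name)
    # A key that appears in exactly one source has a single source of truth.
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}
```

For example, if a hypothetical `ci_vars` source and a `parameter_store` source both define `region`, the result names `region` along with both sources, which is a candidate for consolidation or for the technical debt backlog.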
A note on abstraction: I firmly believe in the rule of one layer of abstraction. Abstracting an abstraction means you have built a product, and products require a lot of tricky maintenance. In rare cases this is justified. Simple is almost always better than complex, and complex wins against complicated.
If a thing needs to be made more than one time, strive to be able to make it INFINITE times. It doesn’t mean you actually need to MAKE that many things. It just means that the PROCESS to make something should be repeatable and that, as you scale, the same effort can be expressed logarithmically against a finite working set of constructs.
Put another way, if I push the button, 2 or 2 million things should be able to be created with about the same amount of effort.
Write your stories from the perspective of the constructs you make.
Ask yourself these questions:
- What am I? (inputs)
- What should I look like? (blueprint, module, template)
- What version do I use? (lifecycle)
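The three questions above can be sketched as a small data structure plus one repeatable process. This is an illustration under assumptions, not a real provisioning tool: the `Construct` class, the `BLUEPRINTS` registry, and the `account` blueprint are hypothetical, and a string template stands in for whatever module or template format you actually use.

```python
from dataclasses import dataclass
from string import Template

# Hypothetical blueprint registry keyed by (name, version).
BLUEPRINTS = {
    ("account", "1.0.0"): Template("account name=$name env=$env"),
}


@dataclass(frozen=True)
class Construct:
    inputs: dict     # What am I?
    blueprint: str   # What should I look like?
    version: str     # What version do I use?


def render(construct: Construct) -> str:
    """Apply one construct's inputs to its pinned blueprint version."""
    template = BLUEPRINTS[(construct.blueprint, construct.version)]
    return template.substitute(construct.inputs)


def render_all(constructs: list[Construct]) -> list[str]:
    """The same process whether the list holds 2 constructs or 2 million."""
    return [render(c) for c in constructs]
```

Making a thousand accounts is then a matter of supplying a thousand sets of inputs, for example `render_all([Construct({"name": f"team-{i}", "env": "dev"}, "account", "1.0.0") for i in range(1000)])`. The process itself does not change, which is why the remaining toil ends up in managing the inputs.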
Over time, INFINITE becomes trivial and the operational bottleneck of toil shifts towards orchestrating and managing the inputs themselves. I would advise using this pain point as an indicator to refactor whatever is being used for input orchestration (e.g. Parameter Store, CI vars) given the amount of data you’ll have on where the bottlenecks are.
Exceptions to the above quantities are called “technical debt.” 😁
File these in your backlog and fix them when the time is right. There are many times I find myself not caring about having, for example, 2 of a thing for a short period of time because I’m not sure which one I like better.
You’ve reached the end of this post!
A 🍪 for you.