Using Developer Methodologies to Build Collaborative Operations

When I first wrote the post entitled Why You Should Embrace Version Control for Operational Artifacts, I had no idea that it would spawn an entire series on writing code against RESTful APIs. I began writing the content down to express what I had learned from working in a team of engineers who were primarily infrastructure-focused. It’s been almost a year since that post was released and I’ve taken a few more steps that I’ll share here.

If I could write the post and series over again, I think it would focus more heavily on the debate about whether operations folks need to learn how to write code. I hear one line a lot, and it came up at the Interop Ask the Experts panel: “You need to become a developer!”

I don’t agree with that statement. It’s missing a few words. Let me try to fix it:

You need to learn the tools and methodologies that developers use to collaborate!

– Me

There, that’s much better! Let me dig into the guts of this idea.

Collaboration for Operations

Historically, the group of people who make up Operations have a collection of technology silos populated with skilled individuals who work independently. When asked to do something – often from a ticket or project queue – an individual will go and do it, like setting up a server for running SQL or assigning a permission to a user account. When it’s done, the assigned task is marked completed. Perhaps people are notified (if they watch the queue / project). Perhaps not.

These tasks are treated as individual operations that have a beginning and an end. The actual work performed is often a mystery. How did that SQL server get configured? What permissions were actually assigned to that user account? The other team members don’t know. The work was done. We assume it was done well. Other tasks enter the funnel and get assigned. The wheel turns.

The primary issue with this model is that work cannot be audited, reviewed, or logged in a collaborative way. Sure, there are logs in some syslog dumpster somewhere. And audit logs. And change approval boards. But these are broken, fragmented collection systems that are only addressed when things go wrong. This reinforces a behavior of plodding along until a fire breaks out, and then working reactively to determine why.

Reviewing a change with the CAB

Pull Requests and Testing

In the development world, changes are often scrutinized in two ways: pull requests to merge changes into pre-existing code, and automated testing to determine if the change produces the desired outcomes. In simpler terms, a pull request is a way of saying “I have made these changes and I want you to pull them into the existing environment.” People don’t push changes to you; they invite you to pull them in yourself. Testing is a bit more obvious – changes are executed against a battery of automated tests to see if the outcomes match expectations. If the smoke alarm should report false after a change, but instead reports true, there’s a good bet that the change would cause a fire.
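To make the smoke alarm idea a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the check names, the shape of the `state` dictionary, and the `failed_checks` and `change_is_safe` helpers are my own placeholders, not part of any particular testing tool. The point is only that each expectation is written down as a check, and a change is judged by whether the observed state still satisfies all of them.

```python
from typing import Callable, Dict, List

# Hypothetical desired-state checks. Each one takes the observed state of the
# environment after a proposed change and returns True when the outcome
# matches expectations.
Check = Callable[[Dict[str, object]], bool]

CHECKS: Dict[str, Check] = {
    # The smoke alarm should report False after any change.
    "smoke_alarm_is_quiet": lambda state: state.get("smoke_alarm") is False,
    # An assumed example of a second expectation.
    "sql_service_running": lambda state: state.get("sql_service") == "running",
}


def failed_checks(state: Dict[str, object]) -> List[str]:
    """Return the names of all checks that the observed state fails."""
    return sorted(name for name, check in CHECKS.items() if not check(state))


def change_is_safe(state: Dict[str, object]) -> bool:
    """A change passes only when every check holds."""
    return not failed_checks(state)
```

In practice the state would be gathered from the real environment (or a test copy of it) after applying the change, and a failed check would block the pull request rather than page someone at 3 AM.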

Note: Interested in distributed testing? I suggest taking a look at Matt Oswalt’s ToDD (Test on Demand … Distributed) project.

If the proposed change passes a review and the automated tests, the change is incorporated into the environment. Some changes are easy to accept, such as updating documentation due to a spelling error. Other changes are a bit more risky and require more scrutiny, such as any time someone wants to add rm -rf to a script. You’d probably want to push back on that change. 🙂
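Part of that scrutiny can be automated so reviewers know where to look first. The sketch below is an illustration only: `flag_risky_lines` is a hypothetical helper, and the regular expression is deliberately narrow (it only catches rm with the r and f flags combined, as in -rf or -fr). It scans the added lines of a unified diff and reports any that contain the pattern.

```python
import re
from typing import List, Tuple

# Illustrative pattern: "rm" followed by a single flag group containing both
# r and f, in either order (for example -rf, -fr, -rfv).
RISKY = re.compile(
    r"\brm\s+-(?:[a-zA-Z]*r[a-zA-Z]*f|[a-zA-Z]*f[a-zA-Z]*r)[a-zA-Z]*\b"
)


def flag_risky_lines(diff_text: str) -> List[Tuple[int, str]]:
    """Return (line number within the diff, added text) for risky additions."""
    flagged = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Added lines start with "+"; "+++" is the new-file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            if RISKY.search(line):
                flagged.append((number, line[1:].strip()))
    return flagged
```

A check like this doesn’t replace the human review. It just makes sure the scary line gets a second pair of eyes before anyone clicks merge.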

This process is in stark contrast to individual members of the operations team doing tasks independent of one another. Instead, the operations team proposes changes, tests them, reviews them, and then commits them to the infrastructure. Examples include placing Ansible Playbooks into a Git repository and requiring that changes come through pull requests and are reviewed by members of the Operations team before being merged.

Building a History

As changes begin to percolate into the environment, a history (log) of changes forms. Rather than relying on tribal knowledge, a ticket queue, or some other platform, the version control system becomes the point of truth for historical flow. Each change is visible down to the very artifacts (files, code, etc.) that were changed. You can see who proposed the change. And who accepted (merged) the change. And any discussion related to the change – because sometimes the proposed change goes through several iterations before it is accepted (merged).
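Here is a small sketch of reading that history back out, assuming the repository is Git. The input is the text you would get from `git log --pretty=format:"%h|%an|%cn|%s"`, which prints the short commit hash, author name, committer name, and subject for each commit. The `parse_history` and `changes_by` helpers are hypothetical names of my own, and treating the committer as “who accepted the change” is an assumption that only roughly holds; depending on how merges are performed, the committer may be a hosting service rather than a person.

```python
from typing import Dict, List

FIELDS = ("commit", "author", "committer", "subject")


def parse_history(log_text: str) -> List[Dict[str, str]]:
    """Parse 'hash|author|committer|subject' lines into dictionaries."""
    records = []
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split at most three times so a "|" inside the subject is preserved.
        parts = line.split("|", 3)
        if len(parts) != len(FIELDS):
            continue  # skip lines that do not match the expected format
        records.append(dict(zip(FIELDS, parts)))
    return records


def changes_by(records: List[Dict[str, str]], person: str) -> List[str]:
    """List the subjects of changes proposed (authored) by one person."""
    return [r["subject"] for r in records if r["author"] == person]
```

The discussion around each change lives in the pull request itself rather than in the commit log, so this only covers the who and the what.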

Delicious!

This solves that nagging question: why? In operations, it’s often imperative to understand why something was done. History is a wonderful teacher. It can reveal to more junior staff members:

  • Why the senior staff members are making changes.
  • How those changes were brought about.
  • What other people thought of those changes.

Junior staff members can begin to learn the why and the how of data center operations. Goodbye, tribal knowledge (somewhat).

Remembering that Changes are Global

The final key idea to grasp is that changes in any data center are global. There is no such thing as affecting only one piece of infrastructure. Every component generates heat, pulls power, connects to a network, shares data, and creates logs. It does things. And the things that your infrastructure does affect other pieces, too. Thus, treating each task like an independent action is folly. Changes should be analyzed at the data center level. If I add rm -rf to this script, what is the big picture?

This is where visibility of change becomes vital, and ties back to the idea that operations folks can learn a lot from the collaborative nature of writing code. Change should be highly visible. Not just the requirements and outcome of a change – which is the abstracted state of a change approval board – but the actual changes themselves. Otherwise, you risk having a number of tribal knowledge experts, such as the one in The Phoenix Project, who bottleneck the operational workflows and inadvertently break things along the way.

While you don’t need to become a developer, learning the methodologies and tooling used by folks who code is the way modern data center operations are moving. Building an operational environment that is focused on collaboration, historical data, and being highly visible should be your goal. I’d suggest starting the journey with a distributed version control system, such as Git, and working from there. 🙂

Next Steps

Please accept a crisp high five for reaching this point in the post!

If you’d like to learn more about Continuous Integration, or other modern technology approaches, head over to the Guided Learning page.

If there’s anything I missed, please reach out to me on Twitter. Cheers! 🙂