
The HiSSS of Infrastructure - Part 2

In the first installment of this series, I outlined my Infrastructure Management methodology, called HiSSS. In that first posting we talked about the concept of High Availability. In this segment I'm going to tackle the notion of Stability.

Stability is pretty self-explanatory: simply put, you don't want a system that is repeatedly tipping over. Just as we might say an athlete has a stable stance while competing, we want our infrastructure to be strong and stable, so that it doesn't fall over and leave the customer with a bad taste in their mouth. One of the first ways to achieve this is to control the rate of change in our systems.

Controlling when and how things happen in our systems is often called 'change management'. We want to manage any changes that are occurring and mitigate any risks that might impact the stability of our systems. Oftentimes this change management process relates to software being developed, but it's just as important for infrastructure management. In particular, you want very strict guidelines for when changes will be applied to systems, or when new software will be rolled out. You want good processes in place for handling ordinary operational business, and procedures for how non-standard changes get prioritized and implemented.

Perhaps an example would be good here, one that also relates to software development. In order to maintain stable systems you want to control how often you need to deploy an application, because every deployment is a window for risk to creep in. But at the same time, infrastructure needs to embrace the model of Continuous Deployment, because in many shops it's the future of software development. So how do you manage two seemingly opposite demands? With good change management controls. Instead of fighting against frequent software releases, infrastructure should support a strict change model that allows software to be deployed frequently, but with the least amount of disruption. Clear procedures and timelines can make the relationship between development and infrastructure very smooth. If everyone knows that there is a monthly (or weekly) release, and everyone knows exactly how release day will work (the procedure, the QA sign-offs, and everything else related to a release), then it becomes so ingrained in the workflow that velocity can be increased or decreased with minimal fuss.
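To make the idea of a strict change model concrete, here is a minimal sketch of a release gate in Python. The sign-off roles and freeze windows are hypothetical placeholders (in a real shop they would come from your ticketing or release-management system), but the shape of the check is the point: a deployment only proceeds when the agreed-upon approvals are in and the calendar says changes are allowed.

```python
from datetime import datetime, timezone

# Hypothetical sign-off roles; in practice these would come from a
# ticketing or release-management system, not a hard-coded set.
REQUIRED_SIGNOFFS = {"qa", "ops"}

def release_allowed(now, signoffs, freeze_windows=()):
    """Return True only if all sign-offs are in and we're not in a freeze.

    now            -- aware datetime of the proposed deployment
    signoffs       -- set of roles that have approved the release
    freeze_windows -- iterable of (start, end) datetime pairs during which
                      no changes may go out (holidays, peak traffic, etc.)
    """
    if not REQUIRED_SIGNOFFS <= set(signoffs):
        return False  # someone hasn't signed off yet
    return not any(start <= now < end for start, end in freeze_windows)

# Example: a release attempted inside a freeze window is rejected.
freeze = (datetime(2015, 12, 24, tzinfo=timezone.utc),
          datetime(2015, 12, 27, tzinfo=timezone.utc))
attempt = datetime(2015, 12, 25, tzinfo=timezone.utc)
print(release_allowed(attempt, {"qa", "ops"}, [freeze]))  # False: inside freeze
```

Because the gate is code rather than tribal knowledge, release day works the same way every time, which is exactly what lets the release cadence speed up or slow down without drama.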

But all the controls need to be in place for this to work right, which leads to the second major factor in stability: automation. In order for software deployments (to continue our example) to work right, you need to make the process as automated as possible. Automation allows for repeatability, and repeatability usually means you've achieved a level of stability. Even when tracking down bugs, being able to reproduce a bug means it's a stable bug, and much easier to locate and fix.

Every process that you want to repeat on a regular basis should revolve around pressing one button, running one command, or letting a scheduled job run. The amount of human intervention in any process should be kept to a bare minimum. As many contingencies as possible should be accounted for and mitigated preventatively, and then everything should be set to run with as light a touch as possible. Automation is often one of those things that companies want to achieve, but it often requires going back, after the fact, to add it in. Many times procedures and workflows develop over time, and it takes a lot of forethought to get everything automated right from the get-go. Since that often doesn't happen, it gets easy to keep ignoring it. But that's the wrong choice to make. For a system to be stable, automation needs to play a key role, a crucial role, in infrastructure management.
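The "one command" idea can be sketched very simply. The steps below are placeholders (real ones would run your tests, build, and deploy tooling), but the structure shows the principle: the steps are fixed in code, run in order, and stop at the first failure, so every run is identical with no human typing commands by hand.

```python
import subprocess
import sys

# Hypothetical deployment steps; each entry is an ordinary command line.
# Real steps would invoke your test runner, build tool, and deploy scripts.
STEPS = [
    ["echo", "running tests"],
    ["echo", "building artifact"],
    ["echo", "deploying to production"],
]

def run_pipeline(steps):
    """Run each step in order, stopping at the first failure.

    Returns True if every step succeeded, False otherwise. Because the
    steps live in code rather than in someone's head, every run is the
    same -- the repeatability that stability depends on.
    """
    for cmd in steps:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print("step failed: " + " ".join(cmd), file=sys.stderr)
            return False
    return True

if __name__ == "__main__":
    sys.exit(0 if run_pipeline(STEPS) else 1)
```

A wrapper like this is also the natural hook for a scheduled job: cron (or any scheduler) runs the one command, and the process needs no human touch at all.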

But how do you know if your system is stable? That's the final aspect I want to present regarding stability: monitoring. It doesn't do much good to have a large system with lots of moving parts if you have no idea how those parts are functioning. Good system monitoring is key to maintaining a stable system. For basic system stability I like to refer to what I call 'short-term monitoring' (my monitoring philosophy will be a whole different blog series hehe). Short-term monitoring is the view into the current system state, and involves quick alerting on issues that need to be addressed immediately. Good short-term monitoring will often let operational staff discover problems before the customer does. Proactive fixes are ALWAYS positive. So having a good view into a system is crucial. If you don't know what's going on, how can you fix it?
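A toy sketch of short-term monitoring, under the assumption that each check is a simple pass/fail probe and an alert is just a message handed to some notifier (pager, chat hook, email). The probe and alert functions here are stand-ins; real checks would hit health endpoints, disk thresholds, queue depths, and so on.

```python
def check_health(checks, alert):
    """Run each named check once and alert on any failure.

    checks -- dict mapping a check name to a zero-argument callable that
              returns True when healthy
    alert  -- callable taking a message string (pager, chat hook, etc.)
    Returns the list of failing check names.
    """
    failing = []
    for name, probe in checks.items():
        try:
            healthy = probe()
        except Exception:
            # A probe that blows up is itself a failing check.
            healthy = False
        if not healthy:
            failing.append(name)
            alert("ALERT: check '" + name + "' is failing")
    return failing

# Example with toy probes standing in for real health checks.
alerts = []
failing = check_health(
    {"web": lambda: True, "db": lambda: False},
    alerts.append,
)
print(failing)  # ['db']
```

Run on a short interval from a scheduler, a loop like this is the "view into the current system state": operations hears about the failing check within one polling cycle, usually well before a customer notices.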

The more a system evolves into a completely stable one, the better it can adapt to the future needs of the customer. A stable system and a highly available one are two important keys in infrastructure management, and they go a long way toward achieving that "hum" of a well-tuned engine that every infrastructure manager loves to hear.
