In the first installment of this series, I outlined my Infrastructure Management methodology, called HiSSS. In that first posting, we talked about the concept of High Availability. In this segment, I'm going to tackle the notion of Stability.
Stability is pretty self-explanatory: simply put, you don't want a system that is repeatedly tipping over. Just as we might say an athlete has a stable stance while competing, we want our infrastructure to be strong and stable, so that it doesn't fall over and leave the customer with a bad taste in their mouth. One of the first ways to achieve this is to control the rate of change in our systems.
Controlling when and how things happen in our systems is often called 'change management'. We want to manage any changes that are occurring and mitigate any risks that might impact the stability of our systems. Oftentimes, change management is discussed in the context of software development, but it's just as important for infrastructure management. In particular, you want very strict guidelines as to when changes will be applied to systems or when new software will be rolled out. You want good processes in place for handling ordinary operational business, and procedures for how non-standard changes get prioritized and implemented.
Perhaps an example would be good here, and one that also relates to software development. To maintain stable systems, you want to control how often you deploy an application, because every deployment is a window for risk to creep in. But at the same time, infrastructure needs to embrace the model of Continuous Deployment, because in many shops it's the future of software development. So how do you manage these two seemingly opposite demands? With good change management controls. Instead of fighting against frequent software releases, infrastructure should support a strict change model that allows software to be deployed frequently, but with the least amount of disruption. Clear procedures and timelines can make the relationship between development and infrastructure very smooth. If everyone knows that there is a monthly (or weekly) release, and everyone knows exactly how release day will work, from QA sign-offs to everything else related to the release, then the process becomes so ingrained in the workflow that velocity can be increased or decreased with minimal fuss.
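To make that concrete, here's a minimal sketch of one small piece of such a change model: a gate that only lets a deployment proceed inside the agreed release window. The window values and the deploy hook are hypothetical placeholders; the point is that the policy lives in code instead of in someone's memory.

```python
from datetime import datetime, time

# Hypothetical policy: releases are only allowed during the agreed-upon
# window (e.g. Tuesdays, 10:00-12:00), when dev, QA, and ops are all on hand.
RELEASE_DAY = 1            # Tuesday (Monday == 0)
WINDOW_START = time(10, 0)
WINDOW_END = time(12, 0)

def in_release_window(now: datetime) -> bool:
    """Return True if 'now' falls inside the agreed release window."""
    return (now.weekday() == RELEASE_DAY
            and WINDOW_START <= now.time() <= WINDOW_END)

def gated_deploy(deploy_fn):
    """Run the deployment only if we're inside the release window."""
    now = datetime.now()
    if not in_release_window(now):
        raise RuntimeError(
            f"Refusing to deploy at {now:%A %H:%M}; "
            "outside the agreed release window."
        )
    deploy_fn()

if __name__ == "__main__":
    # Stand-in for the real deployment step.
    gated_deploy(lambda: print("deploying..."))
```

Because the gate refuses loudly rather than silently rescheduling, an out-of-window deploy becomes a conscious, documented exception rather than a habit.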
But all the controls need to be in place for this to work right, which leads to the second major factor in stability: the notion of automation. For software deployments (to continue our example) to work right, you need to make the process as automated as possible. Automation allows for repeatability, and repeatability usually means you've achieved a level of stability. Even when tracking down bugs, being able to reproduce a bug means it's a stable bug, and much easier to locate and fix.
Every process that you want to repeat on a regular basis should come down to pressing one button, running one command, or letting a scheduled job run. The amount of human intervention in any process should be kept to a bare minimum. As many contingencies as possible should be accounted for and mitigated preventatively, and then everything should be set to run with as light a touch as possible. Automation is often one of those things that companies want to achieve, but it often requires going back, after the fact, to add it in. Many times, procedures and workflows develop over time, and it takes a lot of forethought to get everything automated right from the get-go. Since that often doesn't happen, it gets easy to keep ignoring it later. But that's the wrong choice to make. For a system to be stable, automation needs to play a key role, a crucial role, in infrastructure management.
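To illustrate the 'one command' idea, here's a hypothetical Python driver that runs a deployment as an ordered list of steps and stops at the first failure, so a human only steps in when something actually goes wrong. The step commands are placeholders (simple echo stand-ins) for whatever your environment really runs.

```python
import subprocess
import sys

# Hypothetical deployment pipeline: each step is a shell command that is
# expected to be idempotent, so a failed run can simply be re-run.
STEPS = [
    ("pull artifacts",   ["echo", "fetching build artifacts"]),
    ("run migrations",   ["echo", "applying database migrations"]),
    ("restart services", ["echo", "restarting app servers"]),
    ("smoke test",       ["echo", "hitting health-check endpoints"]),
]

def run_pipeline() -> int:
    for name, cmd in STEPS:
        print(f"--> {name}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Stop immediately; a human investigates, fixes, and re-runs.
            print(f"step '{name}' failed (exit {result.returncode})",
                  file=sys.stderr)
            return result.returncode
    print("deployment complete")
    return 0

if __name__ == "__main__":
    sys.exit(run_pipeline())
```

Making each step idempotent is the design choice that matters most here: it means a failed run can be restarted from the top without anyone having to reason about partial state.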
But how do you know if your system is stable? That's the final aspect I want to present regarding stability: the notion of monitoring. It doesn't do much good to have a large system with lots of moving parts if you have no idea how those parts are functioning. Good system monitoring is key to maintaining a stable system. For basic system stability, I like to refer to what I call 'short-term monitoring' (my monitoring philosophy will be a whole different blog series hehe). Short-term monitoring is the view into the current system state, and involves quick alerting on issues that need to be addressed immediately. Good short-term monitoring will often let operational staff discover problems before the customer does. Proactive fixes are ALWAYS positive. So having a good view into a system is crucial. If you don't know what's going on, how can you fix it?
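For a flavor of what short-term monitoring looks like at its simplest, here's a hypothetical sketch that polls a health endpoint on a short interval and alerts the moment a check fails. The URL, interval, and alert hook are all assumptions; a real installation would use a proper monitoring stack, but the shape is the same: check the current state, and shout quickly when it's wrong.

```python
import time
import urllib.error
import urllib.request

# Hypothetical health endpoint and poll interval.
HEALTH_URL = "http://localhost:8080/health"
INTERVAL_SECONDS = 30

def check_health(url: str) -> bool:
    """Return True if the service answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def alert(message: str) -> None:
    # Stand-in for real paging/alerting (email, chat, on-call pager, etc.).
    print(f"ALERT: {message}")

if __name__ == "__main__":
    while True:
        if not check_health(HEALTH_URL):
            alert(f"{HEALTH_URL} failed its health check")
        time.sleep(INTERVAL_SECONDS)
```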
The more a system evolves toward complete stability, the better it can adapt to and meet the future needs of the customer. A stable system and a highly available one are two important keys in infrastructure management, and they go a long way toward achieving that "hum" of a well-tuned engine that every infrastructure manager loves to hear.