Author Archives: Paul Guth

About Paul Guth

Old Timey Web Ops guy. I think about cars and clouds, and how they could be faster, cheaper, and more resilient.

Change Management Workflows

In a previous post I covered the basic fields you need in a CR (Change Request) and promised to look at optional info that more heavyweight CM processes might use.  In this post I’m going to focus on workflows: approvals, reviews, and validations.  While there are other fields that you can find on a CR, it’s really the workflows that are a critical part of CM.  The basic CR we looked at last time could be used even if only one person is involved in the entire process – but typically multiple people are involved.  One person might request a change, someone else might execute it, someone else might validate that it worked, etc.  If you have workflows like these that involve multiple people, you really want to have those workflows recorded in your ticketing system – so that you can see when one step is completed (and by whom) and the next has begun, and so that you can tell from your ticketing system what stage a ticket is in and who currently owns it.  Please note that I’m not saying you HAVE to incorporate these flows into your CM process – they are appropriate in some places but not in all.  As I said in earlier posts, you should have the right amount of process and paperwork that’s appropriate for your environment – no more, no less.

Here in fantastic ASCII format is a basic CM process, showing how review/approval/validation workflows might be linked:

Create -> Review -> Approve -> Schedule -> Execute -> Validate

Each of these steps could potentially be performed by a different person or role.  If this seems crazy to you, note that in some cases it may be required for compliance or governance reasons that different people be involved in the various stages of the process.

Create

A request for a change often comes from a non-technical resource.  It might be a customer, a project manager, a product manager, etc.  The request will not have any execution details in it, it will just say what is being requested and why.  This will then be picked up by a technical resource that will create the actual procedure to be executed, including rollback and validation steps – hopefully with an eye towards automating the change in the future.  🙂

Reviews

A technical review step is generally performed by a peer of the person who created the execution procedure and/or will be executing it.  The purpose of the review is to ensure that the procedure is complete and correct and doesn’t require any special or unique knowledge on the part of the executor.

There may also be a business review step, where a peer of the original requestor evaluates the “why” behind the change request to ensure that it is appropriate to execute.

The output of a review step may involve changes to the procedure itself or sending the CR back to the original author for modification, in which case it will come back through the review step after those modifications are made.

Approval

The approval step is a gate prior to execution.  It is often combined with the review step above, but may be separated – usually this is done when technical reviews are done by SMEs in the relevant systems or technologies, but those SMEs may not have larger/wider knowledge about all the systems that could be impacted.  In such cases there will typically be a separate approval (which may be done by a committee like a CAB) signifying that the change has been reviewed in a larger context and is deemed safe and appropriate for the system as a whole.

Schedule

Changes may be executed at the discretion of the executor, or there may be more central scheduling for some or all changes.  One typical case is when there is a defined maintenance window where changes that impact customer-facing services are performed.  These windows are often planned and scheduled by a specific person or role to ensure that the limited time of the window is used most efficiently while guarding against change collisions.  (Executing multiple changes simultaneously can lead to great difficulties with diagnosis if something goes wrong – since you’re not sure which of the changes is the cause of the problem).  Even outside of maintenance windows, changes may be centrally scheduled for other reasons.

Execution

Someone actually performs the change.  Nothing special to see here.

Validation

After the change procedure has been performed, there may be a handover to another individual (typically a peer of the executor) to perform technical validation of the change.  A second pair of eyes can sometimes see unintended effects that otherwise would have been missed, and also eliminates any temptation to “paper over” what appears to be an inconsequential deviation from the CR (“it was just a one character typo!”).  This validation should be tightly coupled in time to the change execution as until this validation is complete the system is potentially in a bad state.

Separately, there may be a business validation step to ensure that the intended effect has occurred.  This could potentially be done at some remove in time from the execution and technical validation.

Record your workflows!

Arguably the most valuable part of using a ticketing system to track CRs is how easy it makes it to handle workflows.  You can see at a glance where something is in the process and who owns it.  You can record the results of each step in a workflow, including who performed it and timestamps.  And you can use the ticketing system to actively manage the workflow, including auto-assignment and sending reminders or SLA warnings if appropriate.
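
For example, the recorded workflow for a single CR might end up looking something like this in your ticketing system (a made-up CR, purely for illustration):

CR-1234: Upgrade haproxy on lb01/lb02
  Created    2011-05-02 09:14  jsmith    (requestor)
  Reviewed   2011-05-02 13:40  apatel    (technical review – procedure OK)
  Approved   2011-05-03 10:05  CAB       (weekly change board)
  Scheduled  2011-05-05 22:00  maintenance window
  Executed   2011-05-05 22:12  mlee
  Validated  2011-05-05 22:31  rgarcia   (technical validation passed)
  Status     Closed

Every hand-off is visible at a glance: who did what, and when.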


Getting Ubuntu to send through Gmail for Nagios

Recently I set up a Nagios installation running in a VM on my laptop (Virtualbox on a MacBook Pro).  This was a quick and easy way for me to get some monitoring going without needing a real server somewhere.  (I know, I know – “Use the cloud dumbass!”  It’s complicated).

Anyway, the trickiest part was getting the notifications to actually go to my gmail.com account. Gmail doesn’t accept regular old SMTP connections.  You need to use authentication over TLS (or SSL).  I didn’t know how to set this up, but I found this great page that explains most of it:

Mark Sanborn’s guide to getting postfix to send through Gmail

This guide was written several years ago and while it almost worked for Ubuntu 10.10, two changes were required.

The first change was to use the new Equifax CA cert because the one that Gmail is currently using isn’t part of the postfix config in 10.10, although the right cert is already on the system. Fixing this is very straightforward:

cat /etc/ssl/certs/Equifax_Secure_CA.pem >> /etc/postfix/cacert.pem

The second change is to tell the mail server to actually use the transports file that the guide has you set up. This must have already been there in 2007 but wasn’t there for me. This was a one-line change to /etc/postfix/main.cf:


root@gunther-VirtualBox:/etc/postfix# rcsdiff main.cf
===================================================================
RCS file: RCS/main.cf,v
retrieving revision 1.1
diff -r1.1 main.cf
19a20,22
> # use a transport file to send stuff to google for gmail.com
> transport_maps = hash:/etc/postfix/transport
>

(Yeah I still use RCS when I’m modifying config files. Try it!
apt-get install rcs ; mkdir RCS ; ci -l file
You’ll love it).
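
One more note on the transport file: postfix reads the compiled hash: table rather than the text file itself, so if the guide’s steps didn’t already have you do it, rebuild the map and reload after editing /etc/postfix/transport:

postmap /etc/postfix/transport ; postfix reload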

By following the instructions at the link and adding these two steps, my VM was sending email to my gmail.com account and everybody was happy.


Anatomy of a Change Request – The Basics

In most IT environments you’ll find some kind of Change Request (CR) form.  Some of them are simple forms for simple workflows and some of them…well, aren’t.  What does a typical CR look like?  If you’re creating a Change Management (CM) process for your organization (and you should have one!), what should your CR look like?

In this post I’ll talk about the very basic information that should be in every CR. In a subsequent post I’ll go through some of the optional information that more heavyweight CM processes may use.

A minimal CR

Any CR should have at least the following information:

  • Title
  • Requestor
  • Executor
  • Execution Time
  • Purpose
  • Procedure (including execution, validation, and rollback)
  • Results

Let’s go through these one by one:

Title
This is a short (less than one line) summary of the CR, used mainly for displaying CRs in lists.
Requestor
Who asked for this change? This is important to have in case there are any questions about what should be done, or decisions that need to be made between different options. If you don’t know who requested it, you can’t get answers to those questions.
Executor
Who is actually doing the change? This is important to know for later troubleshooting purposes – if something goes wrong you’ll want to consult the person who made the change as they will have the best knowledge of what happened and if anything strange occurred.
Execution time
For troubleshooting it is critical to know exactly when changes took place, so you can correlate with service impacts or other important events. (Your CM process may record execution time as part of the change workflow itself, in which case it’s not critical to have it actually in the CR – but it needs to be somewhere).
Purpose
Why is this change being made? What is the business value of doing this? This is the field I see missing most often. Everyone involved in the CM process should understand the reason why changes are being made – and those reasons should be tied to the needs of the business. This understanding allows everyone to make informed decisions at every stage about priorities, strategies, tactics, etc. Without this understanding, the people making the changes are disconnected from the business and become disengaged and jaded, eventually leading to poor decisions.
Procedure (execution, validation, rollback)
What are you going to do? What order are you going to do it in? How are you going to make sure it worked, and didn’t break anything else? What are you going to do if something goes wrong? There are many different viewpoints on what level of detail and rigor this procedure needs to have – there is no one right answer but I always think of every CR as a candidate for future automation, and the more detailed, specific, and complete the procedural section of the CR is, the easier it will be to automate in the future.
Results
What happened when the change was executed? Typically this part of the CR will contain pasted output from execution or validation commands, or screenshots showing the effective change, etc. If there are any problems later this prevents wasted time while people ask “did you do _____” or “what does ______ command show?” A tiny amount of work to cut’n’paste some info here can save a huge amount of heartache later.

This may seem like a lot of information for a simple CR, but in practice it doesn’t take very long to fill these out for simple changes. And for complicated changes, you shouldn’t be worried about the extra overhead of typing – if you’re not thinking through and planning your complicated changes, you’re taking big risks with your business.
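
To give a sense of scale, here’s what a filled-out CR for a trivial change might look like (every detail below is made up, purely for illustration):

Title: Raise Apache MaxClients on web01 and web02
Requestor: J. Smith (product)
Executor: M. Jones (ops)
Execution Time: 2011-04-12 22:05 UTC
Purpose: Requests are queueing at peak traffic; raising MaxClients uses headroom we already have on these boxes until the new web servers arrive.
Procedure:
  1. Edit /etc/apache2/apache2.conf, change MaxClients 150 -> 256
  2. apache2ctl configtest, then apache2ctl graceful
  3. Validate: load the homepage from outside, watch the error logs and load average for 15 minutes
  4. Rollback: set MaxClients back to 150, apache2ctl graceful
Results: Changed and reloaded at 22:05, configtest clean, homepage loading normally, load average steady at 1.2. No rollback needed.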

Where does a CR form live?

When your CM process gets started, CR forms will often be simple documents – they could be in GDocs (this is how we do it at my company today), they could be in a wiki, or they could live directly in the ticketing system that manages your CM workflow (if you have one). What’s important is that the CRs be easy to fill out and easy to find later.

How do I start using a CR form?

Once you’ve created your CR form, the next step is simple. Just start using it for your changes! Ideally the person in charge of your infrastructure already understands the value of CM, and will be eager to have everyone start using the CR. If that’s not the case, use the CR form yourself, and ask others to use it. Even if no one else does, at some point there will be an incident that will make the value of using CRs obvious to everyone – and when that happens you’ll be ready.


Why Change Management?

Recently I had the opportunity to create a template for infrastructure change requests at work. Based on the reaction from some of my co-workers, I thought it might be valuable to explain what change requests are for. In a subsequent post I’ll go through what a basic change request looks like.

Change Requests are part of the Change Management (CM) process. Now don’t get freaked out, that doesn’t mean we need forms filled out in triplicate sent through multiple people for review and approval. Processes can have as much or as little heft as required to meet the needs of your organization. But if your infrastructure’s availability is important to you, you should have a CM process. We are a small startup, so our CM process is very lightweight. Here are the main tenets:

  1. Think about a change before you start executing it
  2. If something is high-risk, test it before you do it for real
  3. Know how you’re going to handle it if something goes horribly wrong
  4. Record that you made the change so people can find it later if they need to (for example, when troubleshooting a problem)

Point 1 (think before you execute) is really philosophical. After many years of doing production web operations, I’m convinced based on the empirical evidence that you’re far more likely to screw something up if you just start cowboying your way through a change rather than planning it ahead of time. You see this point of view in other contexts as well (“plan your flight, fly your plan”). Many times when planning a change, I have run into something new during the planning itself that I would otherwise have first encountered during execution – something that, in the heat of the moment, would have caused me great panic. Better to hit that and work through it when you’re not stressed out in the middle of a big production change. For me one of the most important parts of having a written Change Request is that it enforces thinking through a change before you execute it.

Point 2 (test high-risk changes) may sound obvious but there are certainly nuances. How do you determine what’s high-risk and where do you draw the line? How much time do you spend doing testing vs simply rolling back a change if it does cause problems? I’ve found that it’s best to leave these decisions in the hands of the people executing the changes – but your CM process needs to remind them to ask these questions, think about the answers, and use their best judgment.

Point 3 (how to handle problems) is not theoretical. If your job is web operations, you will be involved with a change that goes horribly wrong. It just happens. When it happens, if you have not thought about it ahead of time you will be up a smelly brown creek without a paddle. This is when panic sets in, and in the heat of those moments some spectacularly bad decisions can be made which could make the situation even worse. Spending some time prior to execution thinking through potential failure scenarios allows you to execute your rollback plan calmly and effectively. Which way do you prefer?

Point 4 (change recording) is absolutely critical unless you a) never forget anything and b) are the only person involved in the support of your infrastructure. In my experience, the majority of thorny production problems are caused by changes, usually when they introduce latent faults that don’t manifest as incidents for a while. When diagnosing such a problem, it is critical that you know what changed when, and that is precisely the purpose of change recording. There are a million ways to do this, from sending emails to a “changelog” alias or putting change summaries in IRC to having a CMDB with change records in it. Less important than the specific mechanism(*) is that you have a mechanism, that people use it religiously, that it’s easy to search for changes at particular times and to particular systems, and that everyone knows where to find it and how to use it. What seems like busywork when you’re performing a change (“Why do I have to write this down? It’s already done!”) will pay giant dividends when it prevents someone from spending tons of time reverse engineering what happened while the service is down.

(*) – Note: one thing you really should leverage is version control for your CM and recording processes – it’s invaluable for being able to track a sequence of changes and to easily pull back a previously working configuration.


Monitoring – Getting the Most Bang For Your Buck With WCCAM

When you run a service that others depend on (i.e. you have customers) you have a responsibility. Your service should work when people want to use it. If it doesn’t you’re letting them down – and likely costing yourself money. But you have limited resources to invest in keeping the service up – how do you spend them most wisely? What’s the best bang for your buck when it comes to monitoring? (Hint: it’s probably not what your monitoring system is designed for!) Let’s look at your options:

Infrastructure monitoring

Your typical monitoring system solution will tell you lots about your infrastructure, meaning the servers and network devices that your services run on. You’ll have ping tests to make sure servers are alive and disk space checks to make sure that they can write new information that they need to. The monitoring system will also record lots of system-level metrics for you to look at: how busy your servers are, how heavily utilized your network links are, etc. For a service with lots of customers, you’ll have a lot of infrastructure. It may be dozens, hundreds, or thousands of devices. A decent monitoring system will tell you right away when any of those devices fail or are having serious problems.

But that’s not good enough! In fact, it’s often useless. Unfortunately this is the kind of thing most monitoring systems are really good at. But how valuable is it to check 3 times every minute that your disks are still 77% full? So you can ping a server – do your customers care?

Infrastructure monitoring tells you if a server or a router goes down.  Do your angry customers typically complain that “your router is down!” or “your database server is down!” when they call you about problems? If the answer is no – read on.

Application Monitoring

On top of your infrastructure you have applications – the software that provides the services your customers consume. Good application monitoring will involve looking at individual processes on your servers, and looking at the operational interfaces those processes provide to you: primarily logfiles and statistics. If your application monitoring is decent you’ll know right away when any of your software gets into a bad state.

That’s not good enough either!  The service you provide is not the software.

Do your angry customers typically complain that “the indexing queue is really backed up” or that “the shopping cart middleware has stopped accepting requests” when they call you about problems? If the answer is no – read on.

What Customers Care About

Your customers use what you’ve created for a reason. It provides a benefit (or benefits) to them. That is what you want to be monitoring. What would your customers say if you asked them “Why do you use our service? What does it do for you?” Take that answer and figure out how to monitor it. Maybe the answer is “I use your service to make payments to people I buy things from.” OK, then your monitoring system needs to be able to measure making payments. (NOTE: not the servers that are involved in making payments – not the software that is involved in making payments, but making payments is what you need to measure and monitor). If the answer is “I use your service to read about what my friends are doing” then your monitoring system needs to be able to measure people reading about what their friends are doing. After lots of searching in vain for a decent name, I call this “What Customers Care About Monitoring” or WCCAM (rhymes with Wiggum, like the police chief in The Simpsons). This is what you really care about – that the value you provide to your customers is working.

These are probably also exactly the things your customers do complain to you about. “I can’t make a payment!” “I can’t read the status updates from Soandso!” If you listen to customer support calls, these are the kinds of things you’ll hear. In fact, in lieu of asking customers directly what they use your service for, the next best thing is to ask your customer support folks what people complain about – that’s an excellent pointer to what your critical services are from a customer perspective.

Measure your services – measure your value!

Once you’ve identified the services your customers care about (like making payments, or reading updates from their friends), figure out what characteristics of those services are critical. Possibilities include:

  • performance – response time, load time – how quickly can they get to it?
  • functional correctness – is it doing what it’s supposed to?
  • availability – can they reach it when they want to?

Then figure out how to measure and monitor those characteristics.  I know, I know – that’s not easy.  That’s why you get paid to do it!
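
To make that a little more concrete, here’s a minimal sketch of what a WCCAM-style check could look like as a Nagios plugin – just a shell script that performs the customer-visible action and times it. The URL, the expected text, and the threshold are made-up placeholders; the hard (and valuable) part is picking the right transaction to exercise for your service:

#!/bin/sh
# Hypothetical WCCAM check: can a customer reach the payment page, and how fast?
URL="https://www.example.com/payments"   # placeholder for the customer-facing action
EXPECT="Make a payment"                  # text that should be there when it's working
WARN_SECS=2                              # warn if it takes longer than this

BODY=$(curl -s --max-time 10 -w '\nTIME:%{time_total}' "$URL") || { echo "CRITICAL: request failed"; exit 2; }
TIME=$(echo "$BODY" | sed -n 's/^TIME://p')
echo "$BODY" | grep -q "$EXPECT" || { echo "CRITICAL: page loaded but expected content is missing"; exit 2; }
if [ "$(echo "$TIME $WARN_SECS" | awk '{print ($1 > $2)}')" = "1" ]; then
    echo "WARNING: payment page took ${TIME}s"; exit 1
fi
echo "OK: payment page loaded in ${TIME}s"; exit 0

The exit codes (0/1/2) follow the standard Nagios OK/WARNING/CRITICAL convention, so a check like this plugs into Nagios the same way any other host or service check does.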

What does this do for me?

Effective WCCAM pays off in much higher availability for your services, which means happier and more satisfied customers.  It does this by providing much faster detection of customer-impacting problems. If you have ever had an outage or disruption that was reported to you by your customers rather than your monitoring system then you already know the value of WCCAM. There are many situations where all of your individual devices and applications are up and running, but the overall service is not working because something is wrong with the connections between those services or with some external dependency those services have. WCCAM tells you about these problems – infrastructure and application monitoring do not. This earlier detection can dramatically reduce your MTTR.

Effective WCCAM also can lead to faster triage and diagnosis, again reducing MTTR. Since what you are measuring is what customers care about, it’s much easier to distinguish an important problem from a trivial one – letting you prioritize what you’re going to do intelligently.

WCCAM points the way

Let’s review:

  • What you should monitor is what your customers care about.
  • So monitor the services you provide to your customers – not (just) your infrastructure.
  • WCCAM will let you find and address problems more quickly, leading to happier customers, a happier business, and a happier universe.

WCCAM will also provide great data to inform the decisions you make down the road about which services need investment in stability fixes, what your customers’ experiences with your service are, and how you stack up against your competition.  WCCAM is not the only monitoring you need – you still need to have infrastructure and application monitoring.  But when you have limited resources and have to prioritize and make choices – make sure you’re putting WCCAM at the top of the list, because it gives you the best bang for your buck.

Reliability vs Availability and the Magic of MTTR

When talking about online systems, you’ll often encounter the terms “availability” and “reliability.” As in “The CloudNetanitor 10000 provides industry-leading reliability” or “We built the Securovator LX system using exclusively high-availability components.” These two concepts are closely linked but different: each of them has its own specific definition.

Availability

Availability is how often something works and works correctly when you try to use it. It’s generally expressed as a percentage, like “SuperLogBuzzer achieved 99.94% availability in the month of June.” The simplest way to calculate the availability is to take the number of successful uses of the service, and divide it by the total attempts. Mathematically:

(successes) / (successes + failures)
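
For example (made-up numbers): if customers made 100,000 attempts to use SuperLogBuzzer in June and 60 of them failed, availability for the month was 99,940 / 100,000 = 99.94%.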

Reliability

Reliability is how long something works without breaking. Unlike availability it is generally expressed in time. So if you manufacture engines and your engines on average last for two years of continuous operation, the reliability of your engines is two years. Generally reliability is expressed as an average since every engine is going to be different – some may last six months, some may last six years. There’s even a specific term (and acronym) for this average: MTTF, standing for “Mean Time To Failure.” You see this a lot for things like disk drives, which may say “measured MTTF of 6,000 hours.” Some systems will use MTBF or Mean Time Between Failures – that’s almost but not exactly the same thing – close enough for now.

So which is better?

Availability is what really matters when talking about services you use or services you provide to others. If you can’t reach http://www.google.com, do you really care how long it’s been since the last time it broke? Of course not – you just care that it’s not working right now when you want to use it.

The underlying story here is that availability captures both reliability (MTTF) and another critical concept: mean time to repair (MTTR). Whether or not you can use Google when you want to depends on two things: how long Google goes in between breaking, and also how long it stays broken for when it does break. Both pieces affect your experience with the service, and availability captures both.
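
If you want to put a rough number on that relationship, the usual back-of-the-envelope formula is:

availability = MTTF / (MTTF + MTTR)

So (again, made-up numbers) a service that runs 720 hours between failures but takes 7 hours to recover each time gets you 720 / 727, or about 99%. Keep the same MTTF but cut the repair time to 45 minutes and you’re at roughly 99.9% – a big jump in availability without touching reliability at all.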

What does this mean for me?

Focus on MTTR. If your system is reasonably well-designed and you didn’t cut a bunch of corners, your MTTF is almost exclusively determined by the architecture and the intrinsic characteristics of the components (hardware and software). This means it’s difficult to make significant changes in MTTF without either rearchitecting, spending lots of money on more redundancy (which can cause its own reliability problems by adding complexity), or changing to an underlying platform with different reliability characteristics. By contrast, there are often large gains to be realized in MTTR without nearly as much investment by relatively simple changes in tools, techniques, and organization.

Tracking MTTR

How do you focus on MTTR? First, track it. If you don’t already have somewhere you record every incident – get one. It doesn’t really matter what you use as long as it’s easy to update and captures structured data you can analyze once you’ve collected some. Once you have that, make sure you’re recording data that allows you to determine MTTF and MTTR for your services.

MTTF is a straightforward number – it’s the time between when your last service interruption ended and when your next service interruption starts. MTTR on the other hand can be further subdivided, and your recording system should allow you to track the following components of it:

  1. Time to detect
  2. Time to triage
  3. Time to diagnose
  4. Time to fix

Make sure your tracking enables you to measure each of these components so you can see where your biggest opportunities for improvement are, and so you can measure the effects of what you’re doing to make things better.
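
To make those components concrete, here’s a made-up timeline: a bad change goes out at 02:00, alerts fire at 02:10 (10 minutes to detect), the on-call engineer works out which service is actually affected by 02:25 (15 minutes to triage), traces it to the change by 02:50 (25 minutes to diagnose), and finishes rolling it back at 03:05 (15 minutes to fix). That’s 65 minutes of downtime, and 50 of those minutes were spent before anyone even started fixing anything – which tells you exactly where to invest first.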

Monitoring at the service level

To reduce detection times, make sure you have monitoring at the service level. There’s a whole different post here, but the essence is that you want to monitor what your customers are paying for – they don’t care if your database is up or down and neither should you unless it’s affecting the service. I’ve seen many cases where every piece of infrastructure is extensively monitored and yet the entire service can go down without any alerts – because it’s the interactions between those services that were affected – and those weren’t monitored.

Have Data. Use Data.

Triage and diagnosis times can be greatly reduced by having the right information. Hopefully your applications and systems are already recording lots of useful data in a place where you can easily see and analyze it. The most important information you need for diagnosing tricky problems is a clear understanding of the dependencies in your system. You should have a data model of your infrastructure somewhere – is it easily visualized? Better still, is there an API to it so your tools can not only show it to you but use it themselves when making decisions?

Build your tools to help you do the right thing

How long it takes to fix a problem once you know the cause (or before you know the cause, as long as you know where the problem is) is largely dependent on the design of the system and the tools to control it. When you’re building your tools make sure you think through the typical use cases – make it easy to do the right thing and try to prevent people from making the wrong choices – even if that does limit the flexibility of the tool.

Leverage automation

Automation can help you at each stage: detection, triage, diagnosis, repair. You want people to spend their time making decisions and once those decisions are made, the computers should do most of the work.

Make things better, one outage at a time

If you have a good system, focusing on MTTR will give you the biggest bang for your buck when it comes to increasing your availability. If you really want to improve MTTR, make sure you learn everything you can from the outages you do encounter. There is no better way to understand the weaknesses of your system than to examine what caused and contributed to real failures in production. Don’t waste a single opportunity!