December 7, 2016

Day 7 - What we can learn about Change from the CIA

Written By: Michael Stahnke (@stahnma)
Edited By: Michelle Carroll (@miiiiiche)

I’ve recently spent some time on the road, working with customers and potential customers, as well as speaking at conferences. It’s been great. During discussions with customers and prospects, I’m fascinated by the organizational descriptions, behaviors, and policies.

I was reflecting on one of those discussions one week, when I was preparing a lightning talk for our Madison DevOps Meetup. I looked through the chicken-scratch notes I keep on talk ideas to see what I could whip up, and found a note about the CIA Simple Sabotage Field Manual. This guide was written in 1944, and declassified in 2008. It’s a collection of worst-practices to run a business. The intent of the guide is to have CIA assets, or citizens of occupied countries, slow the output of the companies they are placed in, thereby reducing those companies’ effectiveness in supplying enemies. Half of these tips and tricks describe ITIL.

ITIL comes from the Latin word Itilus which means give up.

ITIL and change control processes came up a lot over my recent trips. I’ve never been a big fan of ITIL, but I do think the goals it set out to achieve were perfectly reasonable. I, too, would like to communicate about change before it happens, and have good processes around change. Shortly thereafter is where my ability to work within ITIL starts to deviate.

Let’s take a typical change scenario.

I am a system administrator looking to add a new disk into a volume group on a production system.

First, I open up some terrible ticket-tracking tool. Even if you’re using the best ticket tracking tool out there, you hate it. You’ve never used a ticket tracking tool and thought, “Wow, that was fun.” So I open a ticket of type “change.” Then I add in my plan, which includes scanning the bus for the new disk, running some lvm commands, and creating or expanding a filesystem. I add a backout plan, because that’s required. I guess the backout plan would be to not create the filesystem or expand it. I can’t unscan a bus. Then I have a myriad of other fields to fill out, some required by the tool, some required by the company’s process folks (but not enforced at the form level). I save my work.

Now this change request is routed for approvals. It’s likely that somewhere between one and eight people review the ticket, approve the ticket, and move its state to ready for review, or ready to be talked about, or the like.

From there, I enter my favorite (and by favorite, I mean least favorite) part of the process: the Change Advisory Board (CAB). This is a required meeting that you have to attend, or send a representative to. They will talk about all changes, make sure all the approvals are in, make sure a backout plan is filled out, and make sure the ticket is in the ready to implement phase. This meeting will hardly discuss the technical process of the change. It will barely scratch the surface of an impact analysis for the work. It might ask what time that change will occur. All in all, the CAB meeting is made up of something like eight managers, four project managers, two production control people, and a slew of engineers who just want to get some work done.

Oh, and because reasons, all changes must be open for at least two weeks before implementation, unless it’s an emergency.

Does this sound typical to you? It matches up fairly well with several customers I met with over the last several months.

Let’s recap:

Effort: 30 minutes at the most.

Lag Time: 2+ weeks.

Customer: unhappy.

If I dig into each of these steps, I’m sure we can make this more efficient.

Ticket creation:

If you have required fields for your process, make them required in the tool. Don’t make a human audit tickets and figure out if you forgot to fill out Custom Field 3 with the correct info.

Have a backout plan when it makes sense, recognize when it doesn’t. Rollback, without time-travel, is basically a myth.

Approval Routing:

Who is approving this? Why? Is it the business owner of the system? The technical owner? The manager of the infra team? The manager of the business team? All of them?

Is this adding any value to the process or is it simply a process in place so that if something goes wrong (anything, anything at all) there’s a clear chain of “It’s not my fault” to show? Too many approvals may indicate you have a buck-passing culture (you’re not driving accountability).

Do you have to approve this exact change, or could you get mass approval on this type of change and skip this step in the future? I’ve had success getting DNS changes, backup policy modifications, disk maintenance, account additions/removals, and library installations added to this bucket.


CAB Meeting:

How much does this meeting cost? 12-20 people: if the average rate is $50 an hour per person, you’re looking at $600-$1,000 for every hour of meeting, just to talk about things in a spreadsheet or PDF. Is this cost in line with the value of the meeting?
What’s the most valuable thing that setup provides? Could it be done asynchronously?
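
The meeting-cost arithmetic is easy to rerun with your own numbers. This is a small sketch, not a costing model: the function name is made up for illustration, and the one-hour meeting length is an assumption (the figures above are hourly).

```python
def meeting_cost(attendees: int, hourly_rate: float, hours: float = 1.0) -> float:
    """Return the labor cost of one meeting, in dollars."""
    return attendees * hourly_rate * hours


# The range quoted above: 12 to 20 attendees at $50 per person per hour.
for attendees in (12, 20):
    cost = meeting_cost(attendees, hourly_rate=50)
    print(f"{attendees} people x $50/hour x 1 hour = ${cost:,.0f}")
```

Pass a different `hours` value if your CAB runs longer than an hour.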

Waiting Period:

Seriously, why? What good does a ticket do by sitting around for 2 weeks? Somebody could happen to stumble upon it while they browse your change tickets in their free time, and then ask an ever-so-important question. However, I don’t have stories or evidence that confirm this possibility.

CIA Sabotage Manual

Let’s see which of the CIA worst-practices to implement in an org (or perhaps best practices to ensure failure) this process hits:

Employees: Work slowly. Think of ways to increase the number of movements needed to do your job: use a light hammer instead of a heavy one; try to make a small wrench do instead of a big one.

This slowness is built into the system with a required duration of 2 weeks. It also requires lots of movement in the approval process. What if approver #3 is on holiday? Can the ticket move into the next phase?

When possible, refer all matters to committees, for “further study and consideration.” Attempt to make the committees as large and bureaucratic as possible. Hold conferences when there is more critical work to be done.

This just described a CAB meeting to the letter. Why make a decision about moving forward when we could simply talk about it and use up as many people’s time as possible?

Maybe you think I’m being hyperbolic. I don’t think I am. I am certainly attempting to make a point, and to make it entertaining, but this is a very real-world scenario.

Now, if we apply some better practices here, what can we do? I see two ways forward. You can work within a fairly stringent ITIL-backed system. If you choose this path, the goal is to keep the processes forced upon you as out of your way as possible. The other path is to create a new process that works for you.

Working within the process

To work within a process structured with a CAB, a review, and a waiting period, you’ll need to be aggressive. Most CAB processes have a standard change flow, or pre-approved change flow, for things that just happen all the time. Often you have to demonstrate a number of successful changes of a type to be considered for this type of change categorization.

If you have an option like that, drive toward it. When I last ran an operations team, we had dozens (I’m not even sure of the final tally) of standard, pre-approved change types set up. We kept working to get more and more of our work into this category.

The pre-approved designation meant it didn’t have to wait two weeks, and rarely needed to get approval. In cases where it did, it was the technical approver of the service/system who did the approval, which bypassed layers of management and production control processes.

That’s not to say we always ran things through this way. Sometimes, more eyes on a change is a good thing. We’d add approvers if it made sense. We’d change the type of change to normal or high impact if we had less confidence this one would go well. One of the golden rules was, don’t be the person who has an unsuccessful pre-approved change. When that happened, that change type was no longer pre-approved.

To get things into the pre-approved bucket, there was a bit of paperwork, but mostly, it was a matter of process. We couldn’t afford variability. I needed to have the same level of confidence that a change would work, no matter the experience of the person making the change. This is where automation comes in.

Most people think you automate things for speed, and you certainly can, but consistency was a much larger driver around automation for us. We’d look at a poorly-defined process, clean it up, and automate.

After getting 60%+ of the normal changes we made into the pre-approved category, our involvement in the ITIL work displacement activities shrank dramatically. Since things were automated, our confidence level in the changes was high. I still didn’t love our change process, but we were able to remove much of its impact on our work.

Automating a bad process just outputs crap...faster

Have a different process

At my current employer, we don’t have a strong ITIL practice, a change advisory board, or mandatory approvers on tickets. We still get stuff done.

Basically, when somebody needs to make a change, they’re responsible for figuring out the impact analysis of it. Sometimes, it’s easy and you know it off the top of your head. Sometimes, you need to ask other folks. We do this primarily on a voluntary mailing list — people who care about infrastructure stuff subscribe to it.

We broadcast proposed changes on that list. From there, impact information can be discovered and added. We can decide timing. We also sometimes defer changes if something critical is happening, such as release hardening.

In general, this has worked well. We’ve certainly had changes that had a larger impact than we originally planned, but I saw that with a full change control board and 3–5 approvers from management as well. We’ve also had changes sneak in that didn’t get the level of broadcast we’d like to see ahead of time. That normally only happens once for that type of change. We also see many changes not hit our discussion list because they’re just very trivial. That’s a feature.


If you work in an environment with lots of regulations preventing a more collaborative and iterative process, the first thing I encourage you to do is question those constraints. Are they in place for legal coverage, or are they simply “the way we’ve always worked?” If you’re not sure, dig in a little bit with the folks enforcing regulations. Sometimes a simple discussion about incentives and what behaviors you’re attempting to drive can cause people to rethink a process or remove a few pieces of red tape.

If you have regulations and constraints due to government or industry policies, such as PCI or HIPAA, then you may have to conform. One of the common controls in those types of environments is that people who work in development environments may not have access to, or push code into, production. If this is the case, dig into what that really means. I’ve seen several organizations determine those constraints based on how they were currently operating, instead of what they could be doing.

A common rule is that developers should not have uncontrolled access to production. Oftentimes companies take that to mean they must restrict all access to production for developers. If you focus on the uncontrolled part instead, you may find different incentives for the control. Could you mitigate risks by allowing developers to perform automated deployments and to have read access to logs, but not a shell prompt on the systems? If so, you’re still enabling collaboration and rapid movement, without creating a specific handover from development to a production control team.


The way things have always been done probably isn’t the best way. It’s a way. I encourage you to dig in, and play devil’s advocate for your current processes. Read a bit of the CIA sabotage manual, and if it starts to hit too close to home, look at your processes, and rethink the approach. Even if you’re a line-level administrator or engineer, your questions and process concerns should be welcome. You should be able to receive justification for why things are the way they are. Dig in and fight that bureaucracy. Make change happen, either to the computers or to the process.


December 6, 2016

Day 6 - No More On-Call Martyrs

Written By: Alice Goldfuss (@alicegoldfuss)
Edited By: Justin Garrison (@rothgar)

Ops and on-call go together like peanut butter and jelly. It folds into the batter of our personalities and gives it that signature crunch. It’s the gallows from which our humor hangs.

Taking up the pager is an ops rite-of-passage, a sign that you are needed and competent. Being on-call is important because it entrusts you with critical systems. Being on-call is the best way to ensure your infrastructure maintains its integrity and stability.

Except that’s bullshit.

The best way to ensure the stability and safety of your systems is to make them self-healing. Machines are best cared for by other machines, and humans are only human. Why waste time with a late night page and the fumblings of a sleep-deprived person when a failure could be corrected automatically? Why make a human push a button when a machine could do it instead?

If a company was truly invested in the integrity of its systems, it would build simple, scalable ones that could be shepherded by such automation. Simple systems are key, because they reduce the possible failure vectors you need to automate against. You can’t slap self-healing scripts onto a spaghetti architecture and expect them to work. The more complex your systems become, the more you need a human to look at the bigger picture and make decisions. Hooking up restart and failover scripts might save yourself some sleepless nights, but it wouldn’t guard against them entirely.
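
To make the “restart and failover scripts” idea concrete, here is a minimal sketch of a watchdog that tries an automated fix first and only wakes a person when that fails. Everything in it is hypothetical: `is_healthy`, `restart`, and `page_human` stand in for whatever health check, restart hook, and paging integration you actually have.

```python
from typing import Callable


def heal_or_escalate(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    page_human: Callable[[str], None],
    max_restarts: int = 3,
) -> bool:
    """Try automated restarts first; page a person only if they all fail.

    Returns True if the service is healthy without human involvement.
    """
    if is_healthy():
        return True
    for _ in range(max_restarts):
        restart()
        if is_healthy():
            return True
    # Automation has run out of ideas, so now a human is worth waking up.
    page_human(f"service still unhealthy after {max_restarts} restarts")
    return False


# Demo with a fake service that recovers after its second restart.
restart_count = {"n": 0}


def fake_restart() -> None:
    restart_count["n"] += 1


recovered = heal_or_escalate(
    is_healthy=lambda: restart_count["n"] >= 2,
    restart=fake_restart,
    page_human=print,
)
print("recovered without paging:", recovered)
```

Note how little this covers: it handles one failure mode of one service, which is exactly why the simplicity of the underlying system matters.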

That being said, I’m not aware of any company with such an ideal architecture. So, if not self-healing systems, why not shorter on-call rotations? Or more people on-call at once? After all, going 17 hours without sleep can be equivalent to a blood alcohol concentration of 0.05%, and interrupted sleep causes a marked decline in positive mood. Why trust a single impaired person with the integrity of your system? And why make them responsible for it a week at a time?

Because system integrity is only important when it impacts the bottom line. If a single engineer works herself half-to-death but keeps the lights on, everything is fine.

And from this void, a decades-old culture has arisen.

There is a cult of masochism around on-call, a pride in the pain and of conquering the rotating gauntlet. These martyrs are mostly found in ops teams, who spend sleepless nights patching deploys and rebuilding arrays. It’s expected and almost heralded. Every on-call sysadmin has war stories to share. Calling them war stories is part of the pride.

This is the language of the disenfranchised. This is the reaction of the unappreciated.

On-call is glorified when it’s all you’re allowed to have. And, historically, ops folk are allowed to have very little. Developers are empowered to create and build, while ops engineers are only allowed to maintain and patch. Developers are expected to be smart; ops engineers are expected to be strong.

No wonder so many ops organizations identify with military institutions and use phrases such as “firefighting” to describe their daily grind. No wonder they craft coats of arms and patches and nod to each other with tales of horrendous outages. We redefine what it means to be a hero and we revel in our brave deeds.

But, at what cost? Not only do we miss out on life events and much-needed sleep, but we also miss out on career progression. Classic sysadmin tasks are swiftly being automated away, and if you’re only allowed to fix what’s broken, you’ll never get out of that hole. Furthermore, you’ll burn out by bashing yourself against rocks that will never move. No job is worth that.

There is only one real benefit to being on-call: you learn a lot about your systems by watching them break. But if you’re only learning, never building, you and your systems will stagnate.

Consider the following:

  1. When you get paged, is it a new problem? Do you learn something, or is it the same issue with the same solution you’ve seen half a dozen times?
  2. When you tell coworkers you were paged last night, how do they react? Do they laugh and share stories, or are they concerned?
  3. When you tell your manager your on-call shift has been rough, do they try to rotate someone else in? Do they make you feel guilty?
  4. Is your manager on-call too? Do they cover shifts over holidays or offer to take an override? Do they understand your burden?

It’s possible you’re working in a toxic on-call culture, one that you glorify because it’s what you know. But it doesn’t have to be this way. Gilded self-healing systems aside, there are healthier ways to approach on-call rotations:

  1. Improve your monitoring and alerting. Only get paged for actionable things in an intelligent way. The Art of Monitoring is a good place to start.
  2. Have rules in place regarding alert fatigue. The Google SRE book considers more than two pages per 12 hour shift too many.
  3. Make sure you’re compensated for on-call work, either financially or with time-off, and make sure that’s publicly supported by management.
  4. Put your developers on-call. You’ll be surprised what stops breaking.
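
A rule about alert fatigue is easier to enforce when something counts the pages for you. The sketch below checks one shift against the two-pages-per-12-hours threshold quoted above; the function names and the idea of feeding it a plain list of page timestamps are assumptions, since every paging tool exports its history differently.

```python
from datetime import datetime, timedelta

MAX_PAGES_PER_SHIFT = 2             # threshold quoted above
SHIFT_LENGTH = timedelta(hours=12)  # shift length quoted above


def pages_in_shift(page_times, shift_start):
    """Count the pages that fired within one shift starting at shift_start."""
    shift_end = shift_start + SHIFT_LENGTH
    return sum(1 for t in page_times if shift_start <= t < shift_end)


def shift_is_overloaded(page_times, shift_start):
    """True when a shift received more pages than the agreed limit."""
    return pages_in_shift(page_times, shift_start) > MAX_PAGES_PER_SHIFT


# Example: three pages in a single night shift.
start = datetime(2016, 12, 6, 20, 0)
pages = [start + timedelta(hours=h) for h in (1, 4, 7)]
print(shift_is_overloaded(pages, start))
```

What happens when the check comes back true (a follow-up, a rotation swap, a ticket against the noisy alert) is the part your team has to agree on.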

For those of you who read these steps and think, “that’s impossible,” I have one piece of advice: get another job. You are not appreciated where you are and you can do much better.

On-call may be a necessary evil, but it shouldn’t be your whole life. In the age of cloud platforms and infrastructure as code, your worth should be much more than editing Puppet manifests. Ops engineers are intelligent, scrappy, and capable of building great things. You should be allowed to take pride in making yourself and your systems better, and not just stomping out yet another fire.

Work on introducing healthier on-call processes in your company, so you can focus on developing your career and enjoying your life.

In the meantime, there is support for weathering rough rotations. I started a hashtag called #oncallselfie to share the ridiculous circumstances I’m in when paged. There’s also The On-Call Handbook as a primer for on-call shifts and a way to share what you’ve learned along the way. And if you’re burned out, I suggest this article as a good first step toward getting back on your feet.

You’re not alone and you don’t have to be a martyr. Be a real hero and let the pager rest.

December 5, 2016

Day 5 - How to fight and fix spam. Common problems and best tools.

Written By: Pablo Hinojosa (@pablohn6)
Edited By: Brendan Murtagh (@bmurts)

The Best Tools to Combat and Fix Common Spam Problems

This article gives a 30,000-foot view of what spam is, how anti-spam works, and how to fix common problems. This is not an article where you are going to find the command(s) to fix spam problems for your MTA. With the help of this article you will understand why you are suffering spam problems and how to identify the root cause. It is not intended as a how-to, but as a foundation for troubleshooting and implementing a configuration to help rectify a spam or bad-reputation problem.

What is Spam?

Obviously you know what spam is, but do you know what it represents in global terms? According to Kaspersky Lab’s Spam and Phishing in Q3 report, “Six in ten of all emails received are now unsolicited spam”. Imagine visiting ten webpages and finding that six of them were unsolicited. What if they were phone calls, SMS messages, or clients of your business? Email is this century’s primary form of communication; business-related or not, communications are electronic. Can you imagine starting each day with 6 of every 10 conversations being unsolicited? A systems administrator’s responsibility is to change that 10 into 100, 1,000, or as many zeroes as you are able to reach.

One thing that helps in understanding the scale of spam is that spam is a huge business. Unsolicited email is one of the most common methods of promoting hundreds of legal and illegal businesses, from a small bike shop to a huge phishing or ransomware criminal network.

Spam is also a huge consumer of resources, both electronic and human. The SMTP protocol was designed from a naive perspective. Its designers did not take much account of how someone could cheat in a communication, which is why big providers designed and implemented several protocols to authenticate senders and limit cheating in email delivery (which consists of two MTAs exchanging messages). It is important to learn what those protocols are and how to configure them properly so that you are not flagged.

However, there are instances where our servers are sending actual Spam. Obviously this is not our intention but we need to quickly identify the issue and begin remediation immediately. In the next section, we will focus on how to detect the sending of spam and discuss techniques to resolve the problem.

Are you a spammer?

Whether or not you are a spammer is a matter of trust. The receiving MTA will question your trustworthiness and you will have to show your reasons. Let’s see what we have to configure to answer “no” to that question and be trusted.

The most important record is an MX record. As RFC 1034 states, it “identifies a mail exchange for the domain”. You can send emails and be trusted from servers that are not in your MX records, but the best way to be trusted (because it is the first thing to be checked) is to send emails from the servers in your MX records. This is not always possible or desirable. Sometimes another MTA sends emails that spoof the sending domain. This is typically unauthorized, which is why the “SPF record” was created. As RFC 7208 states, with an SPF record you can “explicitly authorize the hosts/IPs that are allowed to use your domain name, and a receiving host can check such authorization”. An SPF record is a TXT record you should create (I recommend this website) to tell the world which servers are allowed to send on behalf of your domain name.
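
To give an idea of what ends up inside that TXT record, here is a minimal sketch that assembles the text of an SPF record. The function name is made up for illustration and the address comes from a documentation-only range; the `mx`, `ip4:` and `-all` terms are standard SPF syntax, but the policy that fits your domain is something only you can decide.

```python
VALID_POLICIES = {"-all", "~all", "?all", "+all"}


def build_spf_record(ip4_hosts, allow_mx=True, policy="-all"):
    """Assemble the text of an SPF TXT record.

    ip4_hosts: IPv4 addresses (or CIDR ranges) allowed to send for the domain.
    allow_mx:  also authorize the hosts named in the domain's MX records.
    policy:    what receivers should do with everyone else ("-all" = fail).
    """
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown SPF policy: {policy}")
    parts = ["v=spf1"]
    if allow_mx:
        parts.append("mx")
    parts.extend(f"ip4:{host}" for host in ip4_hosts)
    parts.append(policy)
    return " ".join(parts)


# 192.0.2.10 is a placeholder address, not a real mail server.
print(build_spf_record(["192.0.2.10"]))
```

The resulting string is what you would publish as a TXT record on the sending domain itself.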

Some MTAs require more validation. They need the email to be signed by your MTA so that they can verify the signature and trust you. This is implemented using DKIM. DKIM “defines a domain-level authentication framework for email using public-key cryptography and key server technology to permit verification of the source and contents of messages by either Mail Transfer Agents (MTAs) or Mail User Agents (MUAs)”. As Wikipedia says, DKIM resulted in 2004 from merging two similar efforts, “enhanced DomainKeys” from Yahoo and “Identified Internet Mail” from Cisco. The configuration of DKIM depends on your MTA and your OS. Generally speaking, the steps include, but aren’t limited to, generating a public/private key pair, setting up your MTA, and publishing a TXT record. A simple Google search for your MTA, OS, and DKIM should get you started. You can verify your configuration with this tool.

There are times when an MTA can say: hey, you are cheating me! I reject your email, and you should know you are a cheater. That is why DMARC was created. “DMARC is a scalable mechanism by which a mail-originating organization can express domain-level policies and preferences for message validation, disposition, and reporting, that a mail-receiving organization can use to improve mail handling”. It basically uses the SPF and DKIM results to decide whether to accept or reject your email (and notify you, if you wish). It is just a TXT record. You can use this generator, and there are several tools to create and validate your DNS record or email and to read DMARC reports.

If you have not had spam problems previously and you configure MX, SPF, DKIM, and DMARC, you can be 99% sure that you will answer “no” to the “Are you a spammer?” question and be trusted. If you are not trusted, feel free to send me an email and I will help you figure out why you are not trusted with that configuration. Be sure to check that all your configuration is OK with this amazing tool. But wait, what happens if you are a spammer?
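
Since a DMARC policy is just a TXT record (published under the `_dmarc` label of your domain), reading one is mostly a matter of splitting it into tags. The sketch below does that for a sample record; the function name and the reporting address are made up for illustration, and a real validator checks far more than this.

```python
def parse_dmarc(record: str) -> dict:
    """Split a DMARC TXT record into a {tag: value} dictionary."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        key, _, value = part.partition("=")
        tags[key.strip()] = value.strip()
    return tags


# Sample record; the rua address is a placeholder.
record = "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com;"
tags = parse_dmarc(record)

# p is the requested handling for failing mail: none, quarantine, or reject.
print("policy:", tags["p"])
print("aggregate reports go to:", tags.get("rua", "(not requested)"))
```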

You are a spammer.

First of all, by “you” I mean your IP. Sometimes, usually in shared hosting services, you have the problem but you are not the cause. Or worse, your IP is not sending spam now, but it did before. Spam problems are so serious that once you have been flagged for spamming, you are not easily given the chance to be forgiven. It is all about reputation. Your reputation is based on your IP’s spam history and even your IP range’s spam history. Yes, if an IP 7 bits away from yours is sending spam, your reputation could be affected. To find out, I recommend you visit this website.

Most of the time, the problem is with your IP, and thus your IP is blacklisted. I recommend this tool to check if your IP is blacklisted. But be careful: sometimes you may appear blacklisted not because you are sending actual spam, but because you do not respect some RFC. That is why this tool shows only the main, best-known blacklists. If you are blacklisted you will have to:

  1. Be able to respond “no” to the “Are you a spammer?” question.
  2. Fix your “you are spammer” problems (locate the spam source and fix it).
  3. Request delisting from the blacklist.

If you do just step 3, things will actually get worse, because there is a strong possibility you will be blacklisted again. Spam is so serious that blacklists will sometimes forgive you once, but not twice.

This scenario happens often enough: you send an email and it is rejected, or it goes directly to the recipient’s Spam folder. In the first case, the NDR (non-delivery report) can sometimes tell you the reason (or the blacklist) for the rejection. This level of detail depends on the receiving side’s MTA configuration.

However, in the second case, anti-spam software and major providers work in a different fashion. Typically service providers will flag your IP as a spammer, which results in all email originating from that host/IP going directly to the Spam folder or being rejected, and it negatively affects your “internal reputation”. The reputation of an email is calculated as a score using a mathematical formula in conjunction with pattern detection and defined rules that are analyzed by the mail server. This tool can show the score of your email content. This is very important when you are sending newsletters, because they have a high probability of being marked as spam. This is why many companies and people use dedicated email marketing services like MailChimp and AWeber.
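
If you are curious how a blacklist check works under the hood: DNS-based blacklists are queried by reversing the octets of the IP, appending the blacklist’s zone, and looking that name up. An answer means “listed”; a “no such name” response means “not listed”. The sketch below shows the idea. The zone `dnsbl.example.org` is a placeholder rather than a real blacklist, and the function names are made up for illustration.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # also validates the IP
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blacklist zone has an entry for this IP.

    Caveat: a failed lookup is treated as "not listed", so a resolver
    outage looks the same as a clean result.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# Placeholder IP and zone; swap in your MTA's IP and a blacklist you use.
print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
```

Each blacklist documents its own zone name and usage terms, so check those before pointing a script at it.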

With major providers, that internal reputation depends on additional “secret” factors, but we could also say it helps when a person (not a robot or a mathematical formula) says: this is not spam. Do you remember that button?

If you are having spam problems (mainly rejects) with only one provider, the following information will help you. If your problem is with Yahoo, you can use this form to say: hey, please forgive me. Gmail also has this form. Microsoft (Outlook and Hotmail) also has this form, and they have internal reputation tools that show you what they think about you. These are named SNDS and JMRP, and if you are having problems with Hotmail (which happens too often) they will help you a lot.

With small providers, sometimes the best option is to send an email to the postmaster requesting that your IP be whitelisted.

When you are sending too much spam, anti-spam software and services or major providers sometimes just reject your emails because they have no doubt you are a spammer. If you cannot telnet from your MTA’s IP to port 25 of the IP in their MX record (the connection times out), you will not be able to send any email to them, and your emails will be queued. This is the worst symptom. Sometimes nobody can send email to one provider, sometimes nobody can send email to anybody; the telephone is ringing and everybody is screaming. If you came here in that situation, I hope this article has helped you understand how serious a problem spam can be. Remember to validate your configuration and always work as fast as possible to find the source of the spamming.

Locating a spam source is sometimes a hard, but necessary, task for system administrators. If you have read this far, you understand how anti-spam works technically, so you have more weapons to fight it. It is also a security research task, because usually one of your clients or your server has been compromised and used to send unsolicited email all around the world. In that case you will have to find the malicious code and also close the point of entry. In this situation, I suggest you do the following:
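
The telnet test described above is easy to script if you need to check many destinations. This is a minimal sketch using a plain TCP connection; the function name and the 10-second timeout are arbitrary choices, and it must be run from the MTA itself so the connection leaves from the IP whose reputation is in question.

```python
import socket


def can_connect(host: str, port: int = 25, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS failures alike.
        return False


# Example call (mx.example.net is a placeholder for the receiving MX host):
# print(can_connect("mx.example.net"))
```

A `False` here only tells you the connection did not open; it does not distinguish a block on their side from a firewall on yours, so compare against a host you know is reachable.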

  • Study your mail logs to find out if it is a single email account or not. If it is one email account, maybe malware or a cracked email password is the root cause. Changing the password may fix the problem. However, if other malware is still on the client email device, the password could be compromised again. A re-image is the safest method to ensure a clean device or machine.
  • If the FROM address is generated, the source could be malicious code on your server generating spam emails. You can create a wrapper for your MTA, OS, and platform stack to log the source of each email that is sent. For example, this is a wrapper for Sendmail, Apache and PHP. Pay special attention if your platform is WordPress or Joomla. Bots can exploit old bugs in outdated plugins, or malicious “free” (but not really free) themes, to insert the malware.

In conclusion, spam is a huge problem that affects all email providers. It can be caused by a lack of the configuration needed to build IP reputation, or by actual spam being sent due to malware. That is why it is important to take into account both your IP reputation and the security of your infrastructure to avoid future problems.

Pablo Hinojosa is a Linux System Administrator who worked on the Gigas Hosting support team, assisting thousands of clients affected by spam.
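
For the first suggestion, studying the mail logs, a quick tally of envelope senders usually shows whether a single account is responsible. The sketch below assumes Postfix-style log lines containing `from=<address>`; other MTAs log differently, so treat the regular expression and the sample lines as assumptions to adapt.

```python
import re
from collections import Counter

# Assumption: Postfix-style queue manager lines containing from=<address>.
FROM_RE = re.compile(r"from=<([^>]*)>")


def count_senders(log_lines):
    """Count how many logged messages each envelope sender produced."""
    counts = Counter()
    for line in log_lines:
        match = FROM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines for illustration.
sample = [
    "Dec  5 10:00:01 mail postfix/qmgr[1234]: A1B2C3: from=<newsletter@example.com>, size=2048, nrcpt=1 (queue active)",
    "Dec  5 10:00:02 mail postfix/qmgr[1234]: A1B2C4: from=<newsletter@example.com>, size=2051, nrcpt=1 (queue active)",
    "Dec  5 10:00:03 mail postfix/qmgr[1234]: A1B2C5: from=<alice@example.com>, size=900, nrcpt=1 (queue active)",
]

for sender, total in count_senders(sample).most_common(5):
    print(f"{total:6d}  {sender}")
```

To run it against a real log, pass an open file instead of `sample`. One sender far ahead of the rest points to a compromised account; many random-looking senders point to generated addresses and the second suggestion.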

December 4, 2016

Day 4 - Change Management: Keep it Simple, Stupid

Written By: Chris McDermott
Edited By: Christopher Webber (@cwebber)

I love change management. I love the confidence it gives me. I love the traceability–how it’s effectively a changelog for my environment. I love the discipline it instills in my team. If you do change management right, it allows you to move faster. But your mileage may vary.

Not everyone has had a good experience with change management. In caricature, this manifests as the Official Change Board that meets bi-monthly and requires all participants to be present for the full meeting as every proposed plan is read aloud from the long and complicated triplicate copy of the required form. Questions are asked and answered; final judgements eventually rendered. Getting anything done takes weeks or months. People have left organizations because of change management gone wrong.

I suppose we really should start at the beginning, and ask “Why do we need change management at all?” Many teams don’t do much in the way of formal change process. I’ve made plenty of my own production changes without any kind of change management. I’ve also made the occasional human error along the way, with varying degrees of embarrassment.

I challenge you to try a simple exercise. Start writing down your plan before you execute a change that might impact your production environment. It doesn’t have to be fancy – use notepad, or vim, or a pad of paper, or whatever is easiest. Don’t worry about approval or anything. Just jot down three things: step-by-step what you’re planning to do, what you’ll test when you’re done, and what you would do if something went wrong. This is all stuff you already know, presumably. So it should be easy and fast to write it down somewhere.

When I go through this exercise, I find that I routinely make small mistakes, or forget steps, or realize that I don’t know where the backups are. Most mistakes are harmless, or they’re things that I would have caught myself as soon as I tried to perform the change. But you don’t always know, and some mistakes can be devastating.

The process of writing down my change plan, test plan, and roll-back plan forces me to think through what I’m planning carefully, and in many cases I have to check a man page or a hostname, or figure out where a backup file is located. And it turns out that doing all that thinking and checking catches a lot of errors. If I talk through my change plan with someone else, well that catches a whole bunch more. It’s amazing how much smarter two brains are, compared to just one. Sometimes, for big scary changes, I want to run the damn thing past every brain I can find. Heh, in fact, sometimes I show my plan to people I’m secretly hoping can think of a better way to do it. Having another human being review the plan and give feedback helps tremendously.

For me, those are the really critical bits. Write down the complete, detailed plan, and then make sure at least one other person reviews it. There’s other valuable stuff you can do like listing affected systems and stakeholders, and making notification and communication part of the planning process. But it’s critical to keep the process as simple, lightweight, and easy as possible. Use a tool that everyone is already using – your existing ticketing software, or a wiki, or any tool that will work. Figure out what makes sense for your environment, and your organization.

When you can figure out a process that works well, you gain some amazing benefits. There’s a record of everything that was done, and when, and by whom. If a problem manifests 6 or 12 or 72 hours after a change was made, you have the context of why the change was made, and the detailed test plan and roll-back plan right there at your fingertips. Requiring some level of review means that multiple people should always be aware of what’s happening and can help prevent knowledge silos. Calling out stakeholders and communication makes it more likely that people across your organization will be aware of relevant changes being made, and unintended consequences can be minimized. And of course you also reduce mistakes, which is benefit enough all by itself. All of these things combined allow high-functioning teams to move faster and act with more confidence.

I can give you an idea of what this might look like in practice. Here at SendGrid, we have a Kanban board in Jira (a tool that all our engineering teams were already using when we rolled out our change management process). If an engineer is planning a change that has the potential to impact production availability or customer data, they create a new issue on the Change Management Board (CMB). The template has the following fields:

  • Summary
  • Description
  • Affected hosts
  • Stakeholders
  • Change plan
  • Test plan
  • Roll-back plan
  • Roll-back verification plan
  • Risks

All the fields are optional except the Summary, and several of them have example text giving people a sample of what’s expected. When the engineer is happy with the plan, they get at least one qualified person to review it. That might be someone on their team, or it might be a couple of people on different teams. Engineers are encouraged to use their best judgement when selecting reviewers. Once a CMB has been approved (the reviewer literally just needs to add an “LGTM” comment on the Jira issue), it is dragged to the “Approved” column, and then the engineer can move it across the board until they’re done with the change. Each time the CMB’s status in Jira changes, it automatically notifies a HipChat channel where we announce things like deploys. For simple changes, this whole process can happen in the space of 10 or 15 minutes. More complicated ones can take a day or two, or in a few cases weeks (usually indicative of complex inter-team dependencies). The upper bound on how long it has taken is harder to calculate. We’ve had change plans that were written and sent to other teams for review, which then spawned discussions that spawned projects that grew into features or fixes, and the original change plan withered and died. Sometimes that’s the better choice.
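To make the shape of such a ticket concrete, here is a minimal sketch in Python. It is not Jira's API or SendGrid's tooling: the `ChangeIssue` class, its method names, and the example host `db01` are all hypothetical, and the approval rule is simplified to "any comment containing LGTM".

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical model of a change ticket. Field names mirror the template
# described above; the class itself is illustrative only.
@dataclass
class ChangeIssue:
    summary: str  # the only required field
    description: str = ""
    affected_hosts: List[str] = field(default_factory=list)
    stakeholders: List[str] = field(default_factory=list)
    change_plan: str = ""
    test_plan: str = ""
    rollback_plan: str = ""
    rollback_verification_plan: str = ""
    risks: str = ""
    comments: List[str] = field(default_factory=list)

    def missing_plans(self) -> List[str]:
        """Names of the three core plan fields that are still empty."""
        plans = {
            "change_plan": self.change_plan,
            "test_plan": self.test_plan,
            "rollback_plan": self.rollback_plan,
        }
        return [name for name, text in plans.items() if not text.strip()]

    def is_approved(self) -> bool:
        """Simplified rule: approved once any comment contains 'LGTM'."""
        return any("LGTM" in comment.upper() for comment in self.comments)


issue = ChangeIssue(
    summary="Add new disk to volume group on db01",  # hypothetical host
    change_plan="Scan bus, extend volume group, grow filesystem",
    test_plan="Check free space with df",
)
print(issue.missing_plans())  # the roll-back plan has not been written yet
print(issue.is_approved())    # no reviewer has commented
issue.comments.append("lgtm, ship it")
print(issue.is_approved())    # a reviewer has signed off
```

A real process would also check that the approving comment comes from someone other than the author; the sketch leaves that out to stay short.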

I don’t think we have it perfect yet; we’ll probably continue to tune it to our needs. Ours is just one possible solution among many. We’ve tried to craft a process that works for us. I encourage you to do the same.

December 3, 2016

Day 3 - Building Empathy: a devopsec story

Written By: Annie Hedgpeth (@anniehedgie)
Edited By: Kerim Satirli (@ksatirli)

’Twas the night before Christmas, and all through the office not a creature was stirring … except for the compliance auditors finishing up their yearly CIS audits.

Ahh, poor them. This holiday season, wouldn’t you love to give your security and compliance team a little holiday cheer? Wouldn’t you love to see a bit of peace, joy, and empathy across organizations? I was lured into technology by just that concept, and I want to share a little holiday cheer by telling you my story.

I’m totally new to technology, having made a pretty big leap of faith into a career change. The thing that attracted me to technology was witnessing this display of empathy firsthand. My husband works for a large company that specializes in point-of-sale software, and he’s a very effective driver of DevOps within his organization. He was ready to move forward with automating all of the things and bringing more of the DevOps cheer to his company, but his security and compliance team was, in his eyes, blocking his initiatives - and for good reason!

My husband’s year-long struggle with getting his security and compliance team on board with automation was such an interesting problem to solve for me. He was excited about the agile and DevOps methodologies that he had adopted and how they would bring about greater business outcomes by increasing velocity. But the security and compliance team was still understandably hesitant, especially with stories of other companies experiencing massive data breaches in the news with millions of dollars lost. I would remind my husband that they were just trying to do their jobs, too. The security and compliance folks aren’t trying to be a grinch. They’re just doing their job, which is to defend, not to intentionally block.

So I urged him to figure out what they needed and wanted (ENTER: Empathy). And what he realized is that they needed to understand what was happening with the infrastructure. I can see how all of the automated configuration management could have caused a bit of hesitation on behalf of security and compliance. They wanted to be able to inspect everything more carefully and not feel like the automation was creating vulnerability issues that were out of their control.

But the lightbulb turned on when they realized that they could code their compliance controls with a framework called InSpec. InSpec is an open-source framework owned by Chef but totally platform agnostic. The cool thing about it is that you don’t even need to have configuration management to use it, which makes it a great introduction to automation for those that are new to DevOps or any sort of automation.

(Full-disclosure: Neither of us works for Chef/InSpec; we’re just big fans!)

You can run it locally or remotely, with nothing needing to be installed on the nodes being tested. That means you can store your InSpec test profile on your local machine or in version control and run it from the CLI to test your local machine or a remote host.

# run test locally
inspec exec test.rb

# run test on remote host on SSH
inspec exec test.rb -t ssh://user@hostname

# run test on remote Windows host on WinRM
inspec exec test.rb -t winrm://Administrator@windowshost --password 'your-password'

# run test on Docker container
inspec exec test.rb -t docker://container_id

# run with sudo
inspec exec test.rb --sudo [--sudo-password ...] [--sudo-options ...] [--sudo-command ...]

# run in a subshell
inspec exec test.rb --shell [--shell-options ...] [--shell-command ...]

The security and compliance team’s fears were finally allayed. All of the configuration automation that my husband was doing had allowed him to see his infrastructure as code, and now the security and compliance team could see their compliance as code, too.

They began to realize that they could automate a huge chunk of their PCI audits and verify every time the application or infrastructure code changed instead of the lengthy, manual audits that they were used to!

Chef promotes InSpec as being human-readable and accessible for non-developers, so I decided to learn it for myself and document on my blog whether or not that was true for me, a non-developer. As I learned it, I became more and more of a fan and could see how it was not only accessible, but in a very simple and basic way, it promoted empathy between the security and compliance teams and the DevOps teams. It truly is at the heart of the DevSecOps notion. We know that for DevOps to deliver on its promise of creating greater velocity and innovation, silos must be broken down. The silos being torn down absolutely must include those of the security and compliance teams. The InSpec framework does that in such a simple way that it is easy to gloss over. I promise you, though, it doesn’t have to be complicated. So here it is…metadata. Let me explain.

If you’re a compliance auditor, then you’re used to working with PDFs, spreadsheets, docs, etc. One example of that is the CIS benchmarks. Here’s what a CIS control looks like.

And this is what that same control looks like when it’s being audited using InSpec. Can you see how the metadata provides a direct link to the CIS control above?

control "cis-1-5-2" do
  impact 1.0
  title "1.5.2 Set Permissions on /etc/grub.conf (Scored)"
  desc "Set permission on the /etc/grub.conf file to read and write for root only."
  describe file('/etc/grub.conf') do
    it { should be_writable.by('owner') }
    it { should be_readable.by('owner') }
  end
end

And then when you run a profile of controls like this, you end up with a nice, readable output like this.

When security and compliance controls are written this way, developers know what standards they’re expected to meet, and security and compliance auditors know that they’re being tested! InSpec allows them to speak the same language. When someone from security and compliance looks at this test, they feel assured that “Control 1.5.2” is being tested and can see what its impact level is for future prioritization. They can also read plainly how that control is being audited. And when a developer looks at this control, they see a description that gives them a frame of reference for why this control exists in the first place.

And when the three magi of Development, Operations, and Security and Compliance all speak the same language, bottlenecks are removed and progress can be realized!

Since I began my journey into technology, I have found myself at 10th Magnitude, a leading Azure cloud consultancy. My goal today is to leverage InSpec in as many ways as possible to add safety to 10th Magnitude’s Azure and DevOps engagements so that our clients can realize the true velocity the cloud makes possible.

I hope this sparked your interest in InSpec as it is my holiday gift to you! Find me on Twitter @anniehedgie, and find much more about my journey with InSpec and technology on my blog.

December 2, 2016

Day 2 - DBAs, a priesthood no more

Written by: Silvia Botros (@dbsmasher)
Edited by: Shaun Mouton (@sdmouton)
Header image: Hermione casting a spell. Illustration by Frida Lundqvist.

Companies have had and needed Database Administrators for years. Data is one of a business’s most important assets. That means many businesses, once they grow to the point where they must be able to rapidly scale, need someone to make sure that asset is well managed, performant for the product needs, and available to restore in case of disasters.

In a traditional sense, the job of the DBA means she is the only person with access to the servers that host the data, the person to go to for creating new database clusters for new features, the person to design new schemas, and the only person to contact when anything database related breaks in a production environment.

Because DBAs traditionally have such unique roles their time is at a premium, and it becomes harder to think big picture when day to day tasks overwhelm. It is typical to resort to brittle tools like bash for all sorts of operational tasks in DBA land. Need a new DB setup from a clean OS install? Take, validate, or restore backups? Rotate partitions or stale data? When your most commonly used tool is bash scripting, everything looks like a nail. I am sure many readers are preparing tweets to tell me how powerful bash is, but please hold your commentary until after evaluating my reasoning.

Does all this sound like your job description as a DBA? Does the job description talk in detail about upgrading servers, creating and testing backups, and monitoring? Most typical DBA job postings will make sure to say that you have to configure and set up ‘multiple’ database servers (because the expectation is that DBAs hand craft them), and automate database management tasks with (hand crafted) scripts.

Is that really a scalable approach for what is often a team of one in a growing, fast paced organization?

I am here to argue that your job is not to perform and manage backups, create and manage databases, or optimize queries. You will do all these things in the span of your job but the primary goal is to make your business’s data accessible and scalable. This is not just for the business to run the current product but also to build new features and provide value to customers.


You may want to ask, why would I do any of this? There is an argument for continuing to execute the DBA role traditionally: job security, right?

Many tech organizations nowadays do one or more of the following:
  • They are formed of many smaller teams
  • They provide features by creating many micro-services in place of one or a few larger services
  • They adopt agile methodologies to speed the delivery of features
  • They combine operations and engineering under one leadership
  • They embed operations engineers with developers as early as possible in the design process
A DBA silo within operations means the operations team is less empowered to help debug production issues in its own stack, is sometimes unable to respond and fix issues without assistance, and is frankly less credible when demanding closer and earlier collaboration with the engineering teams if it isn’t practicing what it preaches inside Tech Ops.

So what can be done to bust that silo and make it easier for other folks to debug, help scale the database layer, and empower engineers to design services that can scale? Most up-and-coming shops have at most one in-house DBA. Can the one DBA be ‘present’ in all design meetings, approve every schema change, and be on call for a sprawling, ever growing database footprint?

DBAs can no longer be gate keepers or magicians. A DBA can and should be a source of knowledge and expertise for engineers in an organization. She should help the delivery teams not just deliver features but to deliver products that scale and empower them to not fear the database. But how can a DBA achieve that while doing the daily job of managing the data layer? There are a number of ways you, the DBA, can set yourself up for excellence.

Configuration management

This is a very important one. DBAs tend to prefer old school tools like bash for database setup. I alluded to this earlier and I have nothing against using bash itself. I use it a lot, actually. But it is not the right tool for cluster setup. Especially if the rest of ops is NOT using Bash to manage the rest of the architecture. It’s true that operations engineers know Bash too, but if they are managing the rest of the infrastructure with a tool like Chef or Puppet and the databases are managed mostly by hand crafted scripts written by the DBA, you are imposing an obstruction for them to provide help when an urgent change is needed. Moreover, it becomes harder to help engineering teams to self-serve and own the creation of the new clusters they need for new feature foo. You become the ‘blocker’ for completing work. Getting familiar with the configuration management at your company is also a two way benefit. As you get familiar with how the infrastructure is managed, you get to know the team’s standards, get more familiar with the stack, and are able to collaborate on changes that ultimately affect the product scale. A DBA who is tuned into the engineering organization’s product and infrastructure as a whole is invaluable.


Runbooks

This is technically a subset of the documentation you have to write (you document things, right?!) but in my experience it has proven so useful that I feel it has to be pointed out separately. When I say runbooks I specifically mean a document written for an audience that is NOT a DBA. There are a lot of production DB issues we may encounter as DBAs that are simple for us to debug and resolve. We tend to underestimate that muscle memory, and we fall into the pattern of ‘just send me the page’ and we ‘take care of things’.

If your operations team is like mine, where you are the only DBA, it probably means someone else on the team is the first line of defense when a DB related event pages. Some simple documentation on how to do initial debugging and data collection can go a long way in making the rest of the operations team comfortable with the database layer and more familiar with how we monitor and debug it. Even if that event still results in paging the DBA, slowly but surely, the runbook becomes a place for everyone to add acquired knowledge.

Additionally, I add a link to the related runbook section (use anchors!) to the page descriptions that go to the pager. This is incredibly helpful for someone being paged by a database host at 3 AM to find a place to start. These things may seem small, but in my experience they have gone a long way toward breaking mental barriers for my operations team when they need to work on the database layer.
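As a rough illustration of the anchor idea, here is a small Python sketch that builds a page description ending in a deep link to the relevant runbook section. The base URL, the function name, and the anchor names are all hypothetical; substitute wherever your runbooks actually live.

```python
# Hypothetical location of the rendered runbook; replace with your own.
RUNBOOK_BASE = "https://wiki.example.com/runbooks/mysql"


def page_description(alert: str, host: str, anchor: str) -> str:
    """Build the text sent to the pager, ending with a deep link into the runbook."""
    return f"{alert} on {host}. Runbook: {RUNBOOK_BASE}#{anchor}"


print(page_description("Replication lag", "db-replica-2", "replication-lag"))
```

Keeping the anchor next to the alert definition means whoever edits the alert is reminded that a runbook section should exist for it.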

As a personal preference, I write these as markdown docs inside my Chef cookbook repositories. This falls seamlessly into a pull request, review and merge pattern, and it becomes an integral part of the databases’ cookbooks pattern. As engineering teams start creating their own, the runbooks become a familiar template as new database clusters spring out all over the place.


We like our terminal screens. We love them. The most popular tools in MySQL land are still terminal tools that live directly on the db hosts and that need prior knowledge of them and how to use them. I am talking about things like innotop and the MySQL shell. These are fine and still helpful but they are created for DBAs. If you do not want to be the gatekeeper to questions like “is there replication lag right now”, you need to have better tools to make any cluster health, now and historically, available and easy to digest for all team members. I have a few examples in this arena:


We use read replicas to spread read load away from the primary, which means once lag hits a certain threshold, it becomes a customer support event. It is important to make it easier for anyone in the company to know at any given time whether any cluster is experiencing lag, which servers in that cluster are affected, and whether any of the hosts has gone down. Orchestrator is a great tool in that regard: it makes visualizing clusters and their health a browser window away.
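The question "is any replica lagging right now?" boils down to a threshold check. The Python sketch below shows only that logic. It does not use Orchestrator's API, and the threshold value, host names, and function name are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Assumed threshold in seconds; use whatever level becomes a support event for you.
LAG_THRESHOLD_SECONDS = 30


def lagging_replicas(cluster: Dict[str, Optional[int]],
                     threshold: int = LAG_THRESHOLD_SECONDS) -> List[str]:
    """Return replicas whose lag exceeds the threshold or is unknown.

    A lag of None is treated as a problem, since it usually means
    replication is stopped or the host could not be reached.
    """
    return sorted(
        host for host, lag in cluster.items()
        if lag is None or lag > threshold
    )


# Hypothetical snapshot: host name -> seconds behind the primary.
snapshot = {"db-replica-1": 2, "db-replica-2": 95, "db-replica-3": None}
print(lagging_replicas(snapshot))
```

Treating an unknown lag as unhealthy is a deliberate choice here: a replica that cannot report its lag should not silently look fine.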


Metrics for the DB layer need to live in the same place metrics for the rest of the infrastructure are. It is important for the team to be able to juxtapose these metrics side by side. And it is important to have an easy way to see historical metrics for any DB cluster. While you may have a personal preference for cacti or munin, or artisanal templates that you have written over the years, if the metrics you use to investigate issues are not in the same place as the rest of the infrastructure metrics it sets up a barrier for other busy engineers, and they’ll be less inclined to use your tooling over that which is in use elsewhere. Graphite is in wide use for ingesting metrics in modern infrastructure teams, and Grafana is a widely used dashboarding front-end for metrics and analytics.

Query performance

We use VividCortex to track our queries on critical clusters and while this article isn’t meant to be an advertisement for a paid service, I will say that you need to make the ability to inspect the effect of deploys and code changes on running queries and query performance something that doesn’t need special access to logs and manually crunching them. If VividCortex isn’t a possibility (although, seriously, they are awesome!), there are other products and open source tools that can capture even just the slow log and put it in an easy to read web page for non DBAs to inspect and see the effect of their code. The important point here is that if you provide the means to see the data, engineers will use that data and do their best to keep things efficient. But it is part of your job to make that access available and not a special DBA trick.
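For a sense of how little is needed to turn a slow log into something a non-DBA can read, here is a minimal Python sketch. It assumes the standard `# Query_time:` header that MySQL writes before each slow-log entry, keeps only the first line of each statement, and does not normalize literals, so it is far cruder than a real digest tool. The function name and the sample log are made up for illustration.

```python
import re
from collections import defaultdict

# Header line that MySQL writes before each slow-log entry.
QUERY_TIME = re.compile(r"^# Query_time: (?P<secs>\d+(?:\.\d+)?)")


def summarize_slow_log(text):
    """Group entries by statement; return (statement, count, total_seconds), slowest first."""
    totals = defaultdict(lambda: [0, 0.0])
    pending = None  # query time of the entry whose statement we have not seen yet
    for line in text.splitlines():
        stripped = line.strip()
        match = QUERY_TIME.match(stripped)
        if match:
            pending = float(match.group("secs"))
        elif (pending is not None and stripped
              and not stripped.startswith(("#", "SET timestamp", "use "))):
            totals[stripped][0] += 1
            totals[stripped][1] += pending
            pending = None
    rows = [(stmt, count, secs) for stmt, (count, secs) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up sample in the shape of a MySQL slow query log.
SAMPLE_LOG = """\
# Time: 2016-12-02T10:00:00.000000Z
# User@Host: app[app] @ localhost []
# Query_time: 2.500000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 900000
SET timestamp=1480672800;
SELECT * FROM orders WHERE customer_id = 42;
# Time: 2016-12-02T10:00:05.000000Z
# User@Host: app[app] @ localhost []
# Query_time: 1.500000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 900000
SET timestamp=1480672805;
SELECT * FROM orders WHERE customer_id = 42;
# Time: 2016-12-02T10:00:09.000000Z
# User@Host: app[app] @ localhost []
# Query_time: 3.000000  Lock_time: 0.000200 Rows_sent: 1  Rows_examined: 50000
SET timestamp=1480672809;
SELECT COUNT(*) FROM events;
"""

for stmt, count, secs in summarize_slow_log(SAMPLE_LOG):
    print(f"{secs:6.2f}s  x{count}  {stmt}")
```

The output of a function like this could feed a simple web page or a chat message after each deploy, which is the kind of access the paragraph above argues for.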

Fight the pager fatigue

A lot of organizations do not have scaling the database layer as a very early imperative in their stack design, and they shouldn’t. In the early days of a company, you shouldn’t worry about how you will throttle API calls if no one is using the API yet. But it’s appropriate to consider a few years later, when the product has gained traction, and that API call that was hitting a table of a few thousand rows by a handful of customers is now hitting a multi-million-row table, and a couple of customers have built cron jobs that flood that API every morning at 6 AM your timezone.

It takes a lot of work to change the application layer of any product to protect the infrastructure, and in the interim, allowing spurious database activity to cause pager fatigue is a big danger to both you and the rest of the operations organization. Get familiar with tools like pt-kill that can be used in a pinch to keep a database host from having major downtime due to unplanned volume. Make the use of that tool known, and communicate the action and its effect to the stakeholder engineering team. It is unhealthy to try to absorb the pain from something you cannot directly change, and doing so ultimately does not help the engineering teams learn how to deal with growing pains.
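To make the idea concrete, the Python sketch below picks which connections a "kill long-running reads" rule would target. This is not how pt-kill is implemented and it does not talk to a database; it only illustrates the kind of rule such a tool lets you express. The row keys are lowercase versions of the `SHOW PROCESSLIST` columns, and the threshold, user names, and sample rows are assumptions.

```python
from typing import Dict, List, Tuple


def queries_to_kill(processlist: List[Dict],
                    busy_seconds: int = 60,
                    protected_users: Tuple[str, ...] = ("repl", "root")) -> List[int]:
    """Return connection ids of SELECTs running longer than busy_seconds.

    Writes, idle connections, and protected users are never selected.
    """
    victims = []
    for row in processlist:
        if row["command"] != "Query":
            continue  # sleeping connections, replication threads, etc.
        if row["user"] in protected_users:
            continue
        statement = (row["info"] or "").lstrip().upper()
        if not statement.startswith("SELECT"):
            continue  # leave writes alone; killing them has other costs
        if row["time"] > busy_seconds:
            victims.append(row["id"])
    return victims


# Hypothetical processlist snapshot.
processlist = [
    {"id": 11, "user": "api", "command": "Query", "time": 340, "info": "SELECT * FROM events"},
    {"id": 12, "user": "api", "command": "Query", "time": 3, "info": "SELECT 1"},
    {"id": 13, "user": "repl", "command": "Binlog Dump", "time": 99999, "info": None},
    {"id": 14, "user": "api", "command": "Query", "time": 500, "info": "UPDATE accounts SET plan = 'pro'"},
    {"id": 15, "user": "root", "command": "Query", "time": 700, "info": "SELECT SLEEP(1000)"},
]
print(queries_to_kill(processlist))
```

Printing the candidate list before acting on it also gives you something to share with the stakeholder team when you explain what was killed and why.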

There are a lot of ways a DBA’s work is unique to her role in comparison to the rest of the operations team, but that doesn’t mean it has to be a magical priesthood no one can approach. These steps go a long way in making your work transparent, but most important is approaching your work not as a gatekeeper to a golden garden of database hosts but as a subject matter expert who can provide advice, help grow the engineers you work with, and provide more value to the business than backups and query tuning (but those are fun too!).

Special thanks to the wonderful operations team at SendGrid who continue to teach me many things, and to Charity Majors for coining the title of this post.

December 1, 2016

Day 1 - Why You Need a Postmortem Process

Written by: Gabe Abinante (@gabinante)
Edited by: Shaun Mouton (@sdmouton)

Why Postmortems?

Failure is inevitable. As engineers building and maintaining complex systems, we likely encounter failure in some form on a daily basis. Not every failure requires a postmortem, but if a failure impacts the bottom line of the business, it becomes important to follow a postmortem process. I say “follow a postmortem process” instead of “do a postmortem”, because a postmortem should have very specific goals designed to prevent future failures in your environment. Simply asking the five whys to try and determine the root cause is not enough.

A postmortem is intended to fill out the sort of knowledge gaps that inevitably exist after an outage:
  1. Who was involved / Who should have been involved?
  2. Was/is communication good between those parties?
  3. How exactly did the incident happen, according to the people who were closest to it?
  4. What went well / what did we do right?
  5. What could have gone better?
  6. What action items can we take from this postmortem to prevent future occurrence?
  7. What else did we learn?
Without a systematic examination of failure, observers can resort to baseless speculation.

Without an analysis of what went right as well as what went wrong, the process can be viewed as a complete failure.

Without providing key learnings and developing action items, observers are left to imagine that the problem will almost certainly happen again.
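One lightweight way to keep a postmortem from skipping any of the seven questions listed earlier is to treat them as a checklist. The Python sketch below does that. The function name and the sample answers are hypothetical, and a shared document template would serve the same purpose.

```python
POSTMORTEM_QUESTIONS = [
    "Who was involved / who should have been involved?",
    "Was/is communication good between those parties?",
    "How exactly did the incident happen, according to the people closest to it?",
    "What went well / what did we do right?",
    "What could have gone better?",
    "What action items can we take to prevent future occurrence?",
    "What else did we learn?",
]


def unanswered(answers: dict) -> list:
    """Return the questions that have no answer, or only whitespace, so far."""
    return [q for q in POSTMORTEM_QUESTIONS if not answers.get(q, "").strip()]


# Hypothetical draft with only two questions answered.
draft = {
    POSTMORTEM_QUESTIONS[0]: "On-call SRE and the deploying engineer.",
    POSTMORTEM_QUESTIONS[3]: "Rollback completed within ten minutes.",
}
for question in unanswered(draft):
    print("TODO:", question)
```

Note that "what went well" sits in the same list as the failures, so the finished document cannot be read as a record of mistakes only.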

A Case Study of the Knight Capital critical SMARS error 2012

Knight Capital was a financial services firm engaging in high frequency trading on the New York Stock Exchange and NASDAQ. It posted revenue of $1.404 billion in 2011, but went out of business by the end of 2012.

On August 1, 2012, Knight Capital deployed untested software which contained an obsolete function to a production environment. The incident happened due to an engineer deploying new code to only 7/8 of the servers responsible for Knight’s automated routing system for equity orders. The code repurposed a flag that was formerly used to activate an old function known as “Power Peg”, which was designed to move stock prices higher and lower in order to verify the behavior of trading algorithms in a controlled environment. All orders sent with the repurposed flag to one of the servers triggered the obsolete code still present on that server. As a result, Knight’s trading activities caused a major disruption in the prices of 148 companies listed at the New York Stock Exchange. This caused the prices of certain stocks to jump by as much as 1200%. For the incoming parent orders that were processed by the defective code, Knight Capital sent millions of child orders, resulting in 4 million executions in 154 stocks for more than 397 million shares in approximately 45 minutes (1). Knight Capital took a pre-tax loss of $440 million. Despite a bailout the day after, this precipitated the collapse of Knight Capital’s stock, losing 75% of their equity value.

I chose to write about this incident because there is an incredible body of writing about it, but actually remarkably little information or substance beyond the SEC release. The amount of material is certainly partially because the incident had such a high impact - few companies have a technical glitch that puts them out of business so quickly. I believe that there’s more to it, however - this type of response is an attempt by the community to make sense of the incident because the company itself never released a public postmortem. This is an incredibly interesting case because a production bug and operational failure actually precipitated the collapse of a seemingly successful business - but the lack of a public postmortem exposed the company to all kinds of baseless speculation about lackadaisical attitudes towards change controls, testing, and production changes (see various citations, especially 11, 12). It would also seem that there was not an internal postmortem, or that it was not well circulated, based upon the Knight Capital CEO’s comments to the press (2).

As John Allspaw notes in his blog (3), one of the worst consequences of Knight’s reticence was news companies and bloggers using the SEC investigation as a substitute for a postmortem. This was harmful to the business and particularly to the engineers involved in the incident. The SEC document is blamey. It’s supposed to be blamey. It details the incident timeline and outlines procedures that should have been in place to prevent an error - and in doing so it focuses entirely on what was lacking from their outside perspective. What it doesn’t do is accurately explain how the event came to be. What processes WERE in place that the engineers relied upon? What change controls WERE being used? What went right and what will be done to ensure this doesn’t happen in the future?

Did Knight Capital go out of business because they lost a bunch of money in a catastrophic way? Sure. But their core business was still a profitable model - it’s conceivable that they could have received a bailout, continued operations, and gotten out of the hole created by this massive failure. Unfortunately, they failed to demonstrate to their investors and to the public that they were capable of doing so. By failing to release a public document, they allowed the narrative to be controlled by news sites and bloggers.

Taking a look at IaaS provider outages

Infrastructure providers are in a unique position where they have to release postmortems to all of their customers for every outage, because all of their customers’ business systems rely upon IaaS uptime.

AWS experienced an outage that spanned April 21st-April 24th, 2011 and brought down the web infrastructure of several large companies such as Quora and Hootsuite. The incident began when someone improperly executed a network change and shunted a bunch of traffic to the wrong place, which cut a ton of nodes off from each other. Because so many nodes were affected at one time, all of them trying to re-establish replication and hunt for free nodes caused the entire EBS cluster to run out of free space. This generated a cascading failure scenario that required a massive amount of storage capacity in order to untangle. Recovery took quite a while because capacity had to be physically added to the cluster. The postmortem was published via Amazon’s blog on April 29th, 2011. This incident is notable because it was somewhat widespread (affected multiple availability zones) and resolution took longer than 24 hours - making it one of the largest outages that AWS has experienced. AWS has a response pattern that is characterized by communication throughout; updates to status page during the incident, followed by a detailed postmortem afterwards (4). Amazon’s postmortem structure seems to be consistent across multiple events. Many seem to use roughly this outline:
  1. Statement of Purpose
  2. Infrastructure overview of affected systems
  3. Detailed recap of incident by service
  4. Explanation of recovery steps & key learnings
  5. Wrap-up and conclusion
From this we can learn two things: Firstly, we know that Amazon has a postmortem process. They are pursuing specific goals around analyzing the failure of their service. Secondly, we know what they want to communicate. Primarily, they want to explain why the failure occurred and why it will not happen again in the future. They also provide an avenue for disgruntled stakeholders to reach out, receive compensation, get additional explanation, etc.

Azure experienced a similar storage failure in 2014 and we see a similar response from them - immediate communication via status pages, followed by a postmortem after the incident (5).

Taking a look at how the media approaches these failure events, it’s worthy of note that the articles written about the outages include links to the postmortem itself, as well as status pages and social media (6,7). Because the companies are communicative and providing documentation about the problem, the journalist can disseminate that information in their article - thus allowing the company that experienced the failure to control the narrative. Because so much information is supplied, there’s very little speculation about what went right or wrong on the part of individuals or journalists, despite the outage events impacting a huge number of individuals and companies utilizing the services themselves or software which relied upon them.


So, while postmortems are often considered a useful tool only from an engineering perspective, they are critical to all parts of a business for four reasons:

  • Resolving existing issues causing failures
  • Preventing future failures
  • Controlling public perception of the incident
  • Helping internal business units and stakeholders to understand the failure

Having a process is equally critical, and that process needs to be informed by the needs of the business both internally and externally. A process helps ensure that the right questions get asked and the right actions are taken to understand and mitigate failure. With the typical software development lifecycle increasing in speed, availability is becoming more of a moving target than ever. Postmortem processes help us zero in on threats to our availability and efficiently attack problems at the source.

About the Author

My name is Gabe Abinante and I am an SRE at ClearSlide. You can reach me at @gabinante on Twitter. If you are interested in developing or refining a postmortem process, check out the Operations Incident Board on GitHub.

Citations / Additional Reading: