Deployments failing? How to stop nuking your prod.

You’ve been there before.  Your team has busted their backs to get a bunch of fixes and features done.  You are ready to deploy to prod.  You pass your gut-check and roll it out.  The servers don’t crash initially, and you hold your breath waiting for the other shoe to drop.  Silence for 10 minutes, 30 minutes, 45 minutes, and then the complaints come in like the tide.

A deployment mix-up once in a while (1-2x per year) is not that unusual.  However, if mayhem is starting to seem more like the norm than the exception, it will eat away at you.  Deployments are supposed to be a time of joy and celebration, like the harvest.  However, if the crops are wormy or rotten, there is no celebration.

Steps Leading Up To Deployment

First, let’s look at the actions that led up to the deployment failure:

  1. Development – Developers wrote some fixes, but maybe they didn’t do the best job of logging every single change, especially on the database.  Gosh, there were so many changes.  It is really hard to keep track of EVERY SINGLE ONE.  However, if one single change gets missed in your deployment, it could scuttle the whole thing.  Honestly, this probably is your whole problem (there is a small sketch of one fix for this, right after this list).
  2. Test – Testing confirmed every one of those fixes, but we didn’t end the testing cycle with a thorough test to cover every square inch of the app.  We are simply trusting that there were no side-effects from any of those fixes.  I’m sure it is okay.  Golly, there was so much testing going on, I’m sure someone would have noticed and reported any side-effects.  Unless they couldn’t tell that it was a real problem, or they assumed that somebody else was working on it.
  3. QA – It worked in QA.  It would be so much easier if we could just copy the QA code and database up to prod.  Except that prod is already full of its own data and QA has a bunch of bogus data for imaginary customers, like Bart Simpson and Harry Butz.  Eek!  We can’t take a chance on any of our test data going to prod.
  4. Dependable build tools – We used RedGate to identify the differences between the two environments.  Doing it by hand would take FOREVER.  So, the tool has done all of the thinking for us.  If things go bad, we can blame the tool.  What other choice do we have?
  5. Deployment – The build for Prod compiled, no problem.  If things go bad, we need to be able to undo this deployment.  I’ll just ask the server admins if they made a backup and then it becomes their responsibility.  I’ll rip off the bandaid and if blood comes spraying out, then the sys-admin will repair the artery.  They better have their “A game” ready.  Otherwise, I’ll be glad that I’m not downwind of those guys.
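To pick on item #1 for a second: the cure is boring bookkeeping.  If every database change lives in its own numbered script, and something records which scripts have already been applied, then nothing gets forgotten on deployment day.  Here is a minimal sketch of that idea in Python.  The migrations folder, the schema_changes table, and SQLite as the engine are placeholders for whatever your shop actually uses, not a prescription.

```python
import sqlite3                     # stand-in engine; the idea works with any database
from pathlib import Path

# Placeholder folder of numbered scripts, e.g. 001_add_invoice_index.sql, 002_fix_tax_column.sql
MIGRATIONS_DIR = Path("migrations")

def apply_pending_migrations(conn: sqlite3.Connection) -> None:
    """Apply every numbered script that has not been recorded yet, in order."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_changes (name TEXT PRIMARY KEY, applied_at TEXT)"
    )
    already_applied = {row[0] for row in conn.execute("SELECT name FROM schema_changes")}

    for script in sorted(MIGRATIONS_DIR.glob("*.sql")):
        if script.name in already_applied:
            continue                                    # this change already made it out
        conn.executescript(script.read_text())          # run the change
        conn.execute(
            "INSERT INTO schema_changes (name, applied_at) VALUES (?, datetime('now'))",
            (script.name,),
        )
        conn.commit()
        print(f"applied {script.name}")

if __name__ == "__main__":
    apply_pending_migrations(sqlite3.connect("app.db"))  # "app.db" is just an example target
```

The point is not the tool; it is that the list of changes gets written down by a machine instead of by somebody’s memory.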

Okay, some of those sounded a little exaggerated, but it probably was scary how accurate some of them sounded.

I could go over this list, step by step, and identify how to fix each one.  Some of the problems were pretty obvious.  I’ll save you the time, because you probably could guess what could be done to improve each of these.  The only downside is that it would take a lot of discipline to fix any of them (consistently).  Multiply that by five and you start to lose hope.

*sigh* If only there was one single thing that could be done to fix all of them, or at least protect your prod server from this mayhem.

Solution

The solution is subtle and you might not recognize the value of this solution until you try it and it saves your back-side:  Add a “staging” server.

The entire process (steps 1-4, above) is run by developers.  Those salty pirates of 1s and 0s, who learn to speak Klingon and Elvish in their spare time (for fun), are running most of the show.  However, when it comes to step #5, repeat after me: DO NOT LET DEVELOPERS TOUCH PROD!  Never, never!  Unless the world is at DEFCON 1, you better keep those scurvy dogs out of your Prod server.

The last step for the developers should be: put together a deployment package with all of the deployment files, scripts, etc. and a document that describes how to deploy them.  The end.  After that, the developers hand it over to a server admin (non-developer) and wait to see if it is a baby boy or girl.
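What should be in the package?  Whatever the admin needs to deploy it without calling you.  As a sketch (every file and folder name here is made up for illustration), it could be a folder containing a manifest, a deploy document, the compiled bits, and the database scripts, plus a tiny script that confirms nothing in the manifest got left behind before the hand-off:

```python
import json
from pathlib import Path

# Hypothetical package layout -- adjust to whatever your team actually ships:
#   deploy_package/
#     manifest.json   lists every file that belongs in the package
#     DEPLOY.md       step-by-step instructions for the server admin
#     app/            compiled binaries / site content
#     sql/            numbered database scripts

def verify_package(package_dir: str) -> bool:
    """Confirm that every file named in the manifest actually made it into the package."""
    root = Path(package_dir)
    manifest = json.loads((root / "manifest.json").read_text())
    missing = [name for name in manifest["files"] if not (root / name).exists()]
    for name in missing:
        print(f"MISSING: {name}")
    return not missing

if __name__ == "__main__":
    ok = verify_package("deploy_package")
    print("package is complete" if ok else "package is incomplete -- do not hand it off")
```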

At this point, your server admin should already have two things ready:

  1. A fake prod environment (known as a “staging” server).  It should be running on a VM.  It should be nearly identical to prod in every way, but is not connected to your network and cannot interact with other machines in the network (so it doesn’t inject, alter or delete vital data from any REAL servers).  Your admin will confirm that it is working before you do a deployment test.
  2. A test plan.  This plan will confirm that all of the critical sub-systems are working after the deployment package has been installed.  Then, it will dig down into the darkest depths of your software and confirm that everything is working, as expected.  (A bare-bones version is sketched right after this list.)
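That test plan does not have to be fancy.  Even a short script that pokes the critical sub-systems will catch a dead web site or an unreachable database in seconds.  Here is a bare-bones sketch; the host names, ports, and URLs are placeholders I made up, so point them at whatever staging actually runs:

```python
import socket
import urllib.request

def page_ok(url: str) -> bool:
    """True if the page answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.status == 200

def port_open(host: str, port: int) -> bool:
    """True if something is listening on host:port (e.g. the database)."""
    with socket.create_connection((host, port), timeout=5):
        return True

# Placeholder checks -- swap in your real critical sub-systems.
CHECKS = [
    ("web front end",  lambda: page_ok("http://staging-web/health")),
    ("reports module", lambda: page_ok("http://staging-web/reports")),
    ("database port",  lambda: port_open("staging-db", 1433)),
]

def run_smoke_tests() -> bool:
    all_passed = True
    for name, check in CHECKS:
        try:
            passed = check()
        except Exception as exc:
            passed = False
            print(f"FAIL  {name}: {exc}")
        else:
            print(f"{'PASS' if passed else 'FAIL'}  {name}")
        all_passed = all_passed and passed
    return all_passed

if __name__ == "__main__":
    raise SystemExit(0 if run_smoke_tests() else 1)
```

Run it once before the practice-deployment (to prove staging was healthy to begin with) and once after (to prove the deployment didn’t break it).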

Here are the official steps for using these two crucial tools:

  1. The admin runs (turns on) the staging VM.
  2. The admin restores backups (or refreshes snapshots, etc.) from Prod onto the staging servers.  This is also a valid test of your backups (a really good idea).
  3. The admin confirms that the staging environment is as close to prod as possible and everything works.  This is critical, because if your practice-deployment fails, you want to be sure it was a failed deployment, not that your staging server was already broken before you even started, leaving you to waste a day or two chasing down the wrong problem.
  4. The admin makes snapshot(s) of the VM (in case the deployment fails, it is much easier to roll back the VMs than to wipe and reinstall).
  5. The admin executes the deployment plan on the staging server.
  6. The admin tests (or has tests performed by an assistant) to confirm whether the deployment was a success or a failure.
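If your admin likes to script things, those six steps boil down to a very plain runbook.  Here is a rough sketch; every command it calls (vm-start, vm-snapshot, restore-prod-backups.sh, deploy.py, smoke_tests.py) is a made-up stand-in for whatever your virtualization, backup, and deployment tooling actually provides:

```python
import subprocess
import sys

STAGING_VM = "staging-01"     # placeholder VM name
SNAPSHOT   = "pre-deploy"     # placeholder snapshot label

def run(step: str, cmd: list) -> None:
    """Run one step of the runbook and stop the whole rehearsal if it fails."""
    print(f"==> {step}: {' '.join(cmd)}")
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"'{step}' failed -- fix staging before anyone touches prod")

if __name__ == "__main__":
    run("start staging VM",       ["vm-start", STAGING_VM])
    run("restore prod backups",   ["restore-prod-backups.sh", STAGING_VM])
    run("verify baseline",        ["python", "smoke_tests.py"])
    run("snapshot before deploy", ["vm-snapshot", STAGING_VM, SNAPSHOT])
    run("run deployment package", ["python", "deploy.py", "deploy_package"])
    run("verify after deploy",    ["python", "smoke_tests.py"])
    print("staging rehearsal passed -- prod deployment is a go")
```

Notice that the baseline check runs before the snapshot and the deployment, so if staging was already broken, the script stops before anybody wastes a day blaming the deployment package (step #3 above, in a nutshell).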

If all goes well, then the prod deployment will be a slam-dunk.  Easy cheesy!

Here are some possible caveats to this process:

  • If the deployment to staging fails and the developers are baffled – It is okay to have the developers look at the carnage on the staging environment and even fiddle with it, AFTER testing is completed (on staging, not prod).  The developers can fix their stuff and try again, but this means that the sys-admin needs to start all over again: roll back the VM snapshot.  The testing doesn’t count until the sys-admin can perform the deployment, unassisted, using only the files and docs that were provided to him.
  • If everything went well, but a bug still slipped through to prod – This is a tough one because it means your tests didn’t catch it.  Your only choices are to improve your tests (and spend more resources on testing) or simply accept the fact that this will happen periodically.  Keep in mind that the more thoroughly you test before going to prod, the more likely you are to catch one of these rascals.
  • If Prod integrates with so many systems that it is (nearly) impossible to emulate them all – Sometimes you need to compromise and be reasonable.  However, the more closely you can emulate the real thing, the safer you will be.  I’ve had to write emulators (programs) for a mainframe so my programs could be tested (reasonably).  It was weird and felt a little wasteful, but it worked and our tests were better as a result.  You get what you pay for.

This process has saved my back-side so many times, I can’t even count them all.  It is hard to believe how well this works until you try it.

So, if you are ready to stop nuking your prod every time you do a deployment, leet-up and commit to this process.
