Why panicked fixes are dangerous

A few months ago, I wrote a blog post about “panicked fixes”. I said they were bad, but I guess I wasn’t quite clear enough about why. Ironically, I’m the kind of guy who, if you tell me “Don’t do this, it’s bad or dangerous”, will be cautious but won’t be satisfied to simply take your word for it. I need to know why. Like, “What’s the worst that could happen?” and “How likely is that to happen to me?”

Let me walk you through a time when I made a panicked fix, and the rewards it brought me: I was responsible for maintaining an important web site. It broke. People were freaking out and acting like I owned a magic wand that fixes web sites, plus maybe ESP, because I was supposed to fix it immediately, before even looking at anything. To them, I appeared to waste ten minutes reading error logs while everyone ran around screaming. My calm, focused demeanor sent a mixed message: apparently, I didn’t seem to appreciate the true gravity of the situation. So several people tried to motivate me to move a little faster and take it more seriously, or something.

Eventually, you could say they got through to me. I felt pressure to just try something already, instead of just sitting there (thinking about it). “Doing something” did seem to soothe them. They felt better when I said stuff like “Okay, try it now.” They said “nope”, but then they seemed less edgy for a few minutes. So I kept doing that, even though it felt wrong inside.

Eventually I did figure out the real problem, but only after trying a dozen wrong things.

The bad news was that part of the web site was working now, but half of it seemed screwed up in various new ways. So I went back and repeated the same pattern. “Try it now,” I would announce. “Nope.”
I’m ashamed to admit that this actually went on for two days until everything was back to normal.

Afterwards, I had to deliver a root-cause analysis: 1) why did it break in the first place, and 2) why was everything so messed up for two goshdarn days?

I sort of had to guess which one of my fixes actually resolved the problem, because I had tried twenty things before it eventually worked. I can, however, tell you exactly which of those twenty things wound up breaking other stuff, because I had to undo the damage from each of them. That is what took two days. Not only did I troubleshoot the original problem, I also had to troubleshoot a half dozen other problems that (it turns out) were caused by me and my reckless attempts to guess at the solution. Those guesses didn’t fix anything, but I didn’t know that, because the site was already broken, so I couldn’t tell whether I was making it better or breaking it even worse.

Lessons learned:

  1. Prepare for disasters, so when things go bad you can follow a playbook and not go all cowboy
    * Have a plan for how you will divide and conquer the problem most effectively.
    * If you need to quickly acquire permissions to prod, have contact info handy and keep a good relationship with those contacts, so they trust and believe you when you say “it’s an emergency”.
  2. Make sure your management chain is aware of the plan, so they will have your back instead of joining the crowd that is screaming at you.
  3. Keep people informed periodically about what you are doing. It seems like a waste of time, but it keeps people out of your ear for a few minutes, so you can concentrate.
    * Make a quick initial assessment and give an estimate of how soon you can provide a better, more accurate ETA. Don’t promise anything until you have had time to assess the situation.
    * Put up a “System is down” web page or something to show that you are aware of the problem and are working on it. Ask your boss to “run interference” and keep folks away so you can concentrate.
    * Divide and conquer – split the problem in half and do some quick checks: Can I terminal into the database, the web server, the other servers? When do the logs show that it died? Are there any error events (on the servers, SAN, VM host) just before the outage? Give your boss an update on what you found so folks know you are doing your job. (There is a small triage sketch after this list.)
  4. Don’t be part of the problem
    * Understand the problem before trying stuff
    * Haste makes waste. Try one thing, check it, and if it didn’t work, undo it. Repeat. (See the one-change-at-a-time sketch after this list.)
    * Record what you are doing (voice recorder, screen recorder, anything), so you can review it later
    * Don’t give management and other support staff any reason to think you are some fool who cries wolf or might try some cowboy crap when things go bad. Earn that trust early and never, ever betray it.
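
To make the divide-and-conquer checks concrete, here is a minimal triage sketch in Python. The host names, ports, and log path (web01.example.com, db01.example.com, /var/log/myapp/error.log) are hypothetical placeholders, not anything from my actual site; substitute whatever your environment uses. The point is just to split the problem quickly: can I reach each tier, and what do the last error lines say about when it died?

```python
#!/usr/bin/env python3
"""Quick divide-and-conquer triage sketch.

All host names, ports, and the log path below are hypothetical placeholders;
swap in the real ones for your site.
"""

import re
import socket
from datetime import datetime

# Assumed names -- replace with your actual servers.
CHECKS = [
    ("web server", "web01.example.com", 443),
    ("database",   "db01.example.com",  5432),
    ("app server", "app01.example.com", 8080),
]

ERROR_LOG = "/var/log/myapp/error.log"  # hypothetical path


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def last_errors(path, count=10):
    """Return the last `count` log lines that look like errors, so you can see
    the timestamps just before the outage."""
    try:
        with open(path, errors="replace") as fh:
            hits = [line.rstrip() for line in fh if re.search(r"ERROR|FATAL", line)]
        return hits[-count:]
    except OSError as exc:
        return ["could not read {}: {}".format(path, exc)]


if __name__ == "__main__":
    print("Triage run at", datetime.now().isoformat(timespec="seconds"))
    for name, host, port in CHECKS:
        status = "OK" if can_reach(host, port) else "UNREACHABLE"
        print("  {:<11} {}:{}  ->  {}".format(name, host, port, status))
    print("Recent error lines from", ERROR_LOG)
    for line in last_errors(ERROR_LOG):
        print("  " + line)
```

Even a crude script like this gives you something factual to report (“web tier answers, database doesn’t, errors started at 09:42”), which is exactly the kind of update that keeps people out of your ear for a few minutes.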
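
And here is a minimal sketch of the one-change-at-a-time discipline from item 4: apply exactly one fix, verify it with a health check, undo it if it didn’t help, and write every step to a timestamped log you can review later. The site URL, the notes file name, and the example fix are all hypothetical; the apply/undo callables stand in for whatever change you are actually making.

```python
#!/usr/bin/env python3
"""One-change-at-a-time sketch: try one fix, check it, undo it if it failed,
and record every step. The URL, file name, and example fix are hypothetical."""

import urllib.request
from datetime import datetime

SITE_URL = "https://www.example.com/"  # hypothetical health-check URL
NOTES_FILE = "incident_notes.log"      # hypothetical incident log


def log(msg):
    """Append a timestamped note so there is a record to review afterwards."""
    line = "{}  {}".format(datetime.now().isoformat(timespec="seconds"), msg)
    print(line)
    with open(NOTES_FILE, "a") as fh:
        fh.write(line + "\n")


def site_is_healthy(timeout=5.0):
    """Minimal health check: does the home page return HTTP 200?"""
    try:
        with urllib.request.urlopen(SITE_URL, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def attempt_fix(description, apply, undo):
    """Try exactly one change. Keep it only if the health check passes;
    otherwise undo it before trying anything else."""
    log("APPLYING: " + description)
    apply()
    if site_is_healthy():
        log("KEPT: " + description + " (health check passed)")
        return True
    log("ROLLING BACK: " + description + " (health check still failing)")
    undo()
    return False


if __name__ == "__main__":
    # Hypothetical example of "one thing": the lambdas would normally run a
    # real command (restart a service, revert a config file, etc.).
    attempt_fix(
        "restart application service",
        apply=lambda: log("  (pretend the service was restarted here)"),
        undo=lambda: log("  (pretend the restart was reverted here)"),
    )
```

The point of the pattern is that the undo is decided up front, before the change is made, so you never end up with a pile of half-remembered guesses to unwind for two days afterwards.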
