With the end in mind

Stephen Covey says the 2nd habit of highly effective people is to begin with the end in mind. This is also one of the core skills of a senior developer. The last 10% of a project can be the beginning of a nightmare if you don’t know what to expect or if you haven’t prepared for it properly.

So, what could happen in that last 10% that is so difficult? Why does it happen (or why should you expect it to)? And what can you do to prevent or prepare for the bad stuff?

Setting expectations

Nearly every software development methodology will tell you that you need a “definition of done”. It is important because it gives everyone a chance to agree upon the outcome and to know (definitively) when you’ve reached your goal. It becomes obvious what happens if the goal is moved (at all), and the team navigates steadily towards a clear objective. If anyone was unclear (at any point), this is how/when we clarify things. Sometimes it may require some negotiating, or maybe just some coping. Get it all resolved early in a project, before you’ve dug in enough to make it difficult to adjust.

I know “done” sounds like the end. It isn’t. The “next step” after “done” also fits into that last 10% of “the end” and needs to be discussed and maybe negotiated (or, um, dictated). If you haven’t talked about what happens before and after you’ve “reached your goal”, then you haven’t properly set expectations, which means they are almost guaranteed to be misaligned. You need to do those two things and confirm they are cemented as “an agreement” and “definitive”. Put a “#1” and “#2” next to those and circle & underline them.

It might sound a little aggressive to imply that someone could/should “dictate” what will happen. I can tell you from experience that if you are working with folks who are out of touch with reality, then reality will happen anyway, it will catch those folks off guard, and they probably won’t take it well.

If you still can’t imagine what I’m saying, please take a moment to review the fallacy of expecting flawless software and why it is (effectively) impossible to achieve for anything of substance. When everyone is willing to have an honest & productive discussion about how to handle flaws, bugs, and problems, you can plan how to handle them. Again, I’m not suggesting anyone should be cavalier or dismissive about mistakes/problems, just be real.

When things go bad

Okay. Once we acknowledge that “support tasks” are likely to happen, we need to have a reasonable discussion of how to handle those situations and consider what this will look like. The proper process for handling problems should go like this:

  • Triage – Ope, we have a problem. Before I start applying duct tape and super glue, I must understand what is wrong, how severe it is, and how extensive it is. Then isolate the cause. Then determine possible course(s) of action (fixes/treatments) and, finally, weigh the options.
  • Prioritization – If you only have one problem at a time, wow! Congrats! However, if you live in the real world, then sometimes a) there is more than one problem, or b) the problems & solutions aren’t all equally significant, or c) you actually have options, or d) you have bigger fish to fry right now (or any combination of these), and you will have to prioritize and arrange schedules and timelines (more expectation setting). Everything can’t be #1 (or 0.9, 0.82, etc.).
  • Fix & release cycle – This requires time and planning. If this fix could actually wait till your next scheduled release cycle, then it doesn’t disrupt your timelines & budget. Otherwise, maybe consider an alternative “dev/ops” approach, where you use an abbreviated/compressed dev/release cycle for some prod break/fix stuff. Then again, maybe this one is so hot that it justifies the use of “cowboy” action (just once). Define the conditions and process for each of those levels of severity, so you choose to act upon a plan instead of emotions or panic.
  • Turn-around time – If you haven’t set expectations for turn-around time, folks will invent their own, and those will be unrealistic.
  • Closing it down for a while – Define support timing milestones/boundaries. Example: if a web site is down for more than 20 minutes, you will put up a maintenance message. After 1 hour, send an email to folks (users, support, management). If it is down for 2 hours, fail over to your DR site. After 24 hours, start rebuilding servers, etc. (A sketch of such a playbook follows this list.) IT folks (developers, etc.) will always “have a feeling” like “the solution is only minutes away”, so don’t bother asking. That kind of optimism is necessary to overcome adversities like this one. Don’t mess with it. Instead, appoint a manager to monitor the clock and run the playbook, as defined.
  • Analysis – Learning from your mistakes. After the problem has been identified and fixed, let people catch their breath. Within the next day or two (while it is fresh in their minds, they have had time to contemplate/reflect a little, but it hasn’t been repressed yet), get together and talk about what happened. What led up to this? Evaluate your response process (good, bad, worse). Where is there room for improvement, and how?
  • Preparing for the next round – Maybe you could use an ounce of prevention, earlier detection, or adjustments to your response actions & timing. Your department has layers, and each layer should learn from each experience and prepare to handle the next one better. Hopefully, this doesn’t happen often enough to get good at it, but get good at it anyway.
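
To make “act upon a plan instead of emotions” concrete, here is a minimal sketch of what such a playbook might look like if you wrote it down as configuration. Everything in it is hypothetical: the severity names, the time thresholds, and the actions should come out of your own discussion, not out of this example.

    # outage_playbook.py - a hypothetical, minimal escalation playbook.
    # The severities, thresholds, and actions below are examples only;
    # yours should come out of the discussion described above.

    from dataclasses import dataclass

    @dataclass
    class EscalationStep:
        after_minutes: int   # how long the problem has been open
        action: str          # what the on-call manager does at that point

    PLAYBOOK = {
        "severity-1 (site down)": [
            EscalationStep(20, "put up the maintenance message"),
            EscalationStep(60, "email users, support, and management"),
            EscalationStep(120, "fail over to the DR site"),
            EscalationStep(1440, "start rebuilding servers"),
        ],
        "severity-2 (feature broken)": [
            EscalationStep(0, "log a ticket and triage"),
            EscalationStep(480, "schedule an expedited fix/release cycle"),
        ],
        "severity-3 (cosmetic)": [
            EscalationStep(0, "queue it for the next normal release"),
        ],
    }

    def due_actions(severity, minutes_elapsed):
        """Return the playbook actions that should already have happened."""
        return [step.action
                for step in PLAYBOOK.get(severity, [])
                if minutes_elapsed >= step.after_minutes]

    if __name__ == "__main__":
        # Example: the site has been down for 75 minutes.
        for action in due_actions("severity-1 (site down)", 75):
            print(action)

The manager watching the clock only has to answer two questions: which severity is this, and how many minutes have elapsed? Everything else was decided in calmer times.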

Preparing for bad things

The good news: even though bad stuff is inevitable, you can make it go away quicker. Here are a few things which can be done ahead of time to shorten support timelines and turn resolutions around quicker:

  • Error handling – It is the metal detector which finds all of the needles in all of the haystacks. If you have adequate error handling and logging, and it is aggregated and managed properly, your folks should already be working on a solution before users report a problem (a sketch follows this list).
  • Detection lag – I have been on projects where management expected detection & response within 15 minutes. We rarely met those expectations, except when we did nothing else (e.g., no programming, only log monitoring). IMO, it is better to observe the natural (organic/default) detection & response time and start with that as an expectation. Can your users deal with that response time? You can still do things to reel it in quicker (emails, SMS, fire alarm, etc.), but start with that “control group” (it is what it is) as a starting point, then talk about ways to improve and what that will cost.
  • Tracing – Now this is the “next level” stuff. Sometimes a flaw happens but it doesn’t exactly result in an “error”; it is like when a tree falls in the woods and no one is around to hear it. Some parts of a system are more prone to issues, surprises, lag, communication hiccups, slowdowns, etc. Tracing those areas will save you days or weeks. Use configs to turn tracing “on” when you need it and leave it off when you don’t (a second sketch follows this list). The only downside to tracing is that it requires a non-trivial investment and experience to do it well. But when it finds the problem, you will instinctively jump in the air, shout hurray, and high-five folks. It truly is that beautiful when it works correctly.
  • Cowboy vs DevOps vs normal patch cycle – You might sound like a madman when you broach this topic. Is it ever appropriate to “go cowboy” on a problem? The short answer is “hmm, rarely”. So it is you & your boss’s job to define “when is it okay”, long before it happens. It is a powerful, uh, power, and you don’t want to abuse it (Spider-Man movie quote omitted). I might suggest you treat it like police “use of deadly force”. Audit the use of it: require reviews/interviews (maybe an incident report) afterwards. A deterrent and a check/balance will prevent unwarranted abuse of “cowboy mode”, while keeping it available if you ever really need it.
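
To illustrate the error-handling point: the handler’s job is not just to catch the exception, it is to record enough context, in one aggregated place, that your folks can start working the problem before the first user calls. Here is a minimal sketch using Python’s standard logging module; the function, messages, and log destination are made up, and a real system would ship the log to whatever aggregator you use.

    # error_handling_sketch.py - hypothetical example of "catch, log with
    # context, and aggregate" rather than swallowing errors silently.

    import logging

    # In a real system this would ship logs to your aggregator
    # (syslog, a log service, etc.); here it just writes to a file.
    logging.basicConfig(
        filename="app.log",
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    log = logging.getLogger("billing")

    def post_invoice(order_id, amount):
        """Pretend to post an invoice; return False if it fails."""
        try:
            if amount <= 0:
                raise ValueError(f"invalid amount {amount}")
            # ... call the real billing service here ...
            return True
        except Exception:
            # log.exception records the stack trace plus the context
            # we will need at 2 a.m.: which order, what amount.
            log.exception("post_invoice failed for order=%s amount=%s",
                          order_id, amount)
            return False

    if __name__ == "__main__":
        post_invoice("A-1001", -5.00)   # lands in app.log with a stack trace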
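And to illustrate the tracing point: the important detail is the configuration switch, so the extra detail costs you nothing until you need it. This sketch uses a made-up environment variable (TRACE_SLOW_PARTS) to flip it on; a config file or feature flag works the same way.

    # tracing_sketch.py - hypothetical config-driven tracing toggle.
    # TRACE_SLOW_PARTS is off in normal operation; flip it on when you are
    # hunting one of those "no error, but something is off" flaws.

    import functools
    import logging
    import os
    import time

    TRACE_SLOW_PARTS = os.environ.get("TRACE_SLOW_PARTS", "0") == "1"
    trace_log = logging.getLogger("trace")

    def traced(func):
        """Log entry, exit, and elapsed time, but only when tracing is on."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not TRACE_SLOW_PARTS:
                return func(*args, **kwargs)   # zero extra noise when off
            start = time.perf_counter()
            trace_log.info("enter %s args=%r", func.__name__, args)
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                trace_log.info("exit %s after %.3fs", func.__name__, elapsed)
        return wrapper

    @traced
    def sync_inventory(warehouse_id):
        # A stand-in for one of those flaky integration points.
        time.sleep(0.2)

    if __name__ == "__main__":
        logging.basicConfig(level=logging.INFO)
        sync_inventory("WH-7")   # traces only if TRACE_SLOW_PARTS=1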

The beginning of the end

Planning for bad stuff might initially sound too pessimistic. Some folks might even misinterpret it as granting “wiggle room” to the dev team, thereby aiming for lower quality or a lack of discipline. It can be a delicate procedure to set these expectations.

In my experience, somebody will always ask: “Well, what if some magic occurs and no bugs show up? All of this planning-for-problems might be a wasted effort!” Well, if that happens, buy a lottery ticket, because let’s face it: you have unusual luck. In the real world, you never get that lucky. So don’t plan for luck. Plan for problems. I could even offer you my money-back guarantee: I guarantee problems, “or double your wasted efforts back!” (I’m being funny, but it does seem like a law of physics or something.)

If you have realistic expectations, plan accordingly, and prepare properly for “THE END” (of your release cycles), everything is more likely (by magnitudes) to go as expected, your team will be prepared, and you will thrive in the real world.
