Years ago, I was handed a project that had a bad reputation. The program extracted data from my employer’s database, shipped it to an external vendor for intermediate processing, received it back, and finally loaded it into a reporting database.
The reason that program had such a bad reputation was that periodically, somewhere along the line, the data would get honked up. We only knew something had gone wrong when the IT help desk got a complaint from the end users saying that something was wrong with one of the reports. The usual way of handling these complaints was to blame the vendor: “It wasn’t us. The vendor screwed up our data again.” Then somebody from IT would get the vendor on the phone, and every time, the vendor’s response was, “Nope, everything is fine here. The problem must be on your end.” And thus began our adventure.
Each time it happened, it took an IT support person one to two person-days to track down where the problem had occurred. It was a real productivity killer. The people who handled it needed patience and perseverance. Unfortunately, in this situation, those traits may have worked against them. They needed somebody a little lazier, somebody who would want to fix it once and for all. They needed a programmer.
When this project was first handed to me, I burned up a person-week trying to track down some missing data. It felt like an eternity. The previous support person, who had been handling it, had a good laugh about it. It must have felt satisfying, watching me struggle with it.
For me, the first straw was also the last straw. After only one support ticket, I had already had enough. I couldn’t afford to lose a week, or even a day, on this every time it recurred. There had to be a better way.
Usually, fixing a buggy program is easy. Step 1: add error logging. Step 2: check the logs until an error shows up. Step 3: fix the error. Sometimes you can add a step 4, like “do a touchdown dance” or “talk some smack about your kung-fu.”
If the error had occurred in only one program, it would have been easy: I could have applied the technique above and moved on. However, there were six different potential points of failure, and each time, the pipeline seemed to fail at a different point for a different reason. So the real problem was the cornucopia of opportunities to fail.
The ideal solution would have been to eliminate some of those points of failure by consolidating the system, but that was not a reasonable option. Instead, the second-best choice was to set checkpoints and monitor all of them with intermediate diagnostics. This would be a “Thanksgiving Day buffet” of diagnostics.
I started by adding logs to track the data as it moved from one point to the next. Then I built a few reports that made it easier to check the logs and the data, with some summaries and sorting/filtering options. When the IT help desk got a complaint, it usually said something like, “Account number 123 is seeing blank data for April 1” (no joke). With the new diagnostic reports, I could look at the data at checkpoint 1 and see if it was correct, then checkpoint 2, then 3, and so on. At some point, the data would wash out. Then I would check the logs to see whether any errors had been reported. These diagnostics were like using a metal detector to find a needle in a haystack. (beep……beep…..beep..beep.beep.beep.beebeebeebee-ee-ee-ee. Pwned!)
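The original system and its reports are long gone, but the idea can be sketched in a few lines. This is a minimal, hypothetical illustration (the function and field names are my own, not from the actual project): at each checkpoint, summarize every (account, date) slice with a row count and a checksum, then compare adjacent checkpoints to find the first hand-off where data washed out.

```python
# Hypothetical checkpoint-diagnostics sketch. Each checkpoint produces a
# summary keyed by (account, date); comparing adjacent summaries pinpoints
# the hand-off where data changed or disappeared.
import hashlib
from collections import defaultdict

def snapshot(rows):
    """Summarize a checkpoint: row count and checksum per (account, date).

    Note: the checksum here is order-sensitive across rows sharing a key;
    a production version would sort rows first or combine hashes
    order-independently.
    """
    summary = defaultdict(lambda: {"count": 0, "digest": hashlib.md5()})
    for row in rows:
        key = (row["account"], row["date"])
        summary[key]["count"] += 1
        summary[key]["digest"].update(repr(sorted(row.items())).encode())
    return {k: (v["count"], v["digest"].hexdigest())
            for k, v in summary.items()}

def first_divergence(snapshots):
    """Given [(name, summary), ...] in pipeline order, return the first
    adjacent pair whose summaries differ, plus the keys that vanished."""
    for (name_a, snap_a), (name_b, snap_b) in zip(snapshots, snapshots[1:]):
        if snap_a != snap_b:
            missing = set(snap_a) - set(snap_b)
            return name_a, name_b, missing
    return None  # all checkpoints agree
```

With summaries like these on hand, a complaint such as “account 123 is blank for April 1” becomes a quick lookup rather than a week of spelunking: run `first_divergence` over the stored checkpoint summaries and read off which hand-off lost the data.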
The next time a complaint came in, I didn’t have to call the vendor, because I could see the problem was on one of our systems. After a few weeks, our vendor started feeling pretty lonely. Then the day came when some data disappeared and my diagnostics showed it had happened between checkpoints 4 and 5 (the vendor’s site).
When I called the vendor, they nearly forgot their usual routine: “Nope, everything is fine here. The problem must be on your end.” I guess I surprised them when I asked them to take a look at account 456 for June 30 and check their email, because I had just sent them two PDFs titled “Before” and “After.” I got no arguments, just silence. After a few minutes, they confessed that they had had a data-integrity problem the night before. This confirmed one of their suspicions and gave them something to focus on. It took them a mere four hours to resolve the problem, and we were able to confirm the fix immediately.
The point of my tale is this: some systems are buggy because of factors outside of your control, and there is nothing you can do to make them more stable. When that happens, your next-best course of action is to build diagnostics that make it quicker to pinpoint the problem. In this case, the diagnostics saved us an average of one person-day per incident and moved us in a more proactive direction. Eventually, we were able to pull those diagnostics together and monitor them more closely, so we could find problems before the customers did. We became proactive. Customers loved that. So did my boss.
The up-front effort was unpleasant and the solution didn’t seem ideal, but it worked pretty well.
So if you have any kind of distributed system, especially one that integrates with other systems, save yourself some time and headaches by implementing some checkpoint diagnostics. They will either 1) pay off, or 2) activate Murphy’s law as it applies to diagnostics: if you handle an error, you will never see that error again; if you don’t handle the error, it will bug the heck out of you.
Diagnostics: the leading pain-reliever for elitists.