The worst that could happen (part 2) – NT4 SP2

My second job after college was a big step up for me. I was working on dotcom projects with a hot new company in the mid-’90s. I was working at the cutting edge of technology, and some of my work was bleeding-edge stuff. Very few people were willing to try what I was doing. I felt a little like Tom Cruise (minus the good looks), constantly facing adventure and excitement. Sometimes my projects threw some great challenges at me, but I was always able to resolve them.

This time, I was working with a stack of new technologies: ASP, COM, DCOM, DHTML. Most people had never heard of these (at the time). My project was demonstrating how hot and useful these technologies were.

After three months, I had a working example with a few web pages. They used these technologies to connect into a “mini” on the back-end (a mini is like a small mainframe). Only two other guys and I understood each of the parts of it. The customer loved the demo and was excited to see it “go live”. They were way ahead of anyone else in their industry.

Back then, hardware was very expensive. This company chose to spend its money on programmers and not hardware. They didn’t have money to blow on frivolous things like “test servers” or “staging environments”. No way. Each project went straight from the developer’s machine to the production web server, just like every other cutting-edge project that they made.

My project was hurried into production. The server admin walked me into the server room, logged into the server and handed me the keyboard. “All yours”. For a moment, I did a quick gut-check. Then I remembered to ask “This thing is backed up, right?” The server admin said “Sure. Plus, we have all of the install disks. We’ve broken it before, and we know how to fix it. I’m not worried.”

I exhaled and thought to myself, “Okay, if he is not worried, then there is no reason for me to be worried. It is his server”. I started installing experimental drivers and registering DLLs and loading processes into memory. After fifteen minutes, I was ready to reboot and see how close I was getting. I asked the admin if there was any reason to wait. He said “Nope. Go for it”.

The server came back up fine. My program wasn’t fully working, but it was close. Some of the DLLs were out-of-date. I updated them and restarted a few times. Things were going well until, instead of rebooting, the server gave me “The Blue-Screen of Death”. The server would not reboot. WTH!

As I sat there thinking about it, somebody came into the server room and let us know that the VP was annoyed that the web site kept on bouncing (up and down). He wondered when we would be done. Golly, what bad timing. “Please come back in ten minutes”.

I asked the server admin if he had any suggestions. He looked at me and shrugged, “You are the genius here”.

I reminded him about his statement, that this happened all of the time and they just fixed it each time. He said, “Yeah, that is your job. I’m sure you can handle it”. Um, what? Yes, that is right. When he said “We usually fix it”, what he meant is that somebody other than him would fix it. This time, it was my turn. I wish he had been a little more clear about that before I started replacing DLLs without keeping backup copies of each.

Fast-forward through 6 hours of painful try/fail. Add to that the guy who kept coming in (every 20 minutes) to remind us that the VP wants to know “how much longer”. I am starting to run low on ideas for things to try. My usual optimistic attitude is also starting to fade.

Just then, I suddenly realize that it is after 6 pm. “Oh my gosh, I nearly forgot! I have to teach a class at the local community college tonight. I need to leave in 45 minutes!” There is no way that I can get this done in 45 minutes.

I call my boss and explain my situation. He tells me to call a colleague who is working nearby. My colleague sounds pretty irritated, but is willing to step in for me. I leave to go teach the class.

After the class lets out at 10 pm, I try calling the office to see if they are done. Luckily, nobody answers. “Excellent. Nobody is there. That means everything must be fixed and they went home”.

The next day, I show up at work and I am asked to step into the VP’s office. I get an earful about taking down the server for half a day and then walking away from a downed server (that I broke). Apparently, my colleague was up till 2 am, with Microsoft support, fixing things. I can forget about receiving a fruit basket from him this Christmas.

The root of the problem, apparently, was that the server had NT4 SP2 installed on it, which meant that it was a miracle that the machine ran at all. Only Microsoft could support a machine with NT4 SP2 on it (if you got the right guy on the phone). Who could have known that?! (Maybe the sysadmin, I suppose.)

The good news was that everything was working again, and my project was a success. The bad news (or maybe not) was that I was no longer allowed in the server room.

Lessons learned: 1) Make your own backups. 2) Keep track of what you are doing when you mess with a server (even if it is on a snapshotted VM). 3) Never do your first install on a production server. ALWAYS try it on a crash-test dummy (test VM). 4) Never ask the sysadmin if it is okay to bounce a server. Ask his boss (or manager, or VP).
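Lessons 1 and 2 are easy to automate. Here is a minimal sketch in Python of what "keep a backup copy of each file and write down what you changed" could look like. It is an illustration only, not something that ran on that NT4 box: the function names `replace_with_backup` and `restore`, and the `changes.jsonl` log file, are made up for this example, and it assumes the file being replaced already exists.

```python
import hashlib
import json
import shutil
import time
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def replace_with_backup(target: Path, new_file: Path, backup_dir: Path) -> dict:
    """Back up `target`, overwrite it with `new_file`, and log the change.

    Hypothetical helper for illustration. Returns the log entry, which
    `restore` can use to roll the file back.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)

    # Number the backups so repeated replacements never overwrite each other.
    index = len(list(backup_dir.glob(f"{target.name}.*.bak")))
    backup_path = backup_dir / f"{target.name}.{index:03d}.bak"

    # Copy the original aside BEFORE touching it.
    shutil.copy2(target, backup_path)

    entry = {
        "when": time.strftime("%Y-%m-%d %H:%M:%S"),
        "target": str(target),
        "backup": str(backup_path),
        "old_sha256": sha256(target),
        "new_sha256": sha256(new_file),
    }

    # Now it is safe to replace the file.
    shutil.copy2(new_file, target)

    # Append one JSON line per change, so the log reads top to bottom.
    with (backup_dir / "changes.jsonl").open("a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def restore(entry: dict) -> None:
    """Put the backed-up copy from a log entry back in place."""
    shutil.copy2(entry["backup"], entry["target"])
```

With something like this, every replaced DLL leaves behind a numbered `.bak` copy and a line in the log, so undoing the last six hours means walking the log backwards and calling `restore` on each entry, instead of guessing which file broke the machine.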


About Tim Golisch

I'm a geek. I do geeky things.
This entry was posted in IT Horror Stories, Lessons Learned.