The worst that could happen (part 3) – Oracle and The Blob

It was the biggest project of my career: a Y2K project. Tens of thousands of people would rely on this system to do their work each day. It had to be perfect and it had to be reliable.

There was no way that the customer was going to cut corners where it mattered. They dropped some pretty serious coin on a cutting-edge database: Oracle 8 Parallel Server on 64-bit HP-UX. I’m talking two very strong machines working in tandem, 20 RISC CPUs per machine. Did I mention this was back in ’99? Yeah. Back then, that was a lot of pasta!

The most impressive part was that the database was using the new Oracle Parallel Server. If one server got really busy, the other would be there to pick up the slack. We could even fail over if one of the DB servers crashed, but we were told that you would never see that happen, because “Hey, this wasn’t some MS SQL on NT 4 stuff. No-ho, my friend. This was a real man’s server. HP-UX never-ever-eeeevvvvveeeerrr went down.” And we had two of those babies.

As developers, our job was to treat those database servers with the proper respect that they were due.

Our team worked our fingers to the bone, getting the programming part of the project done (a few weeks early, at that). We spent the extra time doing some performance tuning, to make sure that we didn’t waste the DB’s CPU cycles. Of course, there were plenty to spare, but that is still no reason to waste them.

Go-live went smoothly. We were upgrading everyone in waves. We were still ahead of schedule. Our heads were maybe a little bit bigger, but nothing to worry about.

As the DB grew, we kept an eye on things. We noticed a strange thing start to develop. Every now and then, the web servers started to queue up requests. It was like things were clicking along (thousands of hits per minute) and then somebody hit the brakes. Everything froze and the web servers announced “the line starts here”. After a minute, things would clear up and go back to normal. During that minute though, our hearts sank.

We couldn’t figure it out. For the past 12 months, nobody had seen anything like this. It was only after our production database started filling up that this started to surface. My team had several meetings to raise ideas. Nothing was too absurd to test or monitor at this point. Nothing, that is, except the DB servers. Because, hey, it was Oracle. Besides, with Parallel Server, if one machine was too busy (or crashed), the other machine would pick up the slack. Well, the DB admins had already checked. Both DB servers were pretty calm. So don’t look at them.

For two weeks, there was no pattern. Our tests expanded, but found nothing. Until one day, one of the developers walked into my cube and said “I got it”. He walked me into his cube and started clicking through the program. “There it is. When I add a comment and then remove it, watch the server monitor graph”. I was about to dive for his hand and prevent him from clicking. Certainly he would not voluntarily lock up the dev server, just to prove a point. Well, it is just the dev server.

He clicked [ Ok ] and we turned to the graph. All traffic halted and the servers queued up requests as bad as or worse than they ever had. “Yes! You found it. But wait a second. How are you generating all of that traffic on the dev servers?”, I asked.

He explained, “Oh, I couldn’t recreate this bug on the dev servers. There is not enough data. This is on the production servers”. My heart stopped beating. I could not breathe for a second. When it started again, I was able to intercept the stream of profanity before it left my mouth. *cough* *cough* “Why in the world would you do that on production?!” I asked. He was still feeling proud, but a little confused: “I had to show you that this list of steps worked. See! It works!”

We didn’t do any demos for anyone else. I verbally explained the problem to management while he and the rest of the developers determined a root cause and a fix.

It turns out that the root cause had something to do with comments being stored in a BLOB field. A BLOB is a Binary Large OBject: the way a database stores values that are too big to fit into a normal column. For some reason, our Oracle server had no problem creating BLOBs, but deleting them was another story. Any time we tried to delete one, it froze all database connections. After a minute, things unlocked and everything proceeded. So it wasn’t a performance issue or the result of a crash; it was just some bug that froze everything up. Luckily, it was isolated to one server at a time.

We didn’t get a patch right away, so our temporary solution was to mark comments (and any other rows with BLOB fields) with a “needs to be deleted” flag. They were deleted at night, when nobody would notice the freeze-up.
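For what it’s worth, here is a minimal sketch of that workaround in Python, using the standard-library sqlite3 module as a stand-in for the real Oracle 8 database (the table and column names are my own invention, not the actual schema). The daytime code only flips a flag; a nightly job does the real DELETE when a one-minute freeze won’t bother anyone.

```python
import sqlite3

# Stand-in schema: a comments table with a BLOB body and a "needs delete" flag.
# (Table and column names are illustrative; the real system ran on Oracle 8.)
conn = sqlite3.connect("demo.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS comments (
        id           INTEGER PRIMARY KEY,
        body         BLOB,
        needs_delete INTEGER NOT NULL DEFAULT 0
    )
""")

def remove_comment(comment_id: int) -> None:
    """Daytime path: mark the row instead of deleting the BLOB on the spot."""
    conn.execute("UPDATE comments SET needs_delete = 1 WHERE id = ?", (comment_id,))
    conn.commit()

def nightly_purge() -> int:
    """Overnight job: do the real deletes when a freeze won't be noticed."""
    cur = conn.execute("DELETE FROM comments WHERE needs_delete = 1")
    conn.commit()
    return cur.rowcount

if __name__ == "__main__":
    conn.execute("INSERT OR REPLACE INTO comments (id, body) VALUES (1, ?)",
                 (b"x" * 10000,))
    conn.commit()
    remove_comment(1)
    print("rows purged overnight:", nightly_purge())
```

The one catch with this kind of soft delete is that every query that lists comments also has to filter on the flag (WHERE needs_delete = 0), so a flagged comment disappears from the screen right away even though the row sticks around until the nightly run.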

Lessons learned:
1) The bug could be anywhere. After you rule out the things that are possible, start considering the things that should be impossible.
2) Many hands make the testing better.
3) Keep good logs of what is happening at all times. We should have been able to find this by comparing freeze-up times with the web activity logs (a sketch of that comparison follows).
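In the spirit of lesson 3, here is a rough sketch of the kind of comparison I mean, again in Python. The log file name and the timestamp format are assumptions (one timestamped line per web hit), not what our web servers actually produced: bucket hits per minute, then flag any minute where traffic collapses compared to the minute before, so those times can be lined up against whatever the application was doing.

```python
from collections import Counter
from datetime import datetime

# Hypothetical web activity log: one "YYYY-MM-DD HH:MM:SS <request>" line per hit.
LOG_FILE = "web_activity.log"

def requests_per_minute(path: str) -> Counter:
    """Count hits per minute from a simple timestamped log."""
    counts: Counter = Counter()
    with open(path) as f:
        for line in f:
            if len(line) < 19:
                continue
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
            counts[ts.replace(second=0)] += 1
    return counts

def find_freezes(counts: Counter, drop_ratio: float = 0.2) -> list:
    """Flag minutes where traffic falls to a fraction of the previous minute."""
    freezes = []
    minutes = sorted(counts)
    for prev, cur in zip(minutes, minutes[1:]):
        if counts[cur] < counts[prev] * drop_ratio:
            freezes.append(cur)
    return freezes

if __name__ == "__main__":
    counts = requests_per_minute(LOG_FILE)
    for minute in find_freezes(counts):
        print("possible freeze at", minute)
```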
