The DBA Detective: The Case of the Missing Index

When problems arise in SQL Server, we're faced with a server full of suspects, including disk I/O, memory, CPU, incorrect or missing indexes, badly written T-SQL code, out of date statistics, and full disk drives. All of these have motive and opportunity to murder the performance of our databases, and it's the DBA's job to collar the culprit quickly, without relying on luck or heroics.

The Interminable Hourglass

‘Joe Dee Beeay?’

I looked up. The cubicle smelled of old coffee, dust from old technical manuals, the tang of heated electronics and fear.

‘Yeah?’ Something in the boss’s voice told me that he hadn’t come to tell me I was doing good work.

‘The application’s had a wet job. It’s dead meat.’

The application, THE application, you know the one. The beautiful one, the one everyone wants. She had the occasional habit of running as if she’d had her hair parted by lead piping. Some in the business say she’s brought her troubles on herself, but it’s just the way she was built. Sure, she’s a little older now, and is showing the signs. She’s been beat down and roughed up more times than I care to remember and it shows, but she was still The Application, and she needed me.

The Fat Man’s voice was calm, but he had sweat on his brow.

‘What’s the layout?’ I asked.

‘Dunno. Just sort it out, Joe, please.’

I wasn’t worried until I heard the word ‘please’. Maybe, too, it was that humorless smile he had before he stumped out of the room that made me wince. I knew what he was thinking: ‘There are no bad DBAs. There are only some DBAs that aren’t as good as others.’

Now what?

I tapped the spacebar, lost in gloomy thought. I looked out the window. Suddenly the world seemed dark with more than the night.

Every possible kind of problem had happened to this application over the years. She’s been put back together so many times it’s surprising that she’s still so desired and attractive instead of looking like Frankenstein’s monster on the outside as well as the inside. Regardless, Joe Muggins had to identify the problem.

False Dawn

‘Better go see Dawn in Accounts’, I decided; she knows when things aren’t right and likes to give bad news. Dawn was a dame with an ice-cold stare. From thirty feet away she looked like a lot of class. From ten feet away she looked like something made up to be seen from thirty feet away. She pouted when I told her I was on a mission to fix The Application, sipped from her plastic coffee-cup, and then sang like a canary. Well, like a canary who wanted to upset her audience. She told me that the issue was happening right now, but had also happened last night and, evidently, earlier in the week.

‘Thanks’, I muttered. ‘You’re welcome’, she replied, meaning I was as welcome as an over-packed bin-liner.

I trudged moodily back to my pigpen in IT. No one had thought it was important enough to inform the DBA team. If only we’d put some monitoring software in place to identify issues when they were happening, instead of relying on other people, with other concerns and problems, to get the information to the team.

I reached for my faithful SSMS. “Hmm. Signs of life in the old crone?” Waiting for the connection to complete felt like waiting for the next blow to land in a beating. Success! The server was online and accepting connections; that was a positive sign. Without that reassuring open query-window, I’d have had to resort to the low-down, dirty trick of using the Dedicated Administrator Connection to elbow my way onto the server. Also, if the server wasn’t accepting connections or, worse yet, was offline, I’d have had to trust that I could revive the cold corpse by restoring from backup. There were backups running, but, hell, when was the last time we’d verified them?
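
For the record, the DAC is nothing exotic: you prefix the server name with ADMIN: in SSMS, or pass -A to sqlcmd. Whether the remote flavor is even switched on is a one-line check, something like this:

-- Is the remote Dedicated Administrator Connection enabled on this instance?
-- (From a command prompt, the DAC itself is reached with: sqlcmd -S <server> -A)
SELECT name, value_in_use
FROM   sys.configurations
WHERE  name = N'remote admin connections';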

I shook my head, trying to shake loose that nagging doubt about those unverified backups. Something to check once this crisis is past.

Once I’d connected to the server, a wild world of lash-ups and software harvester-tape opened up to me. I shuddered. A DBA’s work is sometimes unsavory. Where first? Travel down to the grit and dirt of the tempdb? Now there’s a dive where some of the worst of the low-life lurk. Look for the rot within the glitz-and-glamour of the CPU? Walk the mean streets of the I/O processes? Or just round up the usual suspects by taking a look at the currently running queries?

Let’s start with the usual suspects, I decided.

The Usual Suspects

Sure, I could look at wait statistics to understand specifically what was causing the server to slow down. It’s a great metric for understanding the system as a whole. I could maybe run a query against sys.dm_os_wait_stats and order by the number of currently waiting tasks, the cumulative wait time, or the max wait time to see what, in general, was making the server run slow. Yeah, that might flush the critters out.
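
A first stab at that query might look something like the following; the list of benign system waits to skip over is only a sample, not scripture, and you could just as easily order by waiting_tasks_count or max_wait_time_ms:

-- Accumulated wait statistics, worst offenders first.
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       max_wait_time_ms,
       signal_wait_time_ms
FROM   sys.dm_os_wait_stats
WHERE  wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                         N'XE_TIMER_EVENT', N'CHECKPOINT_QUEUE',
                         N'BROKER_TO_FLUSH', N'REQUEST_FOR_DEADLOCK_SEARCH')
ORDER BY wait_time_ms DESC;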

I stared moodily at that empty query window. Nah. C’mon Joe, since we’re only getting calls from a single team on a specific application, the other approach, seeing what’s running on the server, might stir the pot more.

To see what’s currently running on the server, I could run a query against the Dynamic Management Object (DMO) sys.dm_exec_requests. I gave the server a burst of SQL through my trusty SSMS:
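
(What follows is the shape of it rather than a transcript; the first pass was nothing clever, just a demand for everything with the filtering left for later.)

-- Every request currently executing on the server, warts and all.
SELECT *
FROM   sys.dm_exec_requests;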

The server babbled out its secrets as though I’d promised it a nice waterboarding holiday. I immediately saw a full listing of all the sessions running on the system. I had useful information at my fingertips, and then details that may, or may not, prove useful later. To narrow things down to some of the more immediately useful data, I modified the query like this:
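
(Again, a sketch rather than the genuine article; the session_id > 50 cut-off for user sessions is a rough rule of thumb, not gospel.)

-- Just the user sessions, just the columns that matter right now,
-- with the longest-running requests at the top.
SELECT r.session_id,
       r.status,
       r.start_time,
       r.command,
       r.wait_type,
       r.wait_time,
       r.blocking_session_id,
       r.cpu_time,
       r.total_elapsed_time,
       r.reads,
       r.writes
FROM   sys.dm_exec_requests AS r
WHERE  r.session_id > 50   -- skip the system sessions
ORDER BY r.total_elapsed_time DESC;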

One session stood out in the results, about as inconspicuous as a tarantula on a slice of angel food cake. It was running long and using resources like a fat man pursuing a lead bird.

What I needed now was the text of the query this session was running, and its execution plan. I started to write another query, thinking there must be an easier way to pull all this information together in a hurry.
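
A sketch of the sort of query I ended up with, pulling the statement text and the plan alongside each running request:

-- For each active user request, grab the statement being executed and its plan.
SELECT r.session_id,
       r.status,
       r.total_elapsed_time,
       SUBSTRING(t.text,
                 (r.statement_start_offset / 2) + 1,
                 ((CASE r.statement_end_offset
                        WHEN -1 THEN DATALENGTH(t.text)
                        ELSE r.statement_end_offset
                   END - r.statement_start_offset) / 2) + 1) AS running_statement,
       p.query_plan
FROM   sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) AS p
WHERE  r.session_id > 50
ORDER BY r.total_elapsed_time DESC;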

I sat back in my chair with a creak and a whistle. I whistled, and the chair creaked, but it might have been the other way around.

No query should be running like this, I mused, and especially not this one. I tuned it recently, I remembered, and added a new index in order to make it more efficient.

Looking at the execution plan, I noticed a scan where an Index Seek used to be. Could something have happened to the index or might the statistics simply be out of date? Goddam it, was there a rogue table? I soon had my SSMS looking straight at that table.
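
Two quick checks settle a scan-versus-seek question like this one: is the index still on the table at all, and when were its statistics last updated? Something along these lines, with the table name below standing in for the real one:

-- List the indexes on the suspect table and when their statistics were last updated.
-- dbo.Orders is a placeholder for the real table name.
SELECT i.name                              AS index_name,
       i.type_desc,
       STATS_DATE(i.object_id, i.index_id) AS stats_last_updated
FROM   sys.indexes AS i
WHERE  i.object_id = OBJECT_ID(N'dbo.Orders')
ORDER BY i.index_id;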

Well, if I hadn’t homed straight in on a pretty sinister problem. That index I’d created earlier had vanished, done a runner, as if I’d never done all that tuning work. I don’t like seeing things like that. It makes the hair on the nape of my neck prickle. What else in the database had been given the big sleep? Things like that don’t vanish by themselves.

I swore. I needed to get that index back, and find out what else was missing. Compare with what’s in source control? I’m a production guy; I don’t put stuff in source control. Sure, it gets there eventually, but the devs do it on wet Fridays. There was all sorts of healing stuff we’d done on that database that hadn’t been fed back into source control. Did I save my build script? I didn’t feel like answering that question, so I hurled my damp plastic coffee-cup over the top of the pigpen instead. It cheered me a bit.

I smiled grimly. Wouldn’t it be a fine time to find that the backup that ran just after my database-tuning session had failed or was corrupt? I’d be able to find out, if only I had a place to restore the database. I’d always meant to get around to using compressed backups.
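
Mind you, finding out whether a backup file is even readable doesn’t need the disk space; a quick RESTORE VERIFYONLY tells you that much, though not whether the data inside is worth anything. The path below is invented:

-- Sanity-check the backup file without restoring it.
-- The path is a placeholder; CHECKSUM only verifies page checksums if the
-- backup was taken WITH CHECKSUM in the first place.
RESTORE VERIFYONLY
FROM DISK = N'\\BackupShare\TheApplication\TheApplication_Full.bak'
WITH CHECKSUM;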

I didn’t want the restore to eat up more disk space on the production server; I’d have to do it somewhere else and, luckily, the QA team wasn’t around to object. This was a production issue, and even if they were screaming from their pigpens, I’d do it. When they’re slapped, they’ll take it and like it. It is not a fragrant world.

If only there was a way to directly access the backup files to either pull the code out or simply connect them to the server without having to use so much disk space.

I went for the first backup, last night’s. After setting up the restore process, I had a moment to reflect on the cruelty of the human condition while the restore operation ran. A rush of errors on the screen soon told me I was about to experience this first-hand. Last night’s backup really was corrupt.

I sat down. It was a good start, but it didn’t go far enough. I ought to have turned out the light and hidden under the desk.

The Screen Painted Red

How many backups do they keep locally for this server? Ah, it goes back three days. No need, at least, to try to contact the offsite storage company to retrieve a backup. Crossing my fingers, I tried restoring the older backup. It worked. I browsed down to the index and scripted it out into a query window.
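
The restore itself was the boilerplate kind, just pointed somewhere harmless; every name and path below is a stand-in for the real thing:

-- Restore the three-day-old backup as a scratch copy on the QA box.
-- Database name, logical file names and paths are all placeholders.
RESTORE DATABASE TheApplication_Scratch
FROM DISK = N'\\BackupShare\TheApplication\TheApplication_Full.bak'
WITH MOVE N'TheApplication_Data' TO N'D:\Scratch\TheApplication_Scratch.mdf',
     MOVE N'TheApplication_Log'  TO N'L:\Scratch\TheApplication_Scratch.ldf',
     STATS = 10;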

Quick as a rabbit, I changed the connection to the production server (change requests are for those squares who shall inherit the earth eventually) and fired off the script to recreate the necessary index. I waited, and waited. Suddenly, I felt like a fugitive from the laughing house; the screen was painted red like the lips of a cheap woman as the error messages popped up. There was no space on the hard drive for the index.

I knew that whatever malicious spirits inhabit our datacenter had conspired to ensure that we were running low on storage but, heck, I didn’t know we were that low. Then it all clicked. Wasn’t there a message from earlier this week? One of the junior DBAs, Timmy, had dealt with an issue where they had run out of space on a drive overnight. He’d fixed it somehow, but I hadn’t been watching that closely; he had problems of his own.

The Smoking Gun

Digging back through the email, there it was: the smoking gun. The drive had run out of room, and the junior DBA had decided that the new index wasn’t needed and had dropped it like a murder weapon over the side of a bridge. There is no trap for a DBA so deadly as the trap set by an idiot colleague.

I contacted the SAN admins and somehow managed to persuade them to allocate a new LUN to the server. There I added a filegroup for the database and was able to recreate that index with room to spare. Now to have a conversation with that junior DBA, the kind that ought to stick at least four inches out of his back.
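
The fix itself boiled down to three statements: a new filegroup, a file for it on the new LUN, and the index recreated on top. All of the names and the index definition below are stand-ins for the real ones:

-- Add a filegroup backed by a file on the newly allocated LUN,
-- then recreate the lost index on it. Every name here is a placeholder.
ALTER DATABASE TheApplication
    ADD FILEGROUP Indexes_FG;

ALTER DATABASE TheApplication
    ADD FILE
    (
        NAME = N'TheApplication_Indexes',
        FILENAME = N'I:\SQLData\TheApplication_Indexes.ndf',
        SIZE = 10GB,
        FILEGROWTH = 1GB
    )
    TO FILEGROUP Indexes_FG;

CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderDate, TotalDue)
    ON Indexes_FG;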

The End?

The Fat Man seemed pleased as punch. A punch in the mouth. He decided against it. Instead he gave me that smile again.

“OK Joe. So, plenty of heroics to retrieve this situation, but what caused it?”

Hmm. That’s a hard one. “The application’s no good.”

“My life’s no good, but I’m stuck with it.” I didn’t like the look he gave me. “Instead of being told by the business people that there was a problem, we should have found out ourselves, before them, and found out in more detail. There might have been several applications complaining about a slow server, meaning bottlenecks in the hardware or misconfiguration of the server. You got lucky. Maybe some up-front planning, with the right software tools, might have entailed fewer heroics. I need fewer heroes and less luck.”

I shrugged. I’m like any good detective; I gotta go where the clues lead me. A true DBA Detective relies on his trusty SSMS and shoots from the hip. Or, maybe the fat guy had a point.

When problems arise in SQL Server, we’re faced with a server full of suspects, including disk I/O, memory, CPU, incorrect or missing indexes, badly written T-SQL code, out of date statistics, and full disk drives. All of these have motive and opportunity to murder the performance of our databases, and it’s our job to collar the culprit, quickly.

It requires a methodical, levelheaded approach, a refusal to jump to conclusions until all of the facts and clues are exposed, and the right set of tools to get the job done and increase our luck.

First published as lesson 1 of the DBA Team series on the Red-Gate.com site. See lesson 4, ‘Disturbing Developments’, for the continuing saga of Joe Dee Beeay.