Over the past few months I’ve been interviewing candidates for systems administration jobs, and one of the questions I like to ask is about troubleshooting blue screen error messages. I try to find out what approach the person takes when confronted with the dreaded blue screen, and the answers have surprised me, not always pleasantly. Many of the candidates didn’t have a methodical approach and, more often than I would’ve liked, the first step they said they would take was to reinstall or re-image the system, or to replace the memory. Now don’t get me wrong: reinstalling the OS or replacing the memory may solve some blue-screen-causing issues, but they shouldn’t be the first troubleshooting steps taken.
The blue screen (or blue screen of death, blue screen of doom, or BSOD) is properly known as a “Windows Stop Message”. It is displayed when the Windows kernel or a driver running in kernel mode encounters an error which cannot be handled. This error could be something like a process or driver trying to access a memory address which it does not have permission to access, or trying to write to a section of memory which is marked read-only.
More to the point, Stop messages don’t occur without a reason; they are an indication that the system has a problem somewhere – hardware, software, or device drivers can all be the cause of the fault. Often a simple reboot will get the system up and running again, but if the underlying problem is not solved, the blue screen will probably come back.
The aim of this article is to offer a methodical approach to troubleshooting stop messages, with a few simple steps which can take a lot of the guesswork out, and could get your system back up and running more quickly and easily than reinstalling the operating system.
Step 1 – Read the message
It may sound obvious, but the first step is simply to read the message displayed on screen. Often there is enough information displayed to point you to the cause – if the stop error is caused by a kernel-mode driver, the driver image name is generally shown in the message.
Figure 1 is an example of a fairly common stop message – “DRIVER_IRQL_NOT_LESS_OR_EQUAL”. This stop error occurs when a kernel-mode driver attempts an illegal memory access, typically by touching pageable or invalid memory while running at too high an interrupt request level (IRQL). The “Technical information” section shows the STOP code, and also lists the specific driver which caused the fault – in this case it’s “myfault.sys”, the driver installed by the Sysinternals utility NotMyFault.exe, which I used to trigger this crash. In a real-world crash, the driver image name could be any kernel-mode driver installed on the system, but once you know the name of the driver it can be located on disk, and the vendor found by checking the file properties.
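If you have copied the stop message down as text, pulling out the STOP code and the driver name can be scripted. The following is a minimal Python sketch, assuming the classic layout in which the “Technical information” section contains a line starting “*** STOP:” and, where a driver is named, a line of the form “*** name.sys - Address ...”. The sample text is hand-written for illustration; its parameter and address values are made up, and the function name is my own.

```python
import re

# Matches the STOP line, e.g. "*** STOP: 0x000000D1 (0x...,0x...,0x...,0x...)"
STOP_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]{8})\s*\(([^)]*)\)")
# Matches the driver line, e.g. "*** myfault.sys - Address ... base at ..."
DRIVER_RE = re.compile(r"\*\*\*\s+(\S+\.sys)\s+-\s+Address", re.IGNORECASE)


def parse_stop_message(text):
    """Return the STOP code, its parameters, and the driver name (if any)."""
    stop = STOP_RE.search(text)
    if stop is None:
        return None
    params = [p.strip() for p in stop.group(2).split(",") if p.strip()]
    driver = DRIVER_RE.search(text)
    return {
        "code": int(stop.group(1), 16),
        "parameters": [int(p, 16) for p in params],
        "driver": driver.group(1) if driver else None,
    }


# Illustrative text only: the parameter and address values are invented.
SAMPLE = """\
DRIVER_IRQL_NOT_LESS_OR_EQUAL

Technical information:

*** STOP: 0x000000D1 (0x00000000,0x00000002,0x00000000,0xF8A1B403)

***  myfault.sys - Address F8A1B403 base at F8A1B000, DateStamp 00000000
"""

info = parse_stop_message(SAMPLE)
print(hex(info["code"]), info["driver"])
```

The driver field comes back as None when the message contains only a STOP code, which is the cue to move on to searching.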
In terms of finding quick solutions to the problem, the vendor may have an updated driver you can try, or could have a knowledge base you can search for a resolution. However, not every stop message will make it that easy – sometimes there is little more than a STOP code.
Although it looks fairly Spartan, there is still some useful information in this message – the “Technical Information” section includes the STOP code (0x0000007B in Figure 3), and occasionally that can be enough to get started with troubleshooting. However, unless you already know what error the stop code translates to, this is where we move to step 2: Searching.
Step 2 – Search
If the stop message hasn’t given enough information to start troubleshooting, the next step is to search for more details. Again, this may sound obvious, but in my interviews I was also surprised by the number of people who did not mention that they would use the Microsoft Support knowledge base, Microsoft TechNet, MSDN, or some other on-line resources when troubleshooting blue screen errors.
For example, a quick search of MSDN or TechNet will reveal that the stop code shown in Figure 3, 0x0000007B, translates as INACCESSIBLE_BOOT_DEVICE, which means that the operating system failed to initialize the storage device it is attempting to boot from during the I/O system initialization. This generally indicates a storage driver problem, and knowing that the problem is caused by the storage subsystem helps to focus troubleshooting to a specific area, which should make the error easier to diagnose.
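For the codes you meet most often, a small lookup table can save a search. The sketch below is a deliberately short, hand-picked subset of well-known stop codes and their symbolic names, not a complete list. The function name and the fallback text are my own, and anything missing from the table still needs looking up in the Microsoft documentation.

```python
# A few well-known stop codes and their symbolic names.
# This is a small illustrative subset, not a complete list.
STOP_CODES = {
    0x0000000A: "IRQL_NOT_LESS_OR_EQUAL",
    0x0000001E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x00000024: "NTFS_FILE_SYSTEM",
    0x00000050: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x0000007B: "INACCESSIBLE_BOOT_DEVICE",
    0x0000007E: "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED",
    0x0000007F: "UNEXPECTED_KERNEL_MODE_TRAP",
    0x000000D1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}


def describe_stop_code(code):
    """Accept an int or a string such as '0x0000007B' and return a label."""
    value = int(code, 16) if isinstance(code, str) else code
    name = STOP_CODES.get(value, "unknown (search the documentation)")
    return f"0x{value:08X} {name}"


print(describe_stop_code("0x0000007B"))
```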
There are many, many websites offering help with troubleshooting stop errors. My preference is always to start with Microsoft sites or hardware vendor sites, then broaden my searching to other sites and forums if I can’t find what I need. In most cases, someone else will have experienced the same problem, and there may be documented solutions or workarounds offered.
Of course, both steps one and two rely on one crucial thing – that you’ve witnessed or recorded the stop message. If you didn’t see the stop message when it occurred, you can still find the stop code and parameters in the System event log, although unfortunately without additional details such as the stack trace. Even with the details of the stop message, there may not be enough information for a conclusive diagnosis, and at this point we need to move on to step three: Crash dump analysis.
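If the only record of the crash is the entry in the System event log, the stop code and parameters can be pulled out of the event text in much the same way. This is a rough Python sketch which assumes the event message contains the phrase “The bugcheck was:” followed by the code and a bracketed parameter list. That wording is my assumption about the message format and may differ between Windows versions, and the sample message below is hand-written with invented values.

```python
import re

# Assumed wording: "... The bugcheck was: 0x000000d1 (0x..., 0x..., ...)."
BUGCHECK_RE = re.compile(
    r"bugcheck was:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)", re.IGNORECASE
)


def parse_bugcheck_event(message):
    """Return (stop_code, parameters) from an event message, or None."""
    match = BUGCHECK_RE.search(message)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",") if p.strip()]
    return code, params


# Hand-written example of the assumed message format; values are invented.
SAMPLE_EVENT = (
    "The computer has rebooted from a bugcheck.  The bugcheck was: "
    "0x000000d1 (0x0000000000000000, 0x0000000000000002, "
    "0x0000000000000000, 0xfffff88003b5e3a1)."
)

code, params = parse_bugcheck_event(SAMPLE_EVENT)
print(f"0x{code:08X}", [hex(p) for p in params])
```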
Step 3 – Analyze
The third and final method in my approach is to perform basic analysis on the crash dump file, which all Windows systems are configured by default to create. There are three types of crash dump file, and the settings controlling which type of file is created can be found under Startup and Recovery on the Advanced tab in the System Properties dialogue box.
Complete Memory Dump
A complete memory dump contains all the data which was in physical memory at the time of the crash. Complete dump files require that a page file exists on the system volume, and that it is at least the size of physical memory plus 1MB. Because complete memory dumps can be very large, the option to create them is automatically hidden from the UI on systems with more than 2GB of physical RAM, although this can be overridden with a registry change (which I won’t discuss here).
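The page file requirement is simple arithmetic, and can be checked with a couple of lines of Python. This sketch just encodes the “physical memory plus 1MB” rule described above; the function names are my own, and the RAM and page file sizes are example values.

```python
MB = 1024 * 1024


def min_pagefile_for_complete_dump(physical_memory_bytes):
    """Smallest page file (in bytes) that can hold a complete memory dump."""
    return physical_memory_bytes + MB


def pagefile_is_large_enough(pagefile_bytes, physical_memory_bytes):
    """True if the page file meets the complete-dump requirement."""
    return pagefile_bytes >= min_pagefile_for_complete_dump(physical_memory_bytes)


ram = 4 * 1024 * MB  # example: a system with 4GB of physical memory
print(min_pagefile_for_complete_dump(ram) // MB, "MB")  # prints: 4097 MB
print(pagefile_is_large_enough(4 * 1024 * MB, ram))     # prints: False
```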
Kernel Memory Dump
A kernel memory dump contains the kernel-mode read/write pages which were in physical memory at the time of the crash. The dump file also contains a list of running processes, the stack of the current thread, and the list of loaded device drivers. Kernel memory dumps are the default on Windows Server 2008 and Windows 7.
Small Memory Dump
A small memory dump (sometimes also called a mini-dump) contains the stop error code and parameters as well as a list of loaded device drivers, and a small amount of other data. Small memory dumps must be analysed on a system which has access to exactly the same images as the system which generated the dump file, meaning that it can be difficult to analyse the dump file on a system other than the one on which it was created.
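The dump type selected in the dialogue box is stored in the registry, under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl, in the CrashDumpEnabled value, with the dump file path in DumpFile. The Python sketch below maps that value to the three dump types described here, using the standard library winreg module. Treat it as a sketch: the function names are my own, the registry read only works on Windows, and later Windows versions may define values beyond the four listed.

```python
# Meanings of the CrashDumpEnabled value under
# HKLM\SYSTEM\CurrentControlSet\Control\CrashControl.
DUMP_TYPES = {
    0: "None",
    1: "Complete memory dump",
    2: "Kernel memory dump",
    3: "Small memory dump",
}


def describe_dump_type(value):
    """Translate a CrashDumpEnabled value into a readable name."""
    return DUMP_TYPES.get(value, f"Unrecognised value {value}")


def read_crash_control():
    """Read the configured dump type and dump file path (Windows only)."""
    import winreg  # imported here so the rest of the sketch loads anywhere

    key_path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        enabled, _ = winreg.QueryValueEx(key, "CrashDumpEnabled")
        dump_file, _ = winreg.QueryValueEx(key, "DumpFile")
    return describe_dump_type(enabled), dump_file


print(describe_dump_type(2))
```

On a Windows system, calling read_crash_control() should return the dump type together with the configured path, which is stored unexpanded (for example with %SystemRoot% still in it).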
For basic crash analysis, a kernel memory dump is usually adequate and, as shown in Figure 4, the default location for its creation is %SystemRoot%\MEMORY.DMP. The tool required for analysing the crash dump file is WinDbg, the Microsoft Windows Debugger, which can be downloaded from Microsoft’s website.
After installation, WinDbg needs to be configured to use the Microsoft Symbol Server. Once symbols are configured, click the File menu, choose Open Crash Dump, and select the crash dump file you want to analyze. The initial output from WinDbg is shown in Figure 5.
The second-to-last line, which starts “Probably caused by”, indicates the debugger’s best guess at the cause of the crash. In the example in Figure 5 the debugger is correct – this crash was caused by NotMyFault. Other information in the analysis indicates that the crash dump file is a kernel memory dump, and that symbol files could not be loaded for myfault.sys (because it is a third-party driver, and the symbols are not available on the Microsoft Symbol Server).
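If you save the debugger output to a text file, the “Probably caused by” line is easy to pick out with a script, which is handy when working through several dump files. This Python sketch assumes the line has the form “Probably caused by : name”; the sample text is a hand-written approximation rather than real debugger output, and the function name is my own.

```python
import re

# Captures the first token after "Probably caused by :", usually a module name.
CAUSE_RE = re.compile(r"^Probably caused by\s*:\s*(\S+)", re.MULTILINE)


def probable_cause(debugger_output):
    """Return the debugger's suspected module, or None if no guess is present."""
    match = CAUSE_RE.search(debugger_output)
    return match.group(1) if match else None


# Hand-written approximation of the relevant lines, for illustration only.
SAMPLE_OUTPUT = """\
Use !analyze -v to get detailed debugging information.
Probably caused by : myfault.sys
"""

print(probable_cause(SAMPLE_OUTPUT))
```

Bear in mind that this is only the debugger’s best guess, so the name it returns is a starting point for investigation rather than a verdict.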
More information can be gleaned from the dump file by executing verbose analysis, using the debugger command !analyze -v.
The verbose output shows the description of the stop message, which will save you having to search for it, and will also include the stack trace of the thread which was executing at the time of the crash, which could also prove useful if further debugging is needed.
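The same analysis can be run without the GUI, using the command-line debugger kd.exe from the same Debugging Tools package. The Python sketch below builds a command line which opens a dump file (-z), sets the symbol path to the Microsoft Symbol Server (-y), and runs !analyze -v before quitting (-c). The local symbol cache directory, the dump path, and the assumption that kd.exe is on the PATH are illustrative choices of mine, as are the function names.

```python
import subprocess

MS_SYMBOL_SERVER = "https://msdl.microsoft.com/download/symbols"


def symbol_path(cache_dir):
    """Build a symbol path that caches downloaded symbols in cache_dir."""
    return f"srv*{cache_dir}*{MS_SYMBOL_SERVER}"


def build_analyze_command(debugger, dump_file, cache_dir):
    """Return the argument list for a one-shot verbose analysis."""
    return [
        debugger,
        "-z", dump_file,               # crash dump file to open
        "-y", symbol_path(cache_dir),  # symbol search path
        "-c", "!analyze -v; q",        # run the analysis, then quit
    ]


def run_analysis(debugger, dump_file, cache_dir):
    """Run the debugger and return its text output (requires Windows)."""
    cmd = build_analyze_command(debugger, dump_file, cache_dir)
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


# Example paths are assumptions; adjust them for your own system.
cmd = build_analyze_command("kd.exe", r"C:\Windows\MEMORY.DMP", r"C:\Symbols")
print(" ".join(cmd))
```

Only the command is built and printed here; run_analysis() would actually launch the debugger, which needs the Debugging Tools installed and access to the symbol server.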
Basic crash dump analysis is easy, the tools are readily available, and a lot of information about the crash can be found in just a few seconds. If basic analysis doesn’t help to solve the problem, there are many excellent resources available which give much more detailed information about the Windows Debugger and its use, and can provide in-depth guides on how to extract and interpret the data using advanced analysis techniques.
The Debugging Tools for Windows web site is a good resource to start with, as is the book “Windows Internals” (currently in its fifth edition) by Mark Russinovich and David Solomon, which includes a detailed chapter on crash dump analysis.
I hope that this information and the basic troubleshooting method of Read, Search, and Analyze help to take some of the mystery out of blue screen crashes, and enable more administrators to get their systems back up and running quickly. When it comes to speedy recovery, established procedures and troubleshooting methods are worth their metaphorical weight in gold, as randomly trying different approaches will cost you time, energy and sanity.