The SQL Server Instance That Will not Start

Comments 0

Share to social media

The SQL Server service won’t start… cue ominous music. It’s a situation that brings immediately to the mind of the DBA images of late nights, lost weekends and cold pizza at 3 a.m.

A number of problems can prevent the SQL Server service from starting and this article will examine the more common ones and, I hope, offer a way out of the nightmare. I cover:

  • Service account password incorrect or account locked or disabled
  • Corrupt or missing master database files
  • Corrupt or missing model database files
  • Unable to create tempDB
  • Unable to open the error log

Before diving into the various possible causes, we’ll start with some basic troubleshooting steps that will help identify the cause.

Common Symptoms and Initial Troubleshooting

The first and most obvious symptom will be that the SQL Server service is not running and an error greets your attempts to restart it.

1722-1-0c051a8c-8fd5-4253-b295-0ee9026b6

Figure 1: Unhelpful error

Alternatively, if you tried to start the service from the command prompt…

1722-nightmarefigure2.png

Figure 2: Another unhelpful error

Neither of these messages helps identify a cause for the startup failure. Oh, and just in case the message in Figure 2 raised your hopes, more help is not available by running NET HELPMSG 3547. It just gives the same “Service specific error occurred” message.

The first place you should start looking for the cause is the SQL Server error log. The Windows Application Event Log will do as well, but I prefer the SQL error log, as I can see all the messages at once, and not have to click on them one by one.

The SQL error log is just a text file on disk. If you’re unsure of its location, check the startup parameters of the SQL Server service in SQL Server Configuration Manager, as shown in Figure 3.

1722-2cef17a8-c42c-4ade-8306-ef85f2a2767

Figure 3: SQL Server startup parameters

The -e parameter is the one we need and, in my case, I can see that my error log should be in the folder:

C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Log\

This is the default location for the error log for the default instance of SQL Server 2008. To find the cause of the startup failure, navigate to that folder and, in Notepad, open the latest error log file (called ERRORLOG) and scroll right to the bottom.

If SQL Server at least started its startup routine before it failed, then it will have logged whatever error it encountered that prevented the service from starting completely. Two cases where there won’t be a relevant error log entry are if:

  • The service encountered a login failure (service account password invalid or account locked or disabled)
  • Either the defined location for the error log files is not there, or the SQL account does not have permission on it.

In the first case, SQL Server won’t even begin the startup routine and hence won’t log anything. In the second case, it has nowhere to write the log entries. If we find nothing in the SQL error log, we need to examine the Windows Event log to find out why, and we’ll return to this eventuality in the later sections. However, in this case, the error log is present and the very last message in the log reads as follows:

I can see from the message that I have a problem with the master database and I can direct my troubleshooting efforts there. This approach eliminates any flailing around, in panic, trying out random solutions without any idea of the root cause of the problem, which is a ‘methodology’ that wastes time and might do more damage in the process.

The first, basic troubleshooting step complete, it’s now time to delve into the various possible causes for the failure of the SQL Server to start, and recommended solutions.

Incorrect password or service account locked or disabled

I’ve seen this happen when the administrator has configured the SQL Server service account to require regular password changes, and also in cases where DBAs or developers know the SQL Server service account password and sometimes use it, often to avoid “inconvenient” security policies such as those forbidding developers from having sysadmin access to the production servers.

If someone changes the service account password, but doesn’t update the SQL Server service with the new password, then SQL Server will run fine until the next time we restart it, at which point we’ll see the error messages in Figures 1 and 2.

Additionally, if a user repeatedly enters the password incorrectly, or a service (SQL, or any other service that may be using the same account) repeatedly attempts to submit the old password, after a password change, then the account will eventually be locked. As before, this will have no effect on a running SQL Server but the next time that the service attempts to start, it will be unable to do so.

This is one of the cases where looking at the SQL Server error log is fruitless. The latest log file (Errorlog) will contain only messages from before the restart and will have no indication of the problem. If I simulate this problem and look at the error log, the last messages in the log read:

These messages simply record the last shutdown of SQL Server before the service failed to start, and there is nothing in that log to indicate any problem. However, the lack of error messages is, in itself, useful, as it eliminates many potential causes that would result in logged messages.

The next place to look, to see if it is a security problem, is the Windows Event logs, specifically the System log (not the security log). If I go and look there (Administrative tools | Event Viewer), and filter for errors, I see the following:

1722-1-cc6f515b-d19f-42d1-871d-6167aebab

Figure 4: Useful errors from the Event Log

That’s a clear indication that an incorrect password caused the problem! Other possible messages are:

  • Logon failure: account currently disabled.
  • Login failed: Account locked out.

The solution to these problems is very simple; if the password is in error, change the password that the SQL Service is using to match the account’s true password, or reset the account password, if it’s unknown, and change all services that use it to have the correct password. If the account is disabled or locked, then enable or unlock it.

To prevent problems such as these, I would recommend that you use very complex passwords for the SQL Server service account, and then don’t set the passwords to expire. Nobody, especially not developers or administrators, and nothing other than the SQL Server service, should use the service account. I recommend, generally, that different SQL Server instances and services use different service accounts. If possible, set the SQL Server service account to disallow interactive logins.

Master database files missing or inaccessible

One of the first things that SQL Server needs to do in the startup process is to open and recover the master database, so that it can load the configuration settings and locate and open the other system databases and all the user databases. If SQL Server cannot find the files for the master database, for some reason, then it cannot start.

If someone modifies the startup parameters incorrectly (forgetting a “;” between startup parameters is a common mistake), SQL Server could look in the wrong place for the master files. Alternatively, the location might be correct but inaccessible due, for example, to a drive failure, or the SQL Server service account not having permission to access the specified files or folders. In each case, the startup will fail.

In such cases, the SQL Server error log will hold useful information. The last message in the error log will look something like one of these:

Or:

The key here is the first part of the message, the operating system error code, in the first case Error 2, in the second Error 5. In the first example, the description isn’t useful, but an Internet search for “Operating system error 2” will quickly reveal that it means File Not Found, indicating that the master.mdf file is just not there. In the second example, an “access denied” error indicates that SQL server does not have permissions to the mastlog.ldf file.

Other common errors include “Error 3: Folder not found” and “Error 32: File in use by another process“.

The first step here is to identify what has happened to the master data and log files. Has someone changed the startup parameter to point to an incorrect location? If so, we need to alter the startup parameter so that it points to the correct location of the database files. You can find the startup parameters under the properties of the service, in the Service Control Manager application, as shown in Figure 3. The -d parameter specifies the location of the data file, the -l specifies the location of the log file.

Is the drive they are located on available? Is the folder they are located in accessible? Are the files there? If the answer is “yes” in each case then try to start SQL Server again, as the error may have been a temporary one. If that works, then you still need to investigate why the problem occurred (maybe an incorrectly configured anti-virus scanner; maybe the disks take time to come online) and rectify it so that it does not happen again.

If the files are not where they are supposed to be, perhaps because the drive failed or someone accidentally deleted them, then we’ll need to restore them from backup (and everyone does have backups of the system databases, right…?)

So, off to Books Online and the “Restore the Master Database” page (http://msdn.microsoft.com/en-us/library/ms190679.aspx), which tells us that to restore the master database, we must first restart the instance in single user mode, using the -m flag.

1722-6d9d4734-9577-486f-afc0-0343778ce87

Figure 5: Attempting to restart SQL Server using the -m flag

Or not…! SQL won’t start without the master database, even with the -m flag. In order to restore the backup of the master database, we’re going to have to:

  1. Rebuild all the system databases
  2. Start SQL Server in single-user mode
  3. Restore the backup of master, as well as backups of model and MSDB (rebuilding the system databases rebuilds all of them, which is why we need to restore model and MSDB in addition to master)

As an aside: in addition to taking regular backups of the system databases, I like to take a one-off, post-installation copy of the files for each of the two main system databases (master and model). This is not because copying database files is a good backup strategy, but because it makes it easier to recover from this kind of situation, as it means I have files to put down should the system database files disappear completely. This allows me to start the instance and restore the backups of the system databases.

Books Online documents well the process for rebuilding all the system databases (http://msdn.microsoft.com/en-us/library/dd207003.aspx). The tool doesn’t say much, but it does the job.

1722-669c696e-1609-41da-90c9-b892da281e3

Figure 6: Rebuilding the system databases

Once done, startup SQL Server in single user mode (-m) and restore master, model and msdb from backups, as per the Books Online entries.

Master database files corrupt

Just finding the master database files is not enough; SQL Server must be able to open the database, recover it and bring it online. Severe corruption may result in the master database failing to come online and hence the startup of SQL Server failing.

As in the previous case, the last of the SQL Server error logs will contain an error that describes the problem. The final messages in the error log will look something like this:

Or maybe this:

Depending on the type of corruption, and on which parts of the files are corrupt, you may see different messages. The key is that there is some fatal corruption found in the master database and the solution in each case is the same as described in the previous section for missing files. Rebuild the system databases, restart SQL Server in single user mode and restore the backups of all the system databases.

Model database files missing, inaccessible or corrupt

In the startup process, once SQL has opened and recovered the master database it needs to bring the other system databases online, starting with TempDB. However, SQL Server cannot simply locate and open the TempDB files and run crash recovery, as it does for the other system databases. TempDB is not like the other databases in the instance in that it’s not intended to be a permanent store of data. As such, logging works differently for TempDB; for this database, SQL Server cannot roll transactions forward, which is part of a normal crash recovery. Therefore, instead of simply recovering TempDB on startup, SQL Server must return it to a known state, and the model database provides that known state. What happens is that SQL Server:

  1. Locates and opens the existing tempDB data and log files
  2. Sets their file sizes to those specfied for tempDB in the system catalogs
  3. Recreates or ‘clears’ (the terms are used interchangeably in the documentation and error logs) tempDB using model, copying over allocation pages, bitmaps, etc.

If the TempDB files don’t exist (see the next section), SQL will create them based off the model database files (much as it would when a user database is created) and size them according to the file sizes specified for TempDB in the system catalog.

All of this is a slightly long-winded way of saying that in order to start up cleanly, SQL Server needs to clear TempDB. In order to do this, SQL Server needs to bring online (open and recover) the model database, as model is the template. If the model database files are missing or damaged, this clearing of TempDB will fail and the SQL Server startup will fail. However, I hope the more-detailed explanation, above, will help you make sense of the messages you see when troubleshooting startup issues that relate to problems with model.

The SQL error logs will contain messages explaining the problem and they will be much the same as for missing or corrupt master database files, just for the model files instead.

Model database file missing:

Model database file corrupt:

Starting up database ‘model’.

Note that in each case we get errors relating both to the problem with model and with the inability to create TempDB.

If a model file is missing, we need to find out what happened to it. Has a drive not come online (or come online after SQL started)? Has someone accidentally renamed or moved the file? If we can make the model files accessible to SQL Server, in a similar fashion as described previously for the missing or inaccessible master files, then this will solve the problem.

If a model file is corrupt, then we need to restore a backup of model. Let’s see what Books Online has to say about restoring model:

“Restoring model or msdb is the same as for performing a complete database restore of a user database.”

Well that’s useful… not! SQL Server won’t start without model, so we can’t just restore it as we would a user database.

Let’s try starting SQL Server from the command line, with traceflag 3608 (http://sqlserverpedia.com/wiki/Trace_Flags). This traceflag tells SQL Server to recover only the master database upon startup, so SQL Server will leave model (and the other system databases) closed; it won’t attempt to recover model or clear TempDB, so the damaged model file shouldn’t cause SQL Server to fail to start.

Sure enough, SQL Server starts. Great, now let’s fire up SQLCMD and restore model, as shown in Figure 7.

1722-8450f6f8-b210-4f61-a561-61b76b6a34e

Figure 7: Attempting to restore model, after starting SQL Server with traceflag 3608

That wasn’t quite what I wanted to see. Let’s look at the error log (which SQL Server also prints to the console when we start it from the command line) and see what it reports.

What we have here is a classic catch-22. Previously, we got SQL Server started, albeit running in a limited “master-only mode”. However, when we issue the RESTORE command, SQL Server, before it even attempts to carry it out, firsts tries to clear TempDB (not because it needs TempDB to perform the restore operation, but as part of bringing the rest of the system databases online). To clear TempDB, SQL Server tries to start up model, which it can’t do because one or more of model‘s files is corrupt or missing. As such, the process fails before the restore of model starts.

Since the problem seems to be the clearing of TempDB, let’s go back to the beginning and try to restart SQL Server, but this time with traceflag 3609 instead, which tells SQL Server not to clear TempDB on startup, but just to recover it from the existing files. This is not something that you would ever run normally, because the previous shutdown could have left TempDB in an inconsistent state (remember, for TempDB SQL Server cannot roll transactions forward, which is part of a normal crash recovery). However, for recovering from a disaster it can be useful.

This time SQL Server won’t even start. We specified via use of traceflag 3609 that SQL Server should, on startup, recover TempDB from the existing files but not attempt to clear it. Nevertheless, it still attempted to recover model, along with the other system databases (because this time we didn’t tell it not to) and the failure to recover model resulted in SQL shutting down again.

Oddly enough, the error log still complained about not being able to create TempDB, though it had no need to do that. Perhaps that’s just a default message when failing to recover model.

Well, if one traceflag won’t do the job, maybe two will:

This instructs SQL Server not to recover any database other than master, and not to clear TempDB, just to recover it, as is, from its existing files. In other words, if SQL does need to bring TempDB online at a later point, it won’t attempt to clear it, which is what requires model). This time, as for when we used traceflag 3608 on its own, SQL Server starts up just fine.

Now, to restore model:

1722-199fbe57-d95c-4dc3-8d05-cc3c0cf70a3

Figure 8: Restoring model, after starting SQL Server with traceflags 3608 and 3609

The restore works this time, but let’s see what the error log reports:

The error log shows that SQL Server started TempDB but thanks to the use of traceflag 3609 when we started SQL Server previously, it only recovered TempDB; it didn’t attempt to clear it. As an aside, this is the only time you’d see the “Recovery is writing a checkpoint in database ‘tempdb’” message, as recovery does not normally run on TempDB. Note that if the TempDB files were also damaged or missing however, this would result in SQL Server shutting down.

The critical point is that, this time, it was the RESTORE command that caused SQL Server to try to start up model (something it would for any database being restored), rather than an attempt to clear TempDB. As such, the existing, damaged model database is already marked as “restoring”, and so SQL Server does not attempt to recover it, does not notice the damage and so the restore could complete. In a sense, the restore proceeded from this point just as would the restore of any user database. Once the restore of model completes, we can restart SQL Server as a service without any of the traceflags and it will start correctly.

The process as I’ve described it sounds a little more complex than it is, just because I wanted to make it clear why we needed both traceflags. In practice, this restore method is very quick: start SQL from the command line with the traceflags, restore model backup, kill that running instance of SQL, and restart the service.

Nevertheless, a simpler approach might be to copy the model files from another SQL instance of the same version (preferably service pack too), and use those in place of the missing or damaged files. This lets SQL start normally and then we can simply restore model from backup. If we have backups of the model database files, we can equally well use those.

Yet another option would be to rebuild the system databases (as discussed earlier) and then restore the backups of master, model and msdb. This will take a little longer, as three backups need to be restored, but works just fine.

TempDB location does not exist

As we saw in the previous section, in order to start up, SQL Server has to be able to bring TempDBonline. Missing or corrupt TempDB files aren’t a huge problem here, because if necessary SQL Server can recreate the TempDB database from model. However, the TempDB folder has to exist.

If someone altered the TempDB database, and in doing so specified an invalid location for the file, or the drive that housed TempDB failed, or someone changed the folder names while the SQL Server service was stopped, then SQL won’t be able to restart.

As with most problems with system databases, SQL Server logs the error message to the error log and it looks like this:

Operating system error 3 is, as mentioned earlier, folder not found. It could be that the folder C:\SQLData does not exist or it could be that the drive does not exist at all.

As in the earlier cases, the first step on encountering this error is to identify the cause. Did a drive fail? Did a drive come online after the SQL Server service? Did someone rename the directory? In short, can we resolve the problem quickly by changing or adding a directory? In the example above, it’s far more likely that the directory does not exist than that the C: drive does not exist (seeing as the operating system is running) and as such recreating that directory would allow SQL Server to start.

If a drive has failed, it may be possible to temporarily map a SAN LUN (or even add an external drive) to that drive letter, to allow SQL Server to start. Once SQL Server has started, we can change the defined location of the TempDB files using the ALTER DATABASE command and restart SQL Server, putting the TempDB files into a new location.

If that’s not possible, then we need to start SQL Server without it clearing TempDB so that we can specify a new location that does work. We can do that, as described in the previous section, with the traceflag 3608

SQL starts in the “master-only” configuration, which shouldn’t be a huge surprise at this point, and we can connect via SQLCMD and change the defined location of the TempDB database to a folder that does exist and that SQL Server has permission to access.

1722-44b7def8-329f-4093-94eb-54a014cfed4

Figure 9: Altering TempDB to change the file locations

Assuming we specified the new location correctly, we can now stop the command-line instance of SQL Server, restart the service and SQL Server will recreate the TempDB files in their new location.

Missing Error Log Location

This is a seriously fun and unexpected error! One of the things that SQL Server needs to do to start is to locate and cycle the error log. It locates the error log using the -e startup parameter, if specified, or it looks in the registry if that parameter is not there. Once it gets the location, it renames the existing error log files (Errorlog becomes Errorlog.1, Errorlog.1 becomes Errorlog.2, and so on), creates a new Errorlog and starts logging the startup progress into this file.

If the directory specified for the error log does not exist, startup will fail and there will be no error log to tell you why it failed.

There are two main reasons why the error log directory might be missing:

  • Drive failure (if the error log was on a different drive than the SQL Server binaries)
  • An incorrect modification to the service startup parameters so that SQL Server looks for a location that doesn’t exist

The first hint that this is the cause of SQL Server failing to start is the lack of a new error log. We can find out where SQL Server expects to find the error log by checking the startup parameters of the SQL Server service in SQL Server Configuration Manager, as shown earlier in Figure 3. If the location specified for the -e parameter does not exist, then it’s a rather large hint that the problem is a missing error log directory. It is also possible for the location to exist but for the SQL Server service not to have permissions to access it, in which case, the location will exist, but there will be no error log from the last startup attempt.

To confirm the problem is the lack of an error log, we can check the Windows Event Log, specifically the Application log and, if it is, then there we will find the error message shown in Figure 10.

1722-28e0005c-be84-47db-8fa3-a3554aa8020

Figure 10: Checking the Windows event log for missing error log messages

As discussed earlier, Operating system error 3 is Folder not found. Error 5 would be Permission denied, and Error 32 means that the Errorlog file was in use by another process when SQL attempted to open it, possibly caused by a misconfigured anti-virus scanner.

Once again, the correct resolution depends on the exact nature of the problem. If we’re dealing with a misconfigured anti-virus scanner, which was reading the file when SQL tried to access it, then the resolution may be as simple as restarting the service and fixing the anti-virus configuration to ensure that the problem does not reoccur.

If the SQL Server service does not have permission to access the folder, then we grant the necessary permission and restart the SQL server service. Once the crisis is past, we can find out who removed the permission and why, perhaps a security admin tightening security or implementing a new group policy.

If the problem is that the folder or drive is missing, then we can modify the value specified by the -e startup parameter so that it points to a location that exists, which should allow the SQL Server service to start. If only the folder is missing, we can recreate it and assign the correct permissions. SQL Server doesn’t need to locate the old error log files in order to start; it just needs the folder to be there.

Conclusion

In this article, we delved into some of the problems that can cause a SQL instance to fail to start including missing files belonging to the system databases, account problems and missing error logs. We investigated the causes of those problems, and methods to resolve them and get SQL back up and running.

It is essential that every DBA can troubleshoot and resolve such quickly. I would encourage anyone who is responsible for production SQL Server instances to fire up a VM and work through these problems and solutions. Rename a file belonging to the master database; corrupt model; delete the TempDB folder and practice recovering from the various situations.

That way, if such a problem should ever strike, you will know immediately how to proceed and how to get things back up and running quickly and effectively.

Load comments

About the author

Gail Shaw, famous for her forum contributions under the pen name 'gilamonster', is from Johannesburg Gauteng and works as a database consultant in performance tuning and database optimisation. Before moving to consulting she worked at a large South African investment bank and was responsible for the performance of the major systems there.