Exchange Server Log File Replay

Exchange Server stores information in a database and uses log files for transactional processing. To restore, defragment or repair a database, the ESEUTIL tool is essential. As long as you have backed up the database, you can always recover data when the database is lost.

 

In my previous article, Exchange Database Technologies, I discussed the underlying database technology in Exchange Server, the Extensible Storage Engine or ESE. One of the most important points in that article was that all changes to the Exchange Server database go through the log files. This is done for recovery purposes. Let’s look at the log files, and the replay of log files in case of a recovery scenario…

Creation of the database

After you create a Storage Group in the Exchange Management Console, the directory on disk is still empty. The only thing that actually happens is that a property for the Storage Group is set in Active Directory, so no log files have been created yet.

When a database is created in the Exchange Management Console, the directory on disk is still empty. Again, the only thing that happens is that a property is set in Active Directory.

When the database is actually mounted, these things will happen on disk:

  • Active Directory is checked for the location of the log files;
  • When no log files are found, a new set of log files is created with an lGeneration number of 1;
  • Active Directory is checked for the location of the database;
  • Since this is the initial mount, a new database is created;
  • At this point, log files, a database file and a checkpoint file have been created, and the Exchange Server database is ready to use.
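To make the sequence above concrete, here is a minimal Python simulation of a first mount. It is only a model under stated assumptions: the Active Directory lookups are replaced by plain path arguments, the files it creates are empty placeholders, and the helper name `first_mount` is hypothetical. It does not reflect how Exchange Server is implemented internally.

```python
import tempfile
from pathlib import Path


def first_mount(log_dir: Path, db_path: Path, prefix: str = "E00") -> list:
    """Simulate the files created when a new database is mounted for the first time.

    Hypothetical model: log_dir and db_path stand in for the locations that
    Exchange would read from Active Directory.
    """
    created = []
    log_dir.mkdir(parents=True, exist_ok=True)

    # No log files yet: start a new log stream at lGeneration 1 (current log is E00.log).
    if not any(log_dir.glob(f"{prefix}*.log")):
        (log_dir / f"{prefix}.log").touch()
        created.append(f"{prefix}.log")

    # Initial mount: the database file does not exist yet, so create it.
    if not db_path.exists():
        db_path.parent.mkdir(parents=True, exist_ok=True)
        db_path.touch()
        created.append(db_path.name)

    # The checkpoint file records how far the log files have been flushed.
    checkpoint = log_dir / f"{prefix}.chk"
    if not checkpoint.exists():
        checkpoint.touch()
        created.append(checkpoint.name)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sg = Path(tmp) / "First Storage Group"
        print(first_mount(sg, sg / "Mailbox Database.edb"))
```

Running it a second time against the same directory returns an empty list, because all three files already exist.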

Tip.
In the old BackOffice 4.x resource kit there was a small utility called “mailstorm”. This MAPI utility was very useful for sending a large number of messages in a short timeframe, and with mailstorm you could watch log files being created. Unfortunately, mailstorm doesn’t work with Exchange Server 2007 anymore, but there’s a PowerShell script available with the same functionality. Using this script, a variety of predefined messages can be sent to your Exchange Server 2007 to check the log file creation.

Create a user account in Active Directory and create a mailbox using the Exchange Management Console. Log on to the mailbox and start sending messages until you have a couple of messages and a couple of log files.
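If you don’t have the PowerShell script at hand, a similar burst of test messages can be generated from Python. This is a minimal sketch, not the script mentioned above: the `run` helper, the server name and the addresses are hypothetical, and it assumes your Exchange server has a receive connector that accepts SMTP submissions from the machine you run it on, which depends on your configuration. Each message gets a padded body so that log files fill up quickly.

```python
import smtplib
from email.message import EmailMessage


def build_messages(sender: str, recipient: str, count: int, body_kb: int = 64):
    """Yield `count` test messages, each with a body of roughly body_kb kilobytes."""
    line = "Log file replay test. " * 3 + "\n"   # 67 characters per line
    lines = max(1, body_kb * 1024 // len(line))
    for i in range(1, count + 1):
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = recipient
        msg["Subject"] = f"Log file test message {i}"
        msg.set_content(line * lines)
        yield msg


def send_burst(smtp, messages) -> int:
    """Send every message over an open SMTP connection; return how many were sent."""
    sent = 0
    for msg in messages:
        smtp.send_message(msg)
        sent += 1
    return sent


def run(host: str, sender: str, recipient: str, count: int = 500) -> int:
    """Hypothetical entry point: send `count` messages through `host` on port 25."""
    with smtplib.SMTP(host, 25) as smtp:
        return send_burst(smtp, build_messages(sender, recipient, count))
```

Call it as, for example, `run("exchange.example.local", "tester@example.local", "testuser@example.local")` with your own host and test addresses, then watch new log files appear in the Storage Group directory.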

What happens when the database file is lost? If we have all the log files still available it should be possible to retrieve all information. Remember what I wrote in my previous article: “everything is logged in the log files, even the creation of database files!”

If you dismount the database, delete the database file “mailbox database.edb” (or whatever name you have given the database) and then try to mount the database again, the following error is raised:

572-image2.jpg
Figure 1. Error message when a mailbox database appears to be missing

In my humble opinion, the yellow exclamation mark should be replaced with a very large red “X”, since this is a very important message. When you click “Yes”, a new database will be created in the same location as the “old” database. This is a completely new database. Although it has the same name, “mailbox database.edb”, as in the previous step, it has new signatures, which, as explained in my previous article, Exchange Server uses to link databases and log files together. Replaying the old log files will not bring the information back into this database, because in this scenario it is a different database. And remember, since all information is logged into the log files, the creation of this new database is also logged.

Instead, choose “No”, then delete the checkpoint file E00.chk and try to mount the database again. No error message is raised and the database is mounted. Even better, when you log on to your test mailbox you will see that no information has been lost either!

This is what happens when you delete the checkpoint file and mount the database again: during the mount process, Exchange Server cannot find the database file and it cannot find the checkpoint file. Therefore, it starts to recover all information by replaying the available log files. It starts with the oldest log file, E0000000001.log, which also contains the creation of the initial database file. All information in the other log files is replayed into this database until the end of the last log file, E00.log, is reached. The database is then mounted and ready to use.

When Exchange Server cannot find the database file but does find the checkpoint file, it will not replay the log files. Instead, it starts at the end of the last log file, E00.log, and creates a new database.
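The behaviour described in the last few paragraphs boils down to a small decision table. The Python function below is a simplified model of it, based only on the scenarios in this article; the function name and return values are hypothetical. It assumes the complete log stream, starting at E0000000001.log, is still on disk.

```python
def mount_outcome(database_present: bool, checkpoint_present: bool) -> tuple:
    """Return (where log replay starts, whether a brand-new database is created).

    Simplified model of the scenarios described in this article.
    """
    if not database_present and checkpoint_present:
        # The checkpoint says everything has been flushed: no replay, and a new
        # empty database with a new signature is created.
        return ("no replay", True)
    if checkpoint_present:
        # Normal case: replay resumes at the position recorded in E00.chk.
        return ("from checkpoint", False)
    # No checkpoint: replay starts at the oldest log file. If the database file
    # is missing, replaying E0000000001.log recreates it.
    return ("from oldest log", False)


if __name__ == "__main__":
    for db in (True, False):
        for chk in (True, False):
            print(f"database={db!s:5} checkpoint={chk!s:5} -> {mount_outcome(db, chk)}")
```

The row with a database but no checkpoint is the offline restore scenario discussed later in this article.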

How can you tell which files belong together?

Dismount the Exchange Server database, open a command prompt in the database’s directory and check the header information using the ESEUTIL tool with the /MH option. You will find the following information:

When you check the header information of the log files with ESEUTIL (using the /ML option), you will find corresponding information:

Note: both screen outputs have been edited for readability.

As you can see, both the log file signature and the database signature match, so these files belong together. When you (accidentally) create a new database, you will find different information in the database header:

As you can see, the log signature hasn’t changed (it is still the same set of log files), but the database has a new signature. So although the database has the same name and is in the same location, it is a new database!
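The comparison you just did by eye can be expressed as a small check. In this Python sketch the header values are typed in by hand; the dictionary keys, the simplified `Signature` fields and the sample values are hypothetical stand-ins for what ESEUTIL displays, not a parser for its output.

```python
from typing import NamedTuple


class Signature(NamedTuple):
    """Simplified signature: creation time plus a random number."""
    create_time: str
    rand: int


def belongs_together(db_header: dict, log_header: dict) -> bool:
    """True if the database and the log files share the same signatures."""
    same_log_stream = db_header["log_signature"] == log_header["log_signature"]
    same_database = db_header["db_signature"] in log_header["attached_db_signatures"]
    return same_log_stream and same_database


# Hypothetical sample values, not real ESEUTIL output.
logs = Signature("03/01/2008 10:15:02", 21864)
original_db = Signature("03/01/2008 10:15:05", 73521)
log_header = {"log_signature": logs, "attached_db_signatures": {original_db}}

# Same name and location, but created by clicking "Yes": new database signature.
recreated_db = Signature("03/02/2008 09:02:41", 11077)

if __name__ == "__main__":
    print(belongs_together({"db_signature": original_db, "log_signature": logs}, log_header))   # True
    print(belongs_together({"db_signature": recreated_db, "log_signature": logs}, log_header))  # False
```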

Key takeaway: the checkpoint file determines whether log files are replayed and where log file replay starts, and therefore what happens during the mount process.

Offline backups

You can create backups by copying the database files to a safe location. The steps to do so are:

  1. Dismount the database (meaning it is not available!)
  2. Copy the database file to a safe location
  3. Mount the database
  4. Perform a consistency check on the database copy
  5. If everything is ok, delete the log files

The first three steps do not need any further explanation. But what log files can you safely delete?

When the database is dismounted, all changes that have been recorded in the log files but not yet written to the database file are flushed to the database file. When all data has been flushed, the files are closed.

You can see when the database was dismounted by looking at the “last detach” information in the database header. Also check that the database is in a clean shutdown state and does not need any log files for mounting.

So in this specific scenario, all log files older than E0000000011.log can safely be deleted, because all information in these log files has been flushed to the database. Why not log file E0000000011.log itself? After the database is mounted again, new information can be logged in this same log file beyond the detach point.
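Keep in mind that the generation number in a log file name is hexadecimal: E0000000011.log is generation 0x11, which is 17 in decimal. The Python sketch below lists the log files that are older than the generation recorded at the last detach. The `detach_generation` argument is an assumption: you read it from the “last detach” information in the header yourself, and the function does not call ESEUTIL.

```python
import re
from typing import Iterable, List, Optional

# Exchange 2007 log names: base name (E00) + 8 hex digits, e.g. E0000000011.log
LOG_NAME = re.compile(r"^E[0-9A-F]{2}([0-9A-F]{8})\.log$", re.IGNORECASE)


def generation(name: str) -> Optional[int]:
    """Return the log generation encoded in a file name, or None (e.g. for E00.log)."""
    match = LOG_NAME.match(name)
    return int(match.group(1), 16) if match else None


def safe_to_delete(names: Iterable[str], detach_generation: int) -> List[str]:
    """Log files strictly older than the generation recorded at the last detach."""
    old = [n for n in names if (g := generation(n)) is not None and g < detach_generation]
    return sorted(old, key=generation)


if __name__ == "__main__":
    on_disk = [f"E00{g:08X}.log" for g in range(1, 0x12)] + ["E00.log", "E00.chk"]
    for name in safe_to_delete(on_disk, detach_generation=0x11):
        print(name)   # E0000000001.log up to and including E0000000010.log
```

E0000000011.log, the current E00.log and the checkpoint file are never part of the result.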

The offline copy of the database must also be checked for consistency. A database can contain corrupt pages, and as long as these pages are not read by the Exchange server, you will never know that they are corrupt. Suppose somebody is on a one-year sabbatical and his mailbox data is never accessed: corrupt pages can exist for that entire period without being noticed.

So you have to check the offline copy for any inconsistencies using the ESEUTIL tool with the /K option:

When you use the /K option, ESEUTIL reads every page in the database and verifies its checksum. When everything is OK, you can safely continue.
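Conceptually, a checksum pass like ESEUTIL /K reads the file page by page and recomputes each page’s checksum. The Python sketch below illustrates that idea on a made-up file layout: every 8 KB page (the page size Exchange 2007 uses) starts with a 4-byte CRC-32 of the rest of the page. ESE’s real page header and checksum algorithm are different, so this is a model, not a replacement for ESEUTIL /K.

```python
import struct
import zlib
from typing import List

PAGE_SIZE = 8192   # Exchange 2007 database pages are 8 KB


def make_page(payload: bytes) -> bytes:
    """Build one page in the model layout: 4-byte CRC-32 followed by the page body."""
    if len(payload) > PAGE_SIZE - 4:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(PAGE_SIZE - 4, b"\0")
    return struct.pack("<I", zlib.crc32(body)) + body


def bad_pages(data: bytes) -> List[int]:
    """Read every page and return the numbers of pages whose checksum does not match."""
    bad = []
    for number in range(len(data) // PAGE_SIZE):
        page = data[number * PAGE_SIZE:(number + 1) * PAGE_SIZE]
        (stored,) = struct.unpack_from("<I", page)
        if zlib.crc32(page[4:]) != stored:
            bad.append(number)
    return bad


if __name__ == "__main__":
    db = bytearray(b"".join(make_page(f"page {n}".encode()) for n in range(3)))
    print(bad_pages(bytes(db)))      # []
    db[PAGE_SIZE + 100] ^= 0x01      # flip one bit in page 1
    print(bad_pages(bytes(db)))      # [1]
```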

Note. Starting with Exchange Server 2003 Service Pack 1, Exchange has error-correcting code (ECC) for checksum errors. If a page contains a checksum error, which is usually caused by only one bit being “flipped”, Exchange Server can automatically correct this bit. When the page is flushed to disk, the error is automatically corrected. Please see Microsoft KB article 867626 for more background information regarding this change.
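How can a single flipped bit be corrected rather than merely detected? One classic approach, sketched here in Python, stores the XOR of the positions of all 1 bits; after a single flip, XOR-ing the stored value with a freshly computed one yields exactly the position of the flipped bit. This is a generic Hamming-style illustration chosen for this article, not the ECC scheme Exchange actually uses, which is described in the KB article mentioned above.

```python
from typing import List


def position_code(bits: List[int]) -> int:
    """XOR of the 1-based positions of all bits that are set."""
    code = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            code ^= position
    return code


def correct_single_bit(bits: List[int], stored_code: int) -> List[int]:
    """Repair at most one flipped bit, using the code stored when the data was written."""
    syndrome = stored_code ^ position_code(bits)
    fixed = list(bits)
    if syndrome == 0:
        return fixed                      # nothing flipped
    if syndrome > len(bits):
        raise ValueError("more than one bit changed; cannot correct")
    fixed[syndrome - 1] ^= 1              # flip the damaged bit back
    return fixed


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0]
    code = position_code(data)            # stored alongside the data
    damaged = list(data)
    damaged[4] ^= 1                       # one bit "flips" on disk
    print(correct_single_bit(damaged, code) == data)   # True
```

Two or more flipped bits can be missed or "corrected" wrongly by this simple scheme, which is why practical ECC designs add extra parity.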

Offline Restore

When something goes wrong, we have to restore the offline backup on the Exchange server. An offline restore requires the following steps:

  1. Copy the offline backup of the mailbox database file to the correct location
  2. Replay the log files
  3. Mount the database

Now remember the first part of this article. If we leave the checkpoint file in its original location, Exchange Server will check this file, determine that no replay is needed and mount the copy of the database as it is, without the changes made after the offline backup.

If we delete the checkpoint file, Exchange Server will start replaying log files from the oldest log file available. This replays all information logged after the offline backup into the copy of the database, resulting in a complete, up-to-date database.
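The difference between keeping and deleting the checkpoint file can be sketched in a few lines of Python. In this model, `logs` maps each log generation (the lGeneration value that ESEUTIL reports for a log file) to its file name, with the current E00.log holding the highest generation; the function name and the data are hypothetical. Replay starts at the checkpoint generation when a checkpoint exists, and at the oldest log file otherwise.

```python
from typing import Dict, List, Optional


def replay_plan(logs: Dict[int, str], checkpoint_generation: Optional[int] = None) -> List[str]:
    """Return the log files that would be replayed, in order (simplified model)."""
    if not logs:
        return []
    # Without a checkpoint, replay starts at the oldest log file on disk.
    start = min(logs) if checkpoint_generation is None else checkpoint_generation
    return [logs[gen] for gen in sorted(logs) if gen >= start]


if __name__ == "__main__":
    # Hypothetical log set: generations 0x11 to 0x14 plus the current log E00.log.
    logs = {gen: f"E00{gen:08X}.log" for gen in range(0x11, 0x15)}
    logs[0x15] = "E00.log"
    print(replay_plan(logs, checkpoint_generation=0x15))  # ['E00.log']: older logs are skipped
    print(replay_plan(logs))                                # all five files, oldest first
```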

If you check the event log of the Exchange server, you can see which log files were replayed: for every replayed log file, an event with ID 301 and source ESE is logged.

572-image1.jpg
Figure 2. Log file replay events in the event viewer

Besides simply mounting the database, it is also possible to replay the log files into the database manually using the ESEUTIL tool. This is done with the /R (recovery) option.

Note. The drive letter K: immediately follows the /d option; there’s no space in between.
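If you script recovery steps, building the command line in one place helps you get details like this right. The Python sketch below only assembles the arguments for a soft recovery. It assumes the /l (log file location) and /d (database location) options of ESEUTIL /R as listed by `eseutil /?` on Exchange 2007, and the paths and helper names are hypothetical. Check the syntax against your own ESEUTIL version before running anything.

```python
import subprocess
from typing import List


def recovery_command(log_base: str, log_dir: str, db_dir: str) -> List[str]:
    """Arguments for ESEUTIL soft recovery; switch and path are joined without a space."""
    return ["eseutil", "/r", log_base, f"/l{log_dir}", f"/d{db_dir}"]


def run_recovery(args: List[str]) -> int:
    """Run ESEUTIL (Windows only, database dismounted) and return its exit code."""
    return subprocess.run(args, check=False).returncode


if __name__ == "__main__":
    args = recovery_command("E00", r"K:\Logs\SG1", r"K:\Data\SG1")
    print(" ".join(args))   # eseutil /r E00 /lK:\Logs\SG1 /dK:\Data\SG1
    # run_recovery(args)    # only on the Exchange server itself
```

Paths containing spaces get quoted as a whole argument by `subprocess` on Windows; test that case on a lab server first.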

Online Maintenance

There’s an internal process in the Exchange server that’s responsible for the maintenance of the mailbox database. This process is called online maintenance, or OLM. OLM runs every night, by default between 1 am and 5 am, and consists of 10 tasks that are performed during this period. Five of them are specifically for the Public Folder database and five of them are for both the Public Folder database and the mailbox database. I want to highlight two of them in this article:

OLM checks the mailbox database for deleted messages that are beyond their retention time. If it finds any, they are fully deleted from the database and their pages are freed up (that is, marked as “white space”). OLM is also responsible for permanently deleting mailboxes that are beyond their retention time. Once messages and mailboxes have been permanently deleted, they are really gone; the only way to get them back is by restoring an old backup of the mailbox database. Archiving solutions can help here, but they are out of the scope of this article.

When the mailboxes and messages are permanently deleted, the pages they used are freed up and available for new messages. But the free pages are scattered across the database. OLM therefore also performs an online defragmentation of the database file by consolidating the pages freed up in the previous step and, if possible, placing them in one contiguous block of free space. This makes adding new data to the database more efficient, since the data doesn’t have to be split and spread across multiple segments of the database.
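A toy model in Python shows the effect of these two tasks on the file. Pages are list entries and `None` stands for white space; the function names are hypothetical, and real online defragmentation works on ESE’s B-tree pages rather than shuffling a flat list. What the model does show is that the number of pages, and therefore the size of the file on disk, does not change.

```python
from typing import List, Optional, Set

Page = Optional[str]   # None = free page ("white space")


def purge_expired(pages: List[Page], expired: Set[str]) -> List[Page]:
    """Task 1: items beyond their retention time are deleted; their pages become white space."""
    return [None if page in expired else page for page in pages]


def online_defrag(pages: List[Page]) -> List[Page]:
    """Task 2: consolidate used pages so the free pages form one contiguous block."""
    used = [page for page in pages if page is not None]
    return used + [None] * (len(pages) - len(used))


if __name__ == "__main__":
    db = ["msg1", "old1", "msg2", "old2", "msg3", None]
    freed = purge_expired(db, {"old1", "old2"})
    compacted = online_defrag(freed)
    print(compacted)                    # ['msg1', 'msg2', 'msg3', None, None, None]
    print(len(compacted) == len(db))    # True: same number of pages, same file size
```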

Online maintenance is a very disk-intensive process and it should be monitored very closely. If the Exchange server cannot finish OLM in the designated timeframe, errors are logged in the event log. Carefully examine these errors and determine what causes them.

Make sure that no other processes are working with the mailbox database at the same time. Other maintenance processes, such as backups or archiving, are typically also scheduled at night. I will cover online backups in the next article in this series. Moving large numbers of mailboxes, which is also often scheduled during the night, will also have a negative impact on OLM.

For a complete overview of online maintenance, please visit the Microsoft Exchange team blog.

Key takeaway: online maintenance is responsible for online defragmentation, but it defragments only inside the database. It does not reduce the physical size of the mailbox database file.

Offline defragmentation

Please note that offline defragmentation should never be considered a part of normal maintenance of an Exchange database. Only if your free space exceeds 30% of your total database size, or if you are told to do a defragmentation by Microsoft Customer Support Services, should you consider doing an offline defragmentation.

While online maintenance defragments the inside of the database, it does not reduce the physical size of the database. If you have a 100 GB database with only 10 GB of data in it, online maintenance will eventually free up 90 GB within the database (it will be marked as “white space” or “available space” within the logical tables of the database). The file “mailbox database.edb” as it resides on disk still remains 100 GB. To compact the database, we have to perform an offline defragmentation. This can only be done using the ESEUTIL tool with the /d option. As the name implies, this has to be done when the database is offline.

If you look closely at this output, you can already see what happens during an offline defragmentation. A new mailbox database file is created with a random name, in this case TEMPDFRG2256.EDB. The data from the original mailbox database is copied to this temporary database. Of course, existing white space in the original database is not copied to the new database. When copying has finished, the temporary file is renamed to the original file name and is now called “mailbox database.edb”. So this is a new file, containing only the data that resided in the original file. In the previous example of the 100 GB database with only 10 GB of data, an offline defragmentation will leave us with a new database of about 10 GB. There is some overhead involved, so the size can vary by about 10% either way.
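The copy-and-rename approach can be illustrated with real files in Python. The sketch assumes a made-up file format in which an all-zero 8 KB page represents white space, and it reuses the temporary name TEMPDFRG2256.EDB mentioned above; the function name is hypothetical. ESEUTIL /D itself rebuilds tables and indexes and gives the new file a new signature, which this model does not attempt.

```python
import os
import tempfile
from pathlib import Path

PAGE_SIZE = 8192
FREE_PAGE = bytes(PAGE_SIZE)   # model assumption: an all-zero page is white space


def offline_defrag(db: Path, temp_name: str = "TEMPDFRG2256.EDB") -> int:
    """Copy only used pages to a temporary file, then rename it to the original name.

    Returns the size in bytes of the new database file.
    """
    temp = db.with_name(temp_name)
    with db.open("rb") as src, temp.open("wb") as dst:
        while page := src.read(PAGE_SIZE):
            if page != FREE_PAGE:      # white space is not copied
                dst.write(page)
    os.replace(temp, db)               # the temporary file takes over the original name
    return db.stat().st_size


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db = Path(tmp) / "mailbox database.edb"
        db.write_bytes(b"D" * PAGE_SIZE + FREE_PAGE * 9)   # 1 used page, 9 free pages
        print(offline_defrag(db))                            # 8192
```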

You must realise that this is a new database with new signatures, so it is not possible to replay the data in existing log files into this new database. This is why you should make a new backup immediately after performing an offline defragmentation.

Please note that if you do not redirect the temporary file created by offline defragmentation, the disk volume containing the Exchange mailbox database needs available free space of at least 110% of the size of the mailbox database.
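The two rules of thumb in this section, more than 30% white space and free disk space of at least 110% of the database size, can be combined into a simple pre-flight check. This Python sketch is a hypothetical helper, not a Microsoft tool; the database size and white space figures are values you supply yourself, while `free_gb` reads the free space of a volume with the standard `shutil.disk_usage` call.

```python
import shutil
from typing import List


def defrag_preflight(db_size_gb: float, white_space_gb: float, volume_free_gb: float) -> List[str]:
    """Return the reasons NOT to run an offline defrag now (an empty list means go ahead)."""
    reasons = []
    if white_space_gb / db_size_gb <= 0.30:
        reasons.append("white space does not exceed 30% of the database size")
    if volume_free_gb < 1.10 * db_size_gb:
        reasons.append("volume has less free space than 110% of the database size")
    return reasons


def free_gb(path: str) -> float:
    """Free space on the volume holding `path`, in GB."""
    return shutil.disk_usage(path).free / 1024 ** 3


if __name__ == "__main__":
    print(defrag_preflight(100, 90, 120))   # [] -> the 100 GB / 10 GB example qualifies
    print(defrag_preflight(100, 20, 50))    # two reasons
```

The 110% check only applies when the temporary file stays on the same volume, and a request from Microsoft Customer Support Services overrides the 30% rule.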

In the Exchange Server 5.5 timeframe it was a best practice to perform an offline defragmentation every month. This would create a new database with new tables, indices etc. Since then, Microsoft has invested a tremendous amount of development effort in the database engine. Nowadays the database engine is so good and stable that an offline defragmentation only has to be performed after a repair action, or when large amounts of space can be freed up. This can be the case after the deletion of a large number of mailboxes or after moving a lot of mailboxes to another database.

Worst case scenario: Repairing your database

The worst thing that can happen when running Exchange is that your server crashes and your database refuses to mount. Even after you have checked that the Information Store is running, and perhaps rebooted the server, it still won’t mount and the following error is raised:

572-image3.jpg
Figure 3. Error message when the database won’t mount after a crash

In this example, one of the log files needed for the mounting process is no longer available (I deleted one of these log files for demonstration purposes). You can check whether this is the case by comparing the values in the “logs needed” section of the database header against the actual log files on disk.
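Comparing the required range with the files on disk is easy to get wrong by hand, especially because the generation numbers in the file names are hexadecimal. This Python sketch assumes you have read the low and high generation of the required range from the database header, and, if E00.log is on disk, its generation from ESEUTIL /ML; the function name is hypothetical.

```python
import re
from typing import Iterable, List, Optional

LOG_NAME = re.compile(r"^E[0-9A-F]{2}([0-9A-F]{8})\.log$", re.IGNORECASE)


def missing_generations(low: int, high: int, names: Iterable[str],
                        current_generation: Optional[int] = None) -> List[int]:
    """Generations in the required range [low, high] for which no log file exists."""
    present = set()
    for name in names:
        match = LOG_NAME.match(name)
        if match:
            present.add(int(match.group(1), 16))
    if current_generation is not None:
        present.add(current_generation)   # E00.log has no generation in its name
    return [gen for gen in range(low, high + 1) if gen not in present]


if __name__ == "__main__":
    on_disk = ["E0000000011.log", "E0000000013.log", "E00.log"]
    for gen in missing_generations(0x11, 0x14, on_disk, current_generation=0x14):
        print(f"missing: E00{gen:08X}.log")   # missing: E0000000012.log
```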

So now we have a database that is no longer consistent and has to be fixed before it can be used again. To accomplish this, the ESEUTIL tool should be used again, but now with the /P option (for repair).

Note. Only perform this step after you have copied all databases, log files and the checkpoint file to a safe location. Performing a repair action can result in data loss!

When you enter this command, a warning is raised stating that information might be lost if you continue. What actually happens is that ESEUTIL checks all information in the database, including all tables and pointers. If a pointer does not point to the correct location, or the information in that location cannot be read, the pointer is deleted and the index entry is cleaned up. All information that the pointer pointed to will be lost!
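A toy model in Python makes the consequence of this pointer clean-up visible. An index maps message keys to page numbers, and any entry whose page cannot be read is dropped, together with the message it pointed to. This is a deliberately tiny, hypothetical model; ESEUTIL /P checks far more structures than this.

```python
from typing import Dict, List, Set, Tuple


def repair_index(index: Dict[str, int], readable_pages: Set[int]) -> Tuple[Dict[str, int], List[str]]:
    """Drop index entries that point to unreadable pages; return (kept entries, lost keys)."""
    kept = {key: page for key, page in index.items() if page in readable_pages}
    lost = sorted(key for key in index if key not in kept)
    return kept, lost


if __name__ == "__main__":
    index = {"msg-001": 10, "msg-002": 11, "msg-003": 12}
    kept, lost = repair_index(index, readable_pages={10, 12})
    print(lost)   # ['msg-002'] is gone for good after the repair
```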

At the end of this operation you will have a database that is in a consistent state, but the indexes in the database might be non-contiguous and thus less efficient. As a Microsoft best practice, an offline defragmentation should be performed after a database repair. This will create a new database with the existing data, but with new tables, indices etc.

Both ESEUTIL /P and ESEUTIL /D fix issues at a low (database) level, but not at the application level. After performing a repair and an offline defragmentation on your database, you have to run the ISINTEG tool. This checks the database at the application (logical) level and fixes any issues it finds. When you are completely finished, do not forget to back up your database immediately.

Note. Perform these steps only if you have backed up the database(s), log files and the checkpoint file since performing a repair action can lead to data loss!

To be 100% sure that all issues are fixed at all levels, you can create a new mailbox database and move all mailboxes from the repaired database to the new database. When all mailboxes have been moved, you can delete the repaired database. But please be aware that a repair action with ESEUTIL can delete data, so despite all these activities, end users may still lose mail. Unfortunately, it is not possible to predict what will be lost, or how much.

You can find more information on the Microsoft website: http://technet.microsoft.com/en-us/library/aa997152(EXCHG.80).aspx and http://msexchangeteam.com/archive/2004/06/18/159413.aspx

Note. Of course it is also possible to restore a backup of the database and replay the existing log files up to the point of the missing log file. If that is possible, it is most likely the best and fastest option you have.

When using ESEUTIL, there is one very important issue: ESEUTIL is very powerful, but not very fast. Depending on the hardware in use for the Exchange server, as a rule of thumb ESEUTIL processes the Exchange database at a rate of between 5 GB/hour and 10 GB/hour. When you have a 250 GB Exchange database, a repair action would take 25 hours or longer to complete. The subsequent offline defragmentation can take another 25 hours or longer, so you will see an outage of at least 50 hours. This doesn’t even take into account running the ISINTEG tool and moving the mailboxes to a completely new database, and I’m pretty confident that this will not match your Service Level Agreement (SLA). On an Exchange server using a single copy of the database, i.e. a single server or a single copy cluster, it is best to use a maximum database size of 50 GB. This will keep your database manageable from a service level point of view.
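The arithmetic behind these numbers is simple enough to put in a small Python helper, so you can plug in your own database sizes. The 5 to 10 GB/hour rate is the rule of thumb from this article, not a measured value for your hardware, and the estimate covers only the two ESEUTIL passes (repair plus offline defragmentation), not ISINTEG or moving mailboxes. The function names are hypothetical.

```python
def eseutil_pass_hours(db_size_gb: float, rate_gb_per_hour: float) -> float:
    """Hours for one ESEUTIL pass over the whole database."""
    return db_size_gb / rate_gb_per_hour


def repair_outage_hours(db_size_gb: float, rate_gb_per_hour: float, passes: int = 2) -> float:
    """Repair (/P) followed by offline defragmentation (/D): two full passes."""
    return passes * eseutil_pass_hours(db_size_gb, rate_gb_per_hour)


if __name__ == "__main__":
    for size in (50, 200, 250):
        fast = repair_outage_hours(size, 10)
        slow = repair_outage_hours(size, 5)
        print(f"{size:>3} GB: {fast:.0f} to {slow:.0f} hours of ESEUTIL work")
```

For the recommended 50 GB single-copy database this gives 10 to 20 hours, which is far easier to fit into an SLA than 50 to 100 hours for 250 GB.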

So why does Microsoft suggest a maximum database size of 200 GB in an Exchange Server 2007 Cluster Continuous Replication (CCR) environment? As explained in my first article, you have an extra copy of your database available in your Exchange 2007 cluster. So when the database of the active node crashes and is beyond recovery (i.e. cannot be mounted anymore), the passive copy takes over and starts servicing your (Outlook) clients. When this happens, you have time to investigate the broken node of your cluster and repair it. You are not tied to the 5 GB/hour processing rate of ESEUTIL, which ought to make you more comfortable with your Service Level Agreement.

Conclusion

Exchange Server stores information in a database and uses log files for transactional processing. This way, as long as you have backed up the database, it is always possible to recover data when the database is lost. The tool for database maintenance is ESEUTIL. Using ESEUTIL you can check your database, replay data from the log files into your database, or repair your database when it’s corrupt.

When you need to compact your database, for example after you have deleted or moved multiple mailboxes, you have to perform an offline defragmentation, but remember to create a new backup immediately afterwards.

One important thing to remember is the checkpoint file. Although it is a tiny file, only 8 KB in size, it has a major impact on the results when either recovering data using ESEUTIL or mounting the database in the Exchange Management Console.

Running into issues in your test environment isn’t a problem. You can, and you should, start working with databases and ESEUTIL to see exactly what happens in the different scenarios. Make sure you fully understand (and document!) all the steps necessary for a successful recovery when you lose your database. However, do this in your TEST environment, not in your live/production environment.

When problems occur in your production environment and you face a major outage, ask Microsoft Customer Support for help. Microsoft CSS will guide you through a successful recovery, but if you have gained some experience in your test environment, you will at least know what’s happening and why Microsoft Customer Support asks you to perform the various steps.

In the final article of this series, I will talk about online backups and VSS backups and explain the differences between these and the steps documented in this article.