Unpleasant things occasionally happen to production databases. Much of this is due to chance. Occasionally, you’ll be unlucky, but your good luck increases with the effort you put into ensuring the robustness of your database. The extent of your occasional misfortune will decrease in direct proportion to the increasing resilience of your database architecture and your level of preparedness.
Your measure of System Availability is determined by your speed of fixing any problems that happen. In this respect, High Availability is not so much a technology as a cultural mindset within the organisation. In order to achieve high availability, you must:
- Be clear on the likely points of failure, the patterns and volume of use, the business requirements, and the strengths and weaknesses of the system architecture,
- Be methodical in reducing risks and both scripting and rehearsing disaster recovery.
- Build in resilience and ‘pain reporting’ into both the software and hardware.
- Be able to fix problems rapidly. Although the fates alone will decide when failures happen, you have more control over the time you take to restore the system’s functionality. To maintain high-availability, you have to reduce this time as much as possible
It is important to learn from the misfortunes of others in order to refine your database’s availability, as there is neither the time nor opportunity to learn only from your own misfortunes.
Measuring System Availability
Calculating the current availability of your database system
Availability is measured from the desktop. Databases can be ‘available’ from the DBA’s perspective when it is actually not available to the end user. An eCommerce site is only available when the customer can trade!
Database availability is a simple calculation
ALTER FUNCTION ufrDatabaseAvailability
@MeanTimeBetweenFailures real, --in hours
@MeanTimeToRealise real, --in hours
@MeanTimeToRepair real --in hours
RETURN ( @MeanTimeBetweenFailures - ( @MeanTimeToRealise
+ @MeanTimeToRepair ) ) * 100.000
/ @MeanTimeBetweenFailures * 1.0000
--so if you get a failure every two months, and it takes you,
--on average 3 hours to realise the
--fact and eight hours to fix it, your availability is...
SELECT dbo.ufrDatabaseAvailability(2 * 30 * 24, 4, 8)
Even though your record is pretty bad, you can claim 99% availability. However, your SLA (Service-Level Agreement) could specify the four nines (99.99%), meaning that you’d have two minutes to discover the fault and two minutes to fix it.
The two nines (99% availability) is relatively easy to achieve by installing proper monitoring and alerting, performing frequent and proper backups, and having precise and clear instructions for disaster recovery.
The Three nines (99.9% availability) can only be achieved once you have resilient hardware, using redundant hardware components, including redundant network cards, routers, and Hot-Swappable RAID Arrays.
Moving to the four nines (99.99% availability) requires all of the ‘Three nines’ and, in aditionaddition, the service must also be maintained during routine maintenance, by using redundant systems. Your total downtime in a year can be 51.6 minutes.
Achieving the five nines would requires requires an annual down-time of just five minutes. To achieve this, you need geographically dispersed systems with automatic failover.
Calculating Cost of Downtime
This is a difficult figure to calculate. It is often quoted by the minute as a revenue figure; ‘cost per minute’. Unfortunately, a failure in the middle of peak trading times is quite different from the same failure at a quiet period. For some production systems, this figure is almost impossible to calculate, and must be established and agreed with the accountants, before downtimes happen.
Calculating the cost of Availability
It is tempting to cost out only the hardware and buildings, rather than the man-hours. However, before a decision can be made on implementing high-availability, the cost of attaining each of the ‘nines’ levels of availability must be estimated for the particular production system. By doing this, you are more likely to be provided with the necessary budget to achieve it, and less likely to be presented with unrealistic demands over availability.
Preparing for Disaster recovery
Before implementing a resilient system, you need to:
- Understand the processing requirements of all applications that are connecting to the databases.
- List all security accounts, security settings, configuration options, databases, instance-level objects, DTS/SSIS packages, data feeds, remote/linked servers
You will need to write a disaster recovery plan that includes every step and detail that must be performed in order to rebuild the servers. And keep it updated with every change. Write it in plain English. Make no assumptions about the specialised knowledge of whoever ends up with the task of rebuilding the servers.
Ensure that there are several ‘first-responders’, people who are trained, able, and willing to deal with virtually any problem or issue.
Fully test your disaster recovery plan at least once a year.
The disaster recovery plan can be maintained as a project, making it easier to plot people, resources, hardware, software, dependencies and tasks.
Minimising unplanned downtime
Keep an up-to-date list of the people who will be responsible for making the server available, with contact numbers etc. Keep information as to who is offsite and onsite up to date. Keep logins and passwords, and software keys up-to-date. Do occasional disaster drill, and refine the plan in the light of what goes wrong.
Do a range of backup types to cope with all eventualities. file groups, for example, can be used to group database objects in a logical way, such as similar usage patterns, to allow rapid restores, so they can be a useful supplement, alongside full backups since file group backups can often be restored whilst the rest of the database is online. One must consider the placement of data within file-groups with an eye to rapid restores. Even where the database becomes unavailable, a file group restore incurs less downtime than a full restore, if it is possible to apply it.
You can speed up the time it takes to have a replacement server ready to switch in, by providing standby servers. These must be pre-configured with at least the correct OS and hot-fixes, and be sufficient in size to hold the production systems. There is some difference in opinion about what the terms ‘cold’ ‘warm’ and ‘hot’ standby means, but here are what seem to be the most common definitions.
This is a spare server, of the same specification of the production server, which is configured and ready to receive a copy of the database taken from the backups of the production server.
This is a redundant server with a mirrored, or log-shipped, copy of the database that requires only a manual intervention to failover, and promote it to being the production server.
This is a server kept in sync with the production server and able to detect failure of the production server and automatically failover without the need for manual intervention. The ideal is a geographically dispersed failover cluster.
Server synchronisation for Warm or Hot standby can be achieved by:
- Log Shipping
- This is simple, cheap and dependable. It can maintain a ‘warm standby’, but not ‘hot standby’, at a distance, but you will need to script the role-switching, login-synchronisation and client redirects, in order to minimise downtime.
When failover occurs, one may lose transactions from the time of the last transaction log transfer. As log shipping copies the database rather then the entire server, it does not copy new logins, database users, DTS/SSIS packages, Agent jobs etc. These must be done separately. It can be implemented without downtime, but will always require manual intervention for failover.
SQL Backup now provides a very slick log-shipping facility that requires less bandwidth than the standard log shipping and is faster.
- This is a new technology in SQL Server 2005. Like Clustering, it can manage a failover at the database level, and thereby maintain a ‘hot standby’. It requires brief outage of production server during synchronization but, unlike Clustering, does not need a high-bandwidth link, specially certified hardware, or special skills to set up and manage.
It uses log records to synchronise the databases and can quickly fail over to the standby server. Client applications have to be coded to automatically redirect their connection information and, in the event of a failover, automatically connect to the standby server and database. Database mirroring, however, can fail over quickly with no loss of committed data. Mirroring uses a ‘Witness’ member of a database mirroring session to determine if a failure has really occurred.
When the partner servers in a mirroring cannot see each other, the partner servers contact the witness to see whether the witness can contact the other partner
- Failover clustering
- This requires the whole setup to be designed and built from the ground up. You will need specially certified hardware, modification to the applications, transfer of all databases, and a great deal of patience and methodical work to set up and maintain.
- By using a command-line version of SQL Data Compare, it is possible to keep databases in synchronisation in order to maintain a ‘Warm Standby’. However, it is effective only for small databases, puts loading on the server, and synchronisation can be delayed by up to ten minutes, depending on how often the synchronisation takes place. This requires manual fail over, as with Log Shipping
- This is reasonably simple to set up and requires no backup restoration as synchronisation is done via a snapshot. This is not really a high-availability option but is included here as it is sometimes suggested. Most useful for Web Farm applications, using merge synchronisation, but, generally, there are too many single-points-of-failure in such a system to qualify it as an alternative.
Reducing planned down-time
The use of enterprise-quality servers will assist in reducing planned downtime. By using hot-swap RAM and RAID drives to add memory and disk storage, for example, you can avoid any downtime for such upgrades.
There will always be maintenance tasks that require that the server is taken off-line, however briefly. For example, if you have only one server and no standby then performing any maintenance that requires a system restart, such as replacing failing system components, or installing service packs, will require planned system downtime.
With a standby, you can do rolling upgrades by performing a manual failover to switch the workload off the node that requires maintenance, thereby maintaining the service and avoiding any downtime.
Every database will also require routine maintenance tasks; tuning needs to be performed, back ups need to be executed, indexes need to be maintained, and so on. In most cases, SQL Server will allow these to take place whilst maintaining the service. There is usually a way of selecting a technique and scheduling the task so as to maintain the service. For example, by using the DBCC INDEXDEFRAG statement, which does not hold locks, you can avoid blocking any running queries or updates.
Taking precautions against failures
Although one cannot prevent failures, one can take all reasonable precautions. The most obvious and most important precaution to take is to put in place a well-designed and implemented backup regime, taking into account the special requirements of the application. This is a subject beyond the scope of this article but, nonetheless, it is worth emphasising its importance. Also…
Process failure precautions
- Control access to the server and server room
- Ensure that all of your team clearly understand their roles and responsibilities.
- Implement change controls to ensure that all software and hardware changes to a production server are documented. Because Change control systems require the signoff by several team specialists it allows them to check for potential problem .
- Document and map all of your SQL Server instances, being particularly careful to record application relationships such as replication or log-shipping, data-feeds, message routes, links, remoting, and file transfer routes.
- Make sure your Test servers are identical in configuration to your Production servers.
- Before applying patches, hotfixes, and service packs, test them first on a Test Server.
Change failure precautions
- Document all proposed changes
- List the expected impact on the production system
- Gain consensus and signoff for the changes, as appropriate.
- Test the effect of the changes in terms of functionality and Stress/Load.
- Document the rollback/reversion plan and test it out on those who are likely to be ‘early responders’ to a system failure.
Natural, and man-made, disaster precautions
- Arrange for the service to be mirrored, or held at ‘Warm stand-by’, a long way away. Test out the ability to switch the service remotely. SAN replication is a popular solution, but mirroring is very effective
Hardware failure precautions
- Simulate failure in all likely places to check that secondary hardware ‘kicks-in’ as expected.
- Make sure there is an architecture diagram, and clear instructions for all hardware recovery routines, which are easily understandable to the ‘first responder’.
- Provide generous battery-backup.
- Use redundant power supplies
- Use hardware and software monitoring tools: hardware often gives out warning signs before ‘letting go’.
- Use a RAIDed array or SAN for storing your data, with hot-swappable drives with available spares. A ‘Stripe of Mirrors’ (Raid 10) is probably best practice.
- Install redundancy in storage controllers.
- Place the databases of your server on a different raid array to the transaction log. Locate TempDB on a high performance RAID array. SQL Server cannot function without it.
- Provide both Network card and router redundancy
- Ensure at least ‘Warm Standby’ fallback servers by using clustering, database mirroring, synchronisation or log shipping.
Software Failure precautions
Software failure can happen due to software changes, but also when data changes. Even date changes can cause failure. ‘Code Rot’ is the common term for software system failure when no recent software changes have been made.
- Use Change and source control (see change failure above)
- Before rolling out a production release, do strict ‘limit’ testing (testing under the extremes of data or throughput, and with hardware components randomly unplugged to assess whether software degradation is ‘graceful’ or not)
- Perform Regular regression testing on the test server with different simulated loads
- Avoid overlapping jobs in the SQL Server Agent; do routine DBCC checks and re-indexes of tables at off-peak times.
Network Failure precautions
TCP/IP is designed fundamentally as a resilient system in the event of disaster, but this relies on the network infrastructure being able to route network packets via alternative pathways in the event of the failure of a pathway.
- Secondary DNS/WINS servers must be provided.
- The system must not be reliant on a single domain server or active directory.
- There should be Redundant routers/switches
- Redundant WAN/Internet connections are generally important.
- Ensure that there is no single point of failure in the network by regular ‘limit-testing’
Security Failure precautions
- Ensure the physical security of each SQL Server.
- Create alerts and reports for any unusual patterns of user activity on the server, and investigate them (SQL Data Compare is very handy for this)
- Give users the fewest permissions they need to perform their job.
- Audit all login and logout events
- Use DDL triggers to log and notify all changes to the security configuration of the server.
- Adopt all current security best-practices when implementing the Server
It is a mistake to believe that high availability can be implemented merely with a chequebook. Although there are some attractive technologies around that will minimise your downtime in most circumstances, they are only part of the solution. At the heart of every robust system, there is planning, documentation, scripting, testing and drill. Lack of budget is not a barrier to achieving high availability. There are several solutions that do not have a high cost attached. There is much that can be achieved by exploiting the features available in SQL Server, in the use of SQL Server tools, and in designing a resilient architecture.
- Best practices for highly available HP Integrity servers and Microsoft SQL Server
An excellent example of a test of a highly-available setup, with good examples of the sort of System and Functional tests that underpin high-availability. One cannot quarrel much with their hardware best-practice recommendations, either