Auditing SQL Server – Part 2 – Hardware Audit

Comments 0

Share to social media

This is the second part of my series on auditing SQL Server. In the first part, I discussed basic server discovery and documentation. It covered some items to check at the hardware level and configuration items, but this section gets into more detailed hardware auditing details. It took me a while to decide how to best present the approach for this audit. My first two drafts included hardware, server, and database audits in the same document. This is simply too much, so I have divided these three sections. The current section focuses on hardware and cloud configuration items. I have placed the list of items I would recommend collecting at the start of this segment. Depending on your role in the organization, you will likely need to have others gather some of these configuration items and metrics for you.

Next, I discuss how to organize and evaluate these items. The most important part of an audit is the analysis. Server and hardware items can have a large impact on costs. I’ve seen cloud misconfigurations that cost an extra $15K per month for logging alone and I’ve heard about more expensive cloud mistakes. Hopefully you won’t find errors or over-allocations this egregious. But that’s a primary reason for auditing. You can’t assume everything is configured correctly. There are many things that “shouldn’t be” or “I’m sure aren’t” misconfigured, but memory is an imperfect thing. If a project is delayed or cancelled, another solution is used, or teams change, it’s very easy to allocate a server or database and then not use it or for items to be over-allocated. This is especially easy to do in a cloud environment since you don’t have a physical server to remind you of the issue.

Finally, I present some scripts that can be used to dig into the details. I don’t have a script for every item, but I cover most of them. There are always multiple methods to audit SQL and hardware items. I find scripts the easiest and most consistent method, especially if I need another group of admins to gather the results for me. The output is always predictable and reliable. With screenshots from the GUI, it can be frustrating and require multiple sessions. Script output is also easier to put into a tool, such as a database, for further comparison and analysis.

Hardware Audit Scope

The focus of this part of the audit is understanding the hardware (or Azure tier) your SQL Server runs on. If the server you are examining is in the cloud, you will want to look at cloud specific options. These items were covered briefly in the discovery audit, but will be covered in more detail here.

To make this a little more palatable and easier to read, the first part is just of list of items to validate. The evaluation section discusses what these items mean in terms of your audit results. I have probably missed some items that are important to your audit, so be sure to add them to your personal template.

In the discovery portion of your audit, the servers in scope for this audit were documented. It is assumed that all of the servers in the initial audit are having the same detailed analysis here. You want to know if there are differences between dev, test, prod, etc. That is another function of the audit – understanding those discrepancies and possibly providing a path to remedy the differences. I put a small section detailing some important aspects of that comparison as well as items that can be disregarded.

General Audit Template

  • Server / Azure infrastructure
    • Hardware capacity, configuration, and usage / Azure Tier
      • CPU / Cores
      • Memory
      • Drives / DAS / SAN / SSD / Memory drives
      • Network configuration
    • Server hardware
      • Make
      • Model
      • Azure / cloud allocation
    • Server configuration
    • Services running on server
    • Backups
    • OS version
    • Patch level
  • Environment differences
    • Server configuration

Evaluating Hardware and Azure Results

The server and infrastructure evaluation focuses on hardware specs, including cloud allocation, other services running on the server, backups, and the OS / patch level. It has a direct impact on costs, business continuity efforts, security, and overall performance. You are looking for items that are wildly misconfigured, out of date, misallocated, or simply wrong for the server.

Hardware capacity

The hardware specs have a direct impact on sizing and cost. The CPU cores impact licensing. In Azure, the pricing tier determines cores, memory, and I/O. A lot of money can be wasted on having a large server when it isn’t needed, or being in a performance tier that isn’t needed. Your audit should check for over-utilized resources (i.e., CPU at 95-100%) or consistently under-utilized resources (i.e., maximum CPU at 10% for the week). In a cloud scenario there is flexibility. The development environment can be removed and recreated automatically further reducing costs. Serverless models allow the CPU to be adjusted automatically based on usage which can greatly reduce costs. Check these options in a cloud environment. Resources that are over-utilized are a critical item and should be addressed as soon as possible. Under-utilized resources are generally not critical, but need to be noted and fixed when possible – unless there is a severe issue.

Server hardware

As a developer or DBA, you won’t be too concerned with the actual hardware for your SQL Server as long as it provides the resources needed. As a manager or owner, you want to be sure the hardware is up-to-date and supported by the vendor. Mission critical hardware will have support plans and should be actively monitored by admins.

Document the hardware, the support model, including the SLA and be sure it is appropriate for the server usage. Critical systems will need to have a much faster response time from the hardware vendor than non-critical systems. Highly critical systems are also candidates for clustering or Always On availability groups.

If the hardware is out-of-date, not actively monitored, the SLA doesn’t match the need or is incorrect in some other way, this is a critical item. In Azure or other cloud environments, the performance tier should match the business need. The action plan will depend on the issue. It might to acquire new hardware, add cores or memory, reconfigure the SAN LUN or add a more performant LUN, or adjust the cloud performance tier. This usually requires working with other teams and often involves modifying the budget.

Server configuration

Documenting server configuration is important. Server specific information needs to be gathered so you can create a new server in a disaster scenario. This needs to be validated and documented in this part of your audit. These configuration items should be part of the standard server build or included in backups.

Services

For an on-prem server, and any VM you control, you want to document non-standard services installed. The general rule-of-thumb is no services except SQL Server. If there are additional services installed, document them and be sure it is appropriate for them to be on that server.

If the server is functioning well, this won’t be an immediately critical item, but you will want to recommend moving other services off of the server running SQL. If there are performance issues and the additional services are resource intensive, it becomes critical. If you are performing the audit due to performance concerns, it may already be critical to remove extraneous services. Document these additional services and work with the teams on a plan to get them moved.

Backups

A plan for restoring the server needs to be present. It may not be actual backups, but the plan, and ability, to spin up a new server and install SQL on that server needs to be ready. The configuration, firewall rules, security, SSL / TLS certificates, and anything else needed for the server to run must be ready.

If backups aren’t happening or there isn’t an automated build process that is tested and meets the SLA, it is a critical item.

OS Version and patches

The health of the SQL Server depends on the health of the underlying server. In Azure managed instances and Azure SQL Database, this is managed for you. For on-prem servers or VMs you manage, you want to be sure the server version is supported and the patch level is up-to-date for the organization. That may not be the most recent patch, but it should fall within the enterprise standards. This has a direct impact on security and usually has an impact on performance as well.

Document the OS version and patch level. This is another critical item. If the OS is not supported or the patch level is behind enterprise standards, document it and make sure it is highlighted as an issue. Enterprises have different standards for when patches are applied, so it doesn’t necessarily have to be the most current patches. But it does need to be reasonably current and within standards.

Server Configuration Environment Differences

When there are multiple environments for a project, application, or set of databases, the environments at a server and SQL Server level will generally be different. The database level items should be minimal or even non-existent.

At a hardware level, expect to see differences in hardware and available resources between production and non-production environments. The exact differences will depend on your organization, but it is ideal for the configurations to be as close as possible, including hardware make and model. This will allow you to run performance testing, full data validations, and troubleshoot issues with actual production data much more easily. It isn’t necessary that they are identical. It can be much more expensive to have production level environments in the lower environments. Configurations options between the servers should be identical. That includes services running, security and, firewall rules should be analogous, and server setup. Document the differences. Look for any inadequacies in the lower environments or over-allocation of resources and ways to reduce costs. Keep the business goals in mind so you don’t recommend reducing resources too much.

Gathering the Server Audit Information

This is an item you will need to work on with your administrators. Many organizations separate the role of server admin, storage admin, DBA, developer, and more. This means you may have to work with all of these different groups, and possibly more, to get the information needed to complete your audit. From a hardware perspective, it is much easier to reach out to each group and get the server configuration information. You can gather some of this information from the query engine or PowerShell, but this section will require help if you aren’t able to logon to the server yourself. Request and document the following, or gather what you can with the following scripts:

  • CPU / cores
  • Memory
  • Drives / DAS / SAN / SSD / Memory drives
  • Utilization statistics
  • Server hardware
  • Network
  • OS version
  • Patch level
  • Server specific configuration options
  • Services
  • Backups

CPU and Memory Information

The following will show CPU information and memory information for the server.

Server Hardware

Basic hardware information can be gathered via PowerShell scripts if you have the correct permissions for the server. Some things, such as external drive arrays (DAS / SAN), won’t be shown here and you’ll still need administrators to gather this for you. There are multiple ways to do this in PowerShell. Other methods might include the BIOS information and serial number information, so choose the method that works for your audit. This shows the physical memory in addition to the make and model and is the reason I included this particular method.

Drive and physical storage information can be much more complicated to gather. Work with administrators to get details. You want the specific type of storage with details. That would include RAID configuration, drive model, drive speed, caching levels, LUN configuration, storage network connectivity and redundancy, and any other physical details needed. Many of these details are only available to administrators of those systems.

Network

Basic network information can be gathered via ipconfig, but it won’t give you the whole picture. Request a network diagram that includes the servers being investigated. Depending on the organization, this may be restricted due to security, but you should be able to get a logical diagram with the relevant details.

The following PowerShell script will give you detailed network adapter information, but it will not show you firewall rules or how the server fits into the network. Get that diagram!

OS Version and Patch Level

The following TSQL will show both the OS version and patch level, as well as the SQL version and patch level. You want to verify that it is a supported version and the patches are up-to-date.

Azure results are a little different. You are really only concerned with the performance tier if you are in Azure, so this isn’t a concern.

Server Optimizations

Check with the administrators for any server specific configurations and optimizations. There generally isn’t much to do at a server level other than configuring the server for services and making sure no unnecessary services are installed.

Additional Services

It is much easier to rely on your administrators or management software to get a list of services running on the SQL Server. If you want to verify yourself, you can look in the services application or grab them via PowerShell.

There are many services required to run Windows, which makes it difficult to evaluate this list. It is often easier to get RDP access to the server and take a look at the installed applications as a starting point.

Backups

There are many methods and strategies for performing backups at a server level. Work with your admins to be sure this is covered and document the method for recovery. It may be an automated build of a new server or it may be a bare-metal restore, but there should be a plan. Extra points if the restore process is regularly tested.

Summary

Hardware and configuration audits can be as extensive or as limited as you want, as long as you are meeting the purpose of the audit. It’s easy to spend a significant amount of time performing this audit – it took me longer than I expected to summarize everything due to the broad nature of this section. Remember to focus on the critical items. Major items are backups, basic configuration, and property sized hardware. Cloud offerings change rapidly, so be sure to review what’s available frequently. You might be able to get increased performance for the same price or decreased cost without impacting performance.

References

Load comments

About the author

Ben Johnston

See Profile

Ben is a data architect from Iowa and has been working with SQL Server since version 6.5 in the late 90's. Ben focuses on performance tuning, warehouse implementations and optimizations, database security, and system integrations.