When your ‘Best Practices’ are only your ‘Best guess’ with regards to SQL Server configuration

SQL Server configuration plays a vital role in the health of your SQL Estate. Your needs may vary depending on the purpose of the server, or its role within a business, but a healthy configuration ensures availability and performance for your customers.

We’ve been thinking hard about SQL server configuration in Foundry, Redgate’s R&D division. We know misconfiguration leads to estate performance issues that can hit your colleagues and customers hard.

“It’s the silly things that cause issues later on…”

Our definition of “configuration”

The most common settings — TempDb, Collation, File growth settings, Min/Max Memory, MaxDOP, Cost Threshold of Parallelism — are the settings that need to have a keen eye kept on them on if you don’t want issues to strike out of the blue.

Configuration errors can strike when you least expect it.

But keeping your estate healthy through the lens of configuration is a broader topic than simply tweaking dials. Let’s look at each of the scenarios in which configuration is a prominent concern.

    1. Establishing: The definition and documentation of configuration; what’s ‘best practice’ for us.
    2. CreatingThe act of standing up a new server and making it fit for purpose
    3. CheckingKnowing that the server is running on the correct configuration.
    4. ModifyingRolling out improvements and patches across your estate.
    5. Tracing & TroubleshootingThe diagnosis and resolution of configuration issues

Lets look at these in more detail in turn.

1 ) Establishing

The definition and documentation of configuration; what’s ‘best practice’ for us.

Simple, standardized configuration ensures consistency, and ease of deployment. Developing and documenting ‘ideal’ or ‘Best Practice’ achieves this.

“If I can standardize my configuration I can understand, diagnose and resolve issues faster”

Some go further, developing ‘run books’, ensuring best practice is followed by providing step by step instructions on standing up a new server.

Run books are used to provide a step by step approach to attack the configuration in the correct fashion

Even so, problems arise:

“We had an issue where we lost a client, just last week, another team had changed the documentation, we set up the server based on that info it wasn’t fit for purpose, it didn’t have enough drives, it would have cost a lot of money to fix, so the client retracted their business.”

Documents become outdated quickly. The moment anything is changed on the estate, the run book also needs updating. Keeping on top on this is vital, but time consuming.

“Once the server is been provisioned I still go through the checklist to check that it has been done to the standard, it’s better check then than find out in prod!”

Here we’re exploring options around ‘living’ documentation, that’s easily updated, enabling you to define and document SQL Server configuration from a central, connected location.

Would you like to know more?

2) Creating

The act of standing up a new server and making it fit for purpose.

Even with best practice and a baseline to work from, bespoke needs — from customers and applications — require bespoke configuration. It’s down to you to decide how to accommodate this, whilst maintaining control.

Your customers’ needs won’t always stick to the straight best practice configuration you’ve decided on.

Deviating from the norm produces opportunities for errors to creep in. Poor service to the customer, reputational damage, or even the cancellation of a contract altogether are real risks.

One way to mitigate this risk is by only making a few ‘flavors’ of configuration available to the customer; This balances flexibility with security. Achieving this takes time, as different flavors need refining and adjusting over time in order to provide that balance of performance and control.

Configuration is like a box of chocolates, you never quite know what you’re going to get.

“After discussing, we work together to pick one of the 10 different types, this means that I keep it under control, and the customer gets what they want”

Not everyone has the luxury of time, so in this area we’re exploring ways to empower you to apply a configuration template, fine tune, and deploy to one or many servers as required.

Would you like to know more?

3) Checking

Knowing that the server is running on the correct configuration.

Often, when configuration is set at provision time, it’s left alone for long periods of time. If nothing’s broken then there’s no problem, right?

Except, is it? What if settings are incorrect, the result of a configuration choice that doesn’t immediately create a problem, but lays dormant.

“The common errors are caused by people making simple mistakes. No one thinks “AD accounts?”, or “Max mem setting”, or “cost threshold” — Microsoft don’t address these setting by defaults so you need an experienced DBA to remember.”

You’re not always the person responsible for standing up the server in the first place. Many people are responsible for estates built upon over years by contractors and developers until it resembles a labyrinth that it’s down to the DBA to understand and organize.

Here we’re thinking about ways to simplify the identification, diagnosis and resolution, time which you could be spending on more valuable things to ensure the smooth running of your business.

Would you like to know more?

4) Modifying

Rolling out improvements and patches across your estate.

At some point configurations will need to be reviewed. They may require modifying. Patching and preparing for growth or high intensity workloads are common examples of this.

In most cases, this involved working through each server step-by-step, ensuring the changes are rolled out correctly.

“I had to go through and patch 13 individual servers, that meant going into each individual server in turn and change things manually, it took a lot of time.”

Over time, people develop, collect and curate scripts and tools to handle this. Crucially, it takes time and experience to achieve this.

If you’re not the curator of these tools and scripts, then knowing how to ‘drive’ them requires excellent documentation (see ‘run books’ above) and a robust testing process.

In the modification scenario, we’re looking at ways to provide a better connection between ‘how things should be’ (documentation) and how they are in the target environment. Can we learn from the Infrastructure as Code movement?

Would you like to know more?

5) Troubleshooting

The diagnosis and resolution of configuration issues

Everyone strives to be on top of their estate, able to react quickly to issues and ensure applications meet their SLAs.

Managing your estate can be like the Wild West, how do you ensure you keep law and order?

A substantial amount of time is dedicated to ensuring alerts — notifications of issues — are configured correctly, raising the alarm before issues become crises.

But what about dormant misconfigurations? They won’t spark an alert because things are running ‘as expected’. Until they aren’t. The longer these issues remain dormant, the greater the distance between symptom and root cause.

We’ve heard horror stories about restaurants unable to process any payments at peak times due to issues caused by a config setting. It’s also often the customer that notices the performance drop before the DBA due to the lack of reporting or alerts set up to proactively monitor these settings.

Last year Garry Bargsely wrote “A day in the life of a production DBA” and noted how an urgent alert was traced back to a TempDB setting:

“Around 1:45 PM we had a Severity 1 alert fire off relating to TempDB filling up on a production server. While I was not on-call, I decided to look around to see what caused this issue. At about the same time the on-call DBA responded to the alert and had fixed the issue. We had user running an Ad Hoc query that filled up the TempDB drive. This showed a gap in our TempDB settings as we thought our configuration globally was to auto-grow TempDB manually to the proper size for the drive then disable auto-growth. So I have added a KanBan task to perform an audit of all our servers.”

This incident was resolved quickly, but a full audit of the whole estate to review configuration was still required as a result. For some, the worst case scenario involves engaging with (expensive) specialist consultants.

In this scenario, we’re exploring ideas around developing faster detection method for dormant symptoms, followed by faster diagnosis.

Would you like to know more?

Let’s talk

We’re in the early stages of researching where the main challenges lie within these 5 stages of a configuration process. To ensure that we continue to work on the right problems and develop the right tools to support our customers both now and in the future.

If you have any thoughts, ideas or experiences you feel would be beneficial to aid us in this journey, please sign up and we’ll arrange a time to talk.