You have been asked to write a SQL task: by which I mean a discrete unit of work; possibly it’s just a single routine such as a function, stored procedure, a view or maybe even a trigger. You have a specification and a timescale. What next?
Developers often just plunge in and start cutting SQL. Sure, this often works, but the chances aren’t great: a SQL programming task can blow up in your face, metaphorically speaking. If you feel lucky and expect things to go well, then your time to complete the task increases exponentially as you hit reality. If you prepare first, it tends to be linear, and projects are likely to be more predictable.
In this article, I’ll explain the way I do it, in a number of stages. I do the same thing for any T-SQL job, but for trivial code, it is just a checklist. Other developers do it in different ways: you’ll have to ask them how they do it. All I can say is that this way works for me!
What I’m about to suggest may seem like a lot of ‘extra’ work. I’d reply that, if I could guarantee code that works, performs and scales if you were to put flowers in your hat and walk round in circles singing ‘Pop goes the weasel’, you’d do that wouldn’t you? My method is more conventional than that, though I’m inclined to play the ukulele whilst thinking, which seems to help.
Up-front time estimation and planning
Plan your time, and especially the time you allow yourself for the various stages that I outline below. It is simply a matter of attaching start times and durations to the tasks. After you use a system like this for a while, you get better at estimating a job, and the time each section takes, because you’re thinking consciously about time-management and about recording how long things take and the time you waste. More importantly, you’ll know if you are slipping at almost every stage, or if you have time in hand which you can use to add quality. Planning for a routine that would only take a day to write and test might seem ridiculous, but think what you could pack into a day! Time is precious and it is nice to know if you’re getting bogged down. You could use a tool such as MS Project to do this, but a hand-drawn Gantt chart sketch is just as good.
Produce the test data, and the automated tests
How are you going to test your routine? You need to plan and implement a test suite and test harness, with a dummy version of your routine, before you start cutting code. The reasons for this are:
- It soon becomes apparent if you’ve made the wrong choice for the type of routine and even the main parameters you’ve chosen.
- If you are forced to focus on studying the data hard enough to reproduce a randomised or obfuscated version suitable for testing, you’ll be confronted with awkward facts about it (such as nullability, ‘dirt’, validity or conflicts).
- You will have a better grip on delivering the tested routine in the timescale allowed, since you can do very rapid automated component tests if things don’t go quite as you hope.
You need to be clear about the different types of testing that you’re likely to perform. There is ‘component testing’, which we all do, and which is used to verify the functionality of the T-SQL statement you’re working on, and ‘unit testing’, to ensure that the routine as a whole is working to specification. We won’t be dealing with regression testing, but we may be required to add ‘hooks’ into the routine to support it. I like to build in ‘assertion testing’, which allows a database to ‘clear its throat’ after a restore, upgrade or some other downtime, to check that each major routine does, basically, what it is intended to do, on a small subset of test data. SQL development adds two important new tests, performance tests and scalability tests, which need plenty of planning.
As I explain elsewhere, it is not good practice to use your live database data as a test-bed. In fact it is awful from both a technical and legal perspective. It is much better to build test data that looks right, and has roughly the same distribution as the real stuff. This isn’t always easy, since things like postcodes/Zipcodes have to be valid, and a skewed distribution will give misleading results.
This stage might seem painful, but there is a lot of opportunity for reuse and plenty of tools that will help you do this. I use reverse-regexes a lot for this because they are easy to modify and it is surprising how often they get reused. I keep a number of SQL statements for doing routine data generation but a third-party tool is the best thing to use where there are lots of dependencies.
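To give a flavour of what simple T-SQL data generation looks like without a third-party tool, here is a minimal sketch. The table, column names, and distributions are all invented for illustration; a reverse-regex or generation tool would do a far better job of realistic values.

```sql
-- A crude sketch of generating plausible test data in plain T-SQL.
-- All names and distributions here are invented for illustration.
IF OBJECT_ID('tempdb..#TestCustomers') IS NOT NULL DROP TABLE #TestCustomers;

SELECT TOP (10000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS CustomerID,
       -- a poor man's reverse-regex: one letter followed by four digits
       CHAR(65 + ABS(CHECKSUM(NEWID())) % 26)
         + RIGHT('0000' + CAST(ABS(CHECKSUM(NEWID())) % 10000 AS varchar(4)), 4)
         AS AccountCode,
       -- skewed distribution: most values small, a few large
       POWER(ABS(CHECKSUM(NEWID())) % 100, 2) / 10 AS OrderValue,
       -- roughly 5% NULLs, to reproduce real-world 'dirt'
       CASE WHEN ABS(CHECKSUM(NEWID())) % 20 = 0 THEN NULL
            ELSE DATEADD(DAY, -(ABS(CHECKSUM(NEWID())) % 730), GETDATE())
       END AS LastOrderDate
INTO #TestCustomers
FROM sys.all_columns a CROSS JOIN sys.all_columns b;
```

The `CHECKSUM(NEWID())` trick forces a fresh random value per row, which `RAND()` alone would not give you.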
With the data and the dummy routine, you can test to make sure it fails all your automated component tests. If it passes, you’re probably in trouble.
As part of the preparation of the data, you will need to create tables containing the outputs you’d expect from your routine, so that you can quickly compare them with the actual results and verify that it produces the right answers. These are required for unit testing.
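The comparison itself is easy in T-SQL with a pair of `EXCEPT` queries. Here is a hypothetical sketch; the table names are invented, and you would substitute the output of your own routine for `dbo.ActualResults`.

```sql
-- Hypothetical sketch: compare the routine's actual output with a table
-- of pre-computed expected results, in both directions, so that extra
-- rows as well as missing rows are caught.
IF EXISTS (SELECT * FROM dbo.ExpectedResults
           EXCEPT
           SELECT * FROM dbo.ActualResults)
   OR EXISTS (SELECT * FROM dbo.ActualResults
              EXCEPT
              SELECT * FROM dbo.ExpectedResults)
  BEGIN
  RAISERROR ('Unit test failed: results differ from expected output', 16, 1);
  END
ELSE
  PRINT 'Unit test passed';
```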
For component testing, you’ll probably be content with an ad-hoc execution of code. Unit tests will need to be scripted: there are plenty of open-source alternatives such as NUnit, some of which are featured on Simple-Talk. These will also drive your regression testing if you want. Many of these are in C# or VB, but I prefer to use a T-SQL script for unit testing. For the most control, I’m beginning to like PowerShell because, with SQLCMD, this provides the most versatile solution. It is a great feeling to fire up the script, lean back in your seat with a ukulele, and watch out for the red text.
Produce a development harness
In the course of developing T-SQL code, you will occasionally want to know the values in variables, temporary tables and other awkward places. If you have all the time you want, and like developing at about the speed that small mammals evolve, then use an IDE. Otherwise, develop techniques that use a temporary log and your own debug code. The simple version of this used to be called ‘PrintEffing’ (after printf), but here we ‘print’ to a simple log that records the time automatically: from then on, simple SQL gives you timings for the various components of your routine.
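A minimal sketch of the ‘PrintEffing’ technique might look like this. The names are my own, not a prescribed standard, but the shape is the point: every log entry gets a timestamp for free.

```sql
-- A sketch of 'printing' to a temporary log that records the time
-- automatically. Table and message names are invented for illustration.
CREATE TABLE #DebugLog
  (LogID    int IDENTITY PRIMARY KEY,
   LoggedAt datetime2 NOT NULL DEFAULT SYSDATETIME(),
   Message  nvarchar(400) NOT NULL);

-- scatter these through the routine in place of PRINT statements
INSERT INTO #DebugLog (Message) VALUES (N'Starting import phase');
-- ... the work being timed ...
INSERT INTO #DebugLog (Message) VALUES (N'Import phase complete');

-- afterwards, simple SQL gives you the elapsed time of each component
SELECT Message,
       DATEDIFF(ms, LAG(LoggedAt) OVER (ORDER BY LogID), LoggedAt)
         AS ElapsedMs
FROM #DebugLog
ORDER BY LogID;
```

`LAG` (SQL Server 2012 onwards) does the pairwise subtraction; on older versions a self-join on `LogID` does the same job.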
You’ll also want to know about inefficient code. At this stage, you shouldn’t rely on the profiler, time statistics, the execution plan or extended events to test the performance of code. You need something far more rapid and reliable. I like to use two different types of harness: a development harness, and a ‘verbose’ harness to test a stored procedure in actual use. A development harness is the SQL equivalent of scaffolding, to show how long all the various parts of a long routine are taking. Developers tend not to agree on how to monitor the performance of routines under different data sizes. I like to know how long a routine took, often every time it is called, and what the parameters were. I also like to be able to pick two points within a routine and know how long it took to run between them. I like the information immediately, so that I can rapidly try out different techniques. I also occasionally like to graph results in Excel. There are subtleties, of course. Do you clear the cache before you do a test run? Do you force a compilation? It is as well to have the T-SQL to hand to twiddle these knobs, commented out when you don’t want them.
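Those knobs, and the two-point timing, can be sketched roughly as follows. The cache-clearing commands are left commented out, as suggested, because they affect the whole instance and should never be run casually on a shared server.

```sql
-- Hedged sketch of a timing run with the cache-clearing 'knobs' to hand.
-- Uncomment deliberately, and only on a private development instance.
--CHECKPOINT;              -- flush dirty pages first
--DBCC DROPCLEANBUFFERS;   -- clear the data cache for a cold-cache test
--DBCC FREEPROCCACHE;      -- throw away cached plans to force compilation

DECLARE @Start datetime2 = SYSDATETIME();

-- ... the section of the routine under scrutiny goes here ...

SELECT DATEDIFF(ms, @Start, SYSDATETIME()) AS ElapsedMs;
```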
At this point, I check the database’s strategies for audit, error-reporting, and performance monitoring, and comply with them in the harness, as it makes sense to design development and production harnesses together, and an audit obligation could affect performance anyway.
Determine the likely best algorithms, assemble the candidates
Don’t just type in SQL yet! You need to consider the most effective ways to do what is required from the routine. Sketch, stare out of the window, or discuss with others. Read about what others have done in similar circumstances, and check out the published algorithms. Experiment, by all means, with all that lovely data you’ve just prepared, warm from the oven. However, get rid of the SQL you produce, because you are bound to produce something better once you’ve finished the groundwork.
I’ve found, over the years, that there are a couple of good ways of performing a particular T-SQL function, and many lousy ways. The trouble is that what doesn’t work in one routine is magic in another. Also, beware that the best procedural algorithms for the task aren’t necessarily going to be the best for T-SQL. T-SQL differs so much from a procedural language because things that are difficult in one are often trivial in the other.
Sketch out the candidates
Rapidly produce the SQL to implement the most likely performers, without any tricks or fancy code. If they start getting hopelessly complicated, then back off and either redesign or reject.
If the routine is of any length, then use plenty of temporary tables and break the process down as much as possible. Don’t try to do a huge amount of work in one SQL statement, unless you are very keen to impress your colleagues. With temporary tables, you can inspect intermediate workings and very quickly track bugs. Table variables are fine, but more volatile.
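An invented illustration of what this decomposition looks like: each stage lands its result in a temporary table, so you can stop at any point and inspect the workings. The table and column names here are made up.

```sql
-- Break one monster query into inspectable stages with temporary tables.
-- dbo.Orders, the columns, and the threshold are all hypothetical.

-- stage 1: aggregate
SELECT CustomerID, SUM(OrderValue) AS TotalValue
INTO   #CustomerTotals
FROM   dbo.Orders
GROUP  BY CustomerID;

-- stage 2: filter the intermediate result
SELECT ct.CustomerID, ct.TotalValue
INTO   #HighValue
FROM   #CustomerTotals ct
WHERE  ct.TotalValue > 1000;

-- at any point you can pause and look at the intermediate workings
SELECT TOP (20) * FROM #HighValue ORDER BY TotalValue DESC;
```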
At this stage, do not be too concerned about the minutiae of performance. The choice of algorithm is always more important for the big picture, and your concern is how neat, easy to understand, efficient and maintainable the code is.
Do not second-guess or assume what will take a lot of time. With a decent development harness that gives you duration timings, and with the execution plans, you never need to.
Select the most likely algorithm
Run preliminary performance and scalability tests
Use the development harness to run performance tests using different sizes of data. The Query Optimiser will make decisions on the actual execution strategy, and these will be different with various sizes and distributions of data. (This is why routines that seem to check out well in development sometimes perform like dogs on production data.) Check the execution plans, but don’t let them dominate your thinking to the extent that they force your design choices. SQL Server should be your servant, not your master.
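A scalability run can be sketched as a loop that rebuilds the test data at increasing sizes, times the candidate, and keeps the results for graphing. Everything here is illustrative: `dbo.BuildTestData` and `dbo.CandidateRoutine` are hypothetical stand-ins for your own generation script and the algorithm under test.

```sql
-- A sketch of timing a candidate routine at several data sizes.
-- BuildTestData and CandidateRoutine are invented placeholder names.
CREATE TABLE #Timings (Rows int, ElapsedMs int);

DECLARE @Rows int = 1000;
WHILE @Rows <= 1000000
  BEGIN
  EXEC dbo.BuildTestData @RowCount = @Rows;   -- rebuild data at this size

  DECLARE @Start datetime2 = SYSDATETIME();
  EXEC dbo.CandidateRoutine;                  -- the algorithm under test
  INSERT INTO #Timings
    VALUES (@Rows, DATEDIFF(ms, @Start, SYSDATETIME()));

  SET @Rows = @Rows * 10;
  END

SELECT * FROM #Timings ORDER BY Rows;  -- paste into Excel to graph
```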
At this stage, you will be using batch statements in a test harness rather than wrapping the code up into a stored procedure or function.
Try to improve the efficiency and neatness of your algorithms. At this stage, you’ll get increasingly confident of which algorithm is best, so select it, but remember the other candidates, just in case.
Think, at this stage, about where you really, really need to wrap things up into transactions. It can also pay to clarify your strategy for error handling and error reporting now.
For building the code, you’ll have a different mindset. You’ll concentrate on creating the most effective implementation of the chosen algorithm. I delete all my preliminary experiments. This works for me, but it might be too extreme for you. I discovered by accident that, if I lose my code and have to rewrite it, the second attempt is always better, because I know where I’m heading and have a route map in my mind. It takes surprisingly little time, as long as you don’t forget it all.
Remember that you are still writing everything as a series of batch statements. Your test harness must allow you to be able to switch easily from this to wrapping things up into a routine since once you have a function, you have very little chance of seeing what is happening in its bowels. It has to ‘go on the slab’.
Then I run performance tests on a large dataset. I like to use a slow server: an aging server that was originally of high quality is always the best, since you can feel the delays without having to run all the timings, and home in more quickly on the performance pain-points. You run the internal timings, inserting extra timing points if necessary in order to drill down to the point where the problems are likely to be. At this stage you’ll appreciate the reason for my advice to write a series of statements using temporary tables for intermediate results, rather than trying to untie a Gordian knot of entangled SQL to get it into one SQL statement.
Using performance-development techniques like this, backed up when necessary with execution plan information, wait stats and CPU usage info, you’ll soon see that most SQL Statements go too fast to be measurable, and that there will be only a couple of points at which things get bogged down. This surprises beginners who seem to imagine a fixed-performance-cost for every SQL statement.
Once you’re satisfied with performance, and feel sure that scaling is going to be good, you comment out the development harness, format the code to the standard in force, insert the header information and put the code into its routine. As with component testing, you still need to check that the end-product is what it ought to be, by comparing the result with the result that you have already determined to be correct. However, it is quite possible that you can’t produce enough correct data by an independent, possibly semi-manual, process. In this case, you’ll be forced to separate your performance and scalability testing from assertion testing.
When I say ‘comment out the development harness’, I must point out that in real life, things don’t always go that well, and you’ll find that you need to put routines ‘back on the slab’. I save the ‘harnessed’ version as a file. Then, the various timing points are turned into comments, simply because most of the code is a description of what the code is about to do, which makes an excellent comment. I use a regex that leaves a marker directly after the line-comment symbol so that the process can be reversed, turning the comment into a timing point.
In this phase, if this routine is a key component in your database, you are likely to want to plot the execution time of a couple of algorithms against a number of data sizes to gauge performance. At this point, I’m generally more interested in complete time rather than CPU time and wait time separately, though this isn’t necessarily the case for a complex routine that is involving other systems.
Now is the time to check your routine into source control. For the sake of any automatically generated ‘catalogue’ of routines and parameters, you’ll also want to add in the extended properties for the header and the parameters. At this stage, it is useful to complete the filling-in of your timesheet, so you have a record of how long each stage took. If you do fixed-price work, your income will benefit from the precision with which you can estimate how long a task takes, and the more you fill in timesheets, the better you get! If you are paid by the hour, then you know exactly how much to bill! If you are lucky enough to be a permanent employee, you will make everyone happier by being able to predict accurately how long a T-SQL job will take.
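For the extended properties, `sys.sp_addextendedproperty` does the job. A hedged sketch follows; the procedure name, parameter name, and descriptions are invented, and `MS_Description` is simply the conventional property name that most documentation tools look for.

```sql
-- Document a routine and one of its parameters with extended properties.
-- GetOutstandingInvoices and @CustomerID are hypothetical names.
EXEC sys.sp_addextendedproperty
     @name = N'MS_Description',
     @value = N'Returns outstanding invoices for a given customer',
     @level0type = N'SCHEMA',    @level0name = N'dbo',
     @level1type = N'PROCEDURE', @level1name = N'GetOutstandingInvoices';

EXEC sys.sp_addextendedproperty
     @name = N'MS_Description',
     @value = N'The customer to report on',
     @level0type = N'SCHEMA',    @level0name = N'dbo',
     @level1type = N'PROCEDURE', @level1name = N'GetOutstandingInvoices',
     @level2type = N'PARAMETER', @level2name = N'@CustomerID';
```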
The DBA generally is faced with existing code that has to be healed without any refactoring work. DBAs need different tools from developers in order to support different techniques. They have less opportunity to home in on part of a routine and work on it to reduce its impact on the rest of the server.
The method of developing SQL code is as personal as the way of laying it out, or the individual preferences in the use of various language constructs. So much depends on the team you work for, the policies of the department and individual preferences. If you work successfully using a different approach, then fine, and if you were to describe how, in a comment, I’d be fascinated to learn how you do it.
If this article creates any interest, Phil has pledged to continue the theme by lurching back into the practicalities of a development harness: something he was reluctant to do before explaining the context.