Sometimes a query may be very simple, yet tuning it will be quite a challenge if you are querying large amounts of data; it requires some ‘out-of-the-box’ thinking to make sure that the performance is reasonable. For example, suppose that we store data in the following tasks table:
1 2 3 4 5 6 7 8 9 10 11 |
CREATE TABLE dbo.Tasks ( TaskId INT NOT NULL IDENTITY, TaskDescription VARCHAR(50) NOT NULL , StartedAt SMALLDATETIME NOT NULL , FinishedAt SMALLDATETIME NOT NULL , CONSTRAINT PK_Tasks_TaskId PRIMARY KEY ( TaskId ) , CONSTRAINT CHK_Tasks_StartedAt_Before_FinishedAt CHECK ( StartedAt < FinishedAt ) ) ; |
Imagine that we need to retrieve all the tasks that were happening at some particular point in time. It is simple, for example, to retrieve all the tasks that are happening at midnight of Dec 3rd, 2010.
1 2 3 4 5 6 |
SELECT TaskId , TaskDescription , StartedAt , FinishedAt FROM dbo.Tasks WHERE '20101203' BETWEEN StartedAt AND FinishedAt |
However, if the table grows large, this query will perform very poorly for some dates, no matter which indexes we build on the table. The reason is simple: to satisfy the query, SQL Server may have to scan a large range of an index. For example, if we have a covering index with StartedAt as its first column, and we have a lot of tasks beginning from 1970, SQL Server will have to scan a range beginning from the earliest StartedAt and ending at midnight of Dec 3rd, 2010. This can require a lot of pages to be scanned, and so it will inevitably be slow.
If all tasks are short, we can speed up the query
Suppose, however, that we know with absolute certainty that all tasks last no longer than 24 hours. If this is the case, then we can narrow the range that SQL Server has to scan from 40 years to just one day, as follows:
1 2 3 4 5 6 7 8 |
SELECT TaskId , TaskDescription , StartedAt , FinishedAt FROM dbo.Tasks WHERE '20101203' BETWEEN StartedAt AND FinishedAt AND StartedAt >= '20101202' AND FinishedAt <= '20101204' ; |
Inevitably, this query runs dramatically faster than the previous one, but can we guarantee that it is correct, and that it always returns all the rows it should? In fact, the query is correct as long as no task stored in the table lasts more than 24 hours. We can enforce this business rule with a simple CHECK constraint. Our first attempt at doing this constraint looks OK, but allows tasks longer than 24 hours:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
ALTER TABLE dbo.Tasks ADD CONSTRAINT CHK_Tasks_TaskDuration CHECK(DATEDIFF(hour, StartedAt, FinishedAt)<=24); GO SET LANGUAGE US_English; GO INSERT dbo.Tasks ( TaskDescription , StartedAt , FinishedAt ) VALUES ( 'sample task' , -- TaskDescription - varchar(50) '2010-12-03 ' , -- StartedAt - datetime '2010-12-04 00:59' -- FinishedAt - datetime ) GO SELECT TaskDescription , StartedAt , FinishedAt FROM dbo.Tasks GO ALTER TABLE dbo.Tasks DROP CONSTRAINT CHK_Tasks_TaskDuration |
However, our second version of the constraint does not allow tasks longer than 1440 minutes, which is exactly 24 hours:
1 2 3 |
ALTER TABLE dbo.Tasks ADD CONSTRAINT CHK_Tasks_TaskDuration CHECK(DATEDIFF(minute, StartedAt, FinishedAt)<=1440); |
As long as we have this constraint, we can safely tell the optimizer to scan only a narrow range, only 24 hours of data – our query will read only a small part of the data and run fast.
If some tasks can last for a long time, we need a different approach
The query optimization we have just discussed can only be used if all tasks are short. If some tasks last for a long time, we have to scan a large part of an index. If there is no limit on tasks duration, we may have to scan even wider ranges of data. Either way, if we really need to speed up the query, we have to try out other approaches.
Suppose that instead of storing one long task as one row, we store that task as a sequence of short periods of time, each period being no longer than 24 hours, as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
CREATE TABLE dbo.Tasks ( TaskId INT NOT NULL IDENTITY, TaskDescription VARCHAR(50) NOT NULL , CONSTRAINT PK_Tasks_TaskId PRIMARY KEY ( TaskId ) ) ; GO CREATE TABLE dbo.TasksByDay ( TaskId INT NOT NULL, StartedAt SMALLDATETIME NOT NULL , FinishedAt SMALLDATETIME NOT NULL , CONSTRAINT PK_TasksByDay_TaskId PRIMARY KEY ( TaskId, FinishedAt ) , CONSTRAINT PK_TasksByDay_Tasks FOREIGN KEY ( TaskId ) REFERENCES dbo.Tasks( TaskId ), CONSTRAINT CHK_TasksByDay_StartedAt_Before_FinishedAt CHECK ( StartedAt < FinishedAt ), CONSTRAINT CHK_TasksByDay_TaskDuration CHECK(DATEDIFF(minute, StartedAt, FinishedAt)<=1440) ) ; GO |
If all our data is correct, we can safely tell the optimizer to scan just a narrow interval, just 24 hours worth of data, as discussed before. Let us keep the constraint that makes sure tasks cannot last longer than 24 hours, and add the columns and the constraints to make sure our periods of time have no gaps, as described in this article by Joe Celko:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
CREATE TABLE dbo.TasksByDay ( TaskId INT NOT NULL, StartedAt SMALLDATETIME NOT NULL , FinishedAt SMALLDATETIME NOT NULL , PreviousFinishedAt SMALLDATETIME NULL , CONSTRAINT PK_TasksByDay_TaskId PRIMARY KEY ( TaskId, FinishedAt ) , CONSTRAINT PK_TasksByDay_Tasks FOREIGN KEY ( TaskId ) REFERENCES dbo.Tasks( TaskId ), CONSTRAINT CHK_TasksByDay_StartedAt_Before_FinishedAt CHECK ( StartedAt < FinishedAt ), CONSTRAINT CHK_TasksByDay_TaskDuration CHECK(DATEDIFF(minute, StartedAt, FinishedAt)<=1440), CONSTRAINT UNQ_TasksByDay_TaskId_PreviousFinishedAt UNIQUE ( TaskId, PreviousFinishedAt ) , CONSTRAINT FK_TasksByDay_TaskId_PreviousFinishedAt FOREIGN KEY ( TaskId, PreviousFinishedAt ) REFERENCES dbo.TasksByDay ( TaskId, FinishedAt ) , CONSTRAINT CHK_TasksByDay_NoGaps CHECK ( PreviousFinishedAt = StartedAt ) ) ; GO |
This solution is not for all situations
As we have seen, to speed up a query we have added columns and constraints. After the change, the data uses up more storage, and the modifications are slower and more difficult.
In other words, the price tag for optimizing a query is quite hefty.
However, in some cases it is worth it. Depending on business requirements, some queries may have to return fast. Still, when we decide whether to use expensive solutions like this one, we need to carefully considered whether it is worth it.
Security Video vs. Blue-Ray Disc
The ways in which data is stored and consumed may be very different.
Let us consider security video scenario: the data must be written real time, and its storage must be cheap. Most of the time the data is disposed of without ever reading it even once. The quality of data must be acceptable, but it does not have to be perfect.
Blue-Ray Disc scenario is on the opposite extreme of the spectrum. The data must be of very high quality, and as such it is very expensive. It can take a lot of time to write it, to improve data quality, but reading must be fast and convenient. In many cases the data is read millions of times, which pays for the expensive and slow writing of it.
Most real life situation are somewhere in between these two extremes. When we decide whether to use an expensive solution like the one described in this article, we need to consider the actual business requirements. More specifically, we need to know if the business:
- Needs 100% correctness of query results.
- Needs fast response all the time
If the answer to both questions is yes, than the solution described in this article fits the bill and is recommended to use. Otherwise we can just use the original query and tolerate its slowness, or not use constraints and tolerate the possibility of incorrect data. Anyway, no one solution is best under all circumstances, which is usual.
Load comments