Analytic functions have been part of Oracle for a very long time now – ever since 8i back in 1999. Analytic functions are an ANSI/ISO standard, and so you’ll find that they are similarly-implemented across a number of compliant databases. (This SQL Server article on “window functions” from sister site, Simple Talk, could very well have been talking about Oracle.)
Analytic functions extend the power of SQL, enabling you carry out rather complex analysis of your data alongside your plain queries. There is nothing that analytic functions do that you could not achieve with complex self-joins and subqueries or PL/SQL. But you’ll find that often with a single line of an analytic function you can save yourself a hundred lines of code and a migraine.
But I need to be clearer. And a good place to start is with aggregate functions. Aggregate functions – like MAX, AVG, COUNT, SUM and the ilk – are great for presenting a summary view of your data in a single row or, if paired with a GROUP BY clause, in a relatively small number of records. But if all we had was aggregate functions we would constantly be forced to make a choice: view the raw data, or view the aggregated summaries?
Analytic functions, on the other hand, allow us to have our cake and eat it too. Using analytic functions we can, over multiple rows, view our data and pair it with a running analysis. To use a sports analogy, the difference between analytic functions and aggregate functions is the difference between watching the game or simply reading the match report.
But that’s enough about eating cake and watching football; I should give you an example.
Here’s an ordinary, everyday aggregate function that tells us the average salary at our small company:
1 2 |
SELECT AVG(sal) FROM emp; |
AVG(SAL) |
2073.21 |
However, if we asked an aggregate function to show us how the average salary has changed over the years with each new member of staff it would probably burst into tears and run off to suck its thumb in the corner. We need an analytic function for that.
1 2 3 4 |
SELECT ename, job, hiredate, sal, AVG(sal) OVER (ORDER BY hiredate) AS running_average FROM emp ORDER BY hiredate; |
Name | JOB | HIREDATE | SALARY | RUNNING_AVERAGE |
SMITH | CLERK | 12/17/1980 | 800 | 800 |
ALLEN | SALESMAN | 2/20/1981 | 1600 | 1200 |
WARD | SALESMAN | 2/22/1981 | 1250 | 1216.67 |
JONES | MANAGER | 4/2/1981 | 2975 | 1656.25 |
BLAKE | MANAGER | 5/1/1981 | 2850 | 1895 |
CLARK | MANAGER | 6/9/1981 | 2450 | 1987.5 |
TURNER | SALESMAN | 9/8/1981 | 1500 | 1917.86 |
MARTIN | SALESMAN | 9/28/1981 | 1250 | 1834.38 |
KING | PRESIDENT | 11/17/1981 | 5000 | 2186.11 |
FORD | ANALYST | 12/3/1981 | 3000 | 2147.73 |
JAMES | CLERK | 12/3/1981 | 950 | 2147.73 |
MILLER | CLERK | 1/23/1982 | 1300 | 2077.08 |
SCOTT | ANALYST | 12/9/1982 | 3000 | 2148.08 |
ADAMS | CLERK | 1/12/1983 | 1100 | 2073.21 |
Let’s dissect our analytic function and see what it’s made of. Here’s the general syntax:
1 |
FUNCTION(args) OVER ([PARTITION BY <…>] [ORDER BY <…>][window_clause]) |
The OVER keyword is mandatory; however, the other ingredients of the analytic function soup – PARTITION BY, ORDER BY, and the window clause – are powerful but (in many cases) optional. They tell the analytic function how to slice and dice the results. If, in our query above, we left the ORDER BY out, we’d get a very different result.
1 2 3 4 |
SELECT ename, job, hiredate, sal, AVG(sal) OVER () "average" FROM emp ORDER BY hiredate; |
Name | JOB | HIREDATE | SALARY | AVERAGE |
SMITH | CLERK | 12/17/1980 | 800 | 2073.21 |
ALLEN | SALESMAN | 2/20/1981 | 1600 | 2073.21 |
WARD | SALESMAN | 2/22/1981 | 1250 | 2073.21 |
JONES | MANAGER | 4/2/1981 | 2975 | 2073.21 |
BLAKE | MANAGER | 5/1/1981 | 2850 | 2073.21 |
CLARK | MANAGER | 6/9/1981 | 2450 | 2073.21 |
TURNER | SALESMAN | 9/8/1981 | 1500 | 2073.21 |
MARTIN | SALESMAN | 9/28/1981 | 1250 | 2073.21 |
KING | PRESIDENT | 11/17/1981 | 5000 | 2073.21 |
FORD | ANALYST | 12/3/1981 | 3000 | 2073.21 |
JAMES | CLERK | 12/3/1981 | 950 | 2073.21 |
MILLER | CLERK | 1/23/1982 | 1300 | 2073.21 |
SCOTT | ANALYST | 12/9/1982 | 3000 | 2073.21 |
ADAMS | CLERK | 1/12/1983 | 1100 | 2073.21 |
Rather than give us the running average over time that we’re looking for, it has returned the current average against each row. And while that might be exactly what you want in some circumstances, it does illustrate the importance of those three options. So let’s take a look at them.
ORDER BY
You will have noticed that we had 2 order by statements in our query. The order by used by the analytic function is completely separate from that of the main query, and can be slightly – or totally – different.
The syntax is, as you may expect:
1 |
ORDER BY <expression> [ASC|DESC] NULLS [FIRST|LAST] |
As I suggested earlier, the analytical ORDER BY clause is often optional, but not always so. For certain functions it is crucial. Take the LEAD and LAG functions, for example. Unlike AVG from our previous examples, these functions aren’t aggregate functions moonlighting as analytic functions. They are purely analytic functions, and cannot be used in any other context.
The LEAD function returns the value based on a row at an offset from the current row. Or, in plainspeak, it enables you return a value based on a record one (or more) rows after the current one.
The LAG function peers in the opposite direction from the LEAD function. And since their very existence is based on the values of other rows, the ORDER BY clause is mandatory when using them.
If, for example, you wish to view not only staff salaries, but also the salary of the person hired after them.
1 2 3 4 |
SELECT ename, job, hiredate, sal, LEAD (sal) OVER (order by hiredate ) "Next Salary" FROM emp ORDER BY hiredate; |
ENAME | JOB | HIREDATE | SAL | Next Salary |
SMITH | CLERK | 12/17/1980 | 800 | 1600 |
ALLEN | SALESMAN | 02/20/1981 | 1600 | 1250 |
WARD | SALESMAN | 02/22/1981 | 1250 | 2975 |
JONES | MANAGER | 04/02/1981 | 2975 | 2850 |
BLAKE | MANAGER | 05/01/1981 | 2850 | 2450 |
Other functions for which the order by clause is mandatory include: FIRST, FIRST VALUE, LAST, LAST VALUE, ROW_NUMBER, RANK and DENSE_RANK. We’ll talk about them in the next in this series of articles.
PARTITION BY
Aaargh, I’ve just realised that I’m doing this all wrong, in totally the wrong order. That’s because, even though the ORDER BY and the PARTITION BY clauses are (often) optional, if they are used together the partition by must come before the order by.
Someone must have bought a new thesaurus on the day they named the partition by clause. Because, more or less, it’s just a way to group the data used by the analytic function, to carve up the recordset.
If, for example, you weren’t interested in the average salary across the company, but wanted to compare each employee’s salary to the average salary within her department, then you’ll have call to partition the data used by your analytic function.
1 2 |
SELECT ename, job, hiredate, sal, deptno, AVG(sal) OVER (PARTITION BY deptno) AS "Dept Average" FROM emp; |
ENAME | JOB | HIREDATE | SAL | DEPTNO | Dept Average |
CLARK | MANAGER | 06/09/1981 | 2450 | 10 | 2916.67 |
MILLER | CLERK | 01/23/1982 | 1300 | 10 | 2916.67 |
KING | PRESIDENT | 11/17/1981 | 5000 | 10 | 2916.67 |
FORD | ANALYST | 12/03/1981 | 3000 | 20 | 2175 |
SCOTT | ANALYST | 12/09/1982 | 3000 | 20 | 2175 |
JONES | MANAGER | 04/02/1981 | 2975 | 20 | 2175 |
SMITH | CLERK | 12/17/1980 | 800 | 20 | 2175 |
ADAMS | CLERK | 01/12/1983 | 1100 | 20 | 2175 |
However, we can slice and dice the data even further by combining the partition by clause with the order by clause. And so, to find out how the average salary changed by department with each new hire, we’d need a query along these lines:
1 2 3 4 |
SELECT ename, job, hiredate, sal, deptno, AVG(sal) OVER (PARTITION BY deptno ORDER BY hiredate) AS "Dept Average" FROM emp ORDER BY hiredate; |
ENAME | JOB | HIREDATE | SAL | DEPTNO | Dept Average |
CLARK | MANAGER | 06/09/1981 | 2450 | 10 | 2450 |
KING | PRESIDENT | 11/17/1981 | 5000 | 10 | 3725 |
MILLER | CLERK | 01/23/1982 | 1300 | 10 | 2916.67 |
SMITH | CLERK | 12/17/1980 | 800 | 20 | 800 |
JONES | MANAGER | 04/02/1981 | 2975 | 20 | 1887.5 |
FORD | ANALYST | 12/03/1981 | 3000 | 20 | 2258.33 |
SCOTT | ANALYST | 12/09/1982 | 3000 | 20 | 2443.75 |
ADAMS | CLERK | 01/12/1983 | 1100 | 20 | 2175 |
A second example might make the power of the partition by-order by pairing more obvious. If, for example, you wanted to list all the employees in the company alongside the name of the person hired in their department right before them, you might use the following:
1 2 3 4 |
SELECT ename, hiredate, deptno, LAG (ename) OVER (PARTITION BY deptno ORDER BY hiredate) AS "Previous Hire" FROM emp ORDER BY deptno, hiredate; |
ENAME | HIREDATE | DEPTNO | Previous Hire |
CLARK | 06/09/1981 | 10 | |
KING | 11/17/1981 | 10 | CLARK |
MILLER | 01/23/1982 | 10 | KING |
SMITH | 12/17/1980 | 20 | |
JONES | 04/02/1981 | 20 | SMITH |
FORD | 12/03/1981 | 20 | JONES |
SCOTT | 12/09/1982 | 20 | FORD |
ADAMS | 01/12/1983 | 20 | SCOTT |
Windowing
Windowing is a way to further sub-partition the recordset operated upon by an analytic function. Look at it this way: without the partition by clause, we now know that the analytic function will be based upon the complete recordset returned by our query; with the partition by clause, it is only based on a subset of that data. Well, with the windowing clause we can tell it to only look at a subset of that subset.
I know what I’ll do. I’ll give you an example that shows what the windowing clause does. And then afterwards we can talk a little more about its syntax and rules.
Let’s say we wanted to know the sum of each employee’s salary and those of everyone hired within his department as long as they were hired no more than 60 days earlier than him. I don’t know why we’d want to know that, but let’s pretend that we do. ‘Cos that’s where windowing comes in.
1 2 3 4 5 6 7 |
SELECT ename, hiredate, sal, deptno, SUM(sal) OVER () AS "Total Salary", SUM(sal) OVER (PARTITION BY deptno) AS "Total salary by dept", SUM(sal) OVER (PARTITION BY deptno ORDER BY hiredate) AS "Total dept salary by date", SUM(sal) OVER (PARTITION BY deptno ORDER BY hiredate RANGE 60 PRECEDING) AS "Sum dept salary over 60 days" FROM emp ORDER BY deptno, hiredate; |
ENAME | HIREDATE | SAL | DEPTNO | Total Salary | Total salary by dept | Total dept salary by date | Sum dept salary over 60 days |
CLARK | 06/09/1981 | 2450 | 10 | 29025 | 8750 | 2450 | 2450 |
KING | 11/17/1981 | 5000 | 10 | 29025 | 8750 | 7450 | 5000 |
MILLER | 01/23/1982 | 1300 | 10 | 29025 | 8750 | 8750 | 1300 |
SMITH | 12/17/1980 | 800 | 20 | 29025 | 10875 | 800 | 800 |
JONES | 04/02/1981 | 2975 | 20 | 29025 | 10875 | 3775 | 2975 |
FORD | 12/03/1981 | 3000 | 20 | 29025 | 10875 | 6775 | 3000 |
SCOTT | 12/09/1982 | 3000 | 20 | 29025 | 10875 | 9775 | 3000 |
ADAMS | 01/12/1983 | 1100 | 20 | 29025 | 10875 | 10875 | 4100 |
ALLEN | 02/20/1981 | 1600 | 30 | 29025 | 9400 | 1600 | 1600 |
WARD | 02/22/1981 | 1250 | 30 | 29025 | 9400 | 2850 | 2850 |
BLAKE | 05/01/1981 | 2850 | 30 | 29025 | 9400 | 5700 | 2850 |
TURNER | 09/08/1981 | 1500 | 30 | 29025 | 9400 | 7200 | 1500 |
MARTIN | 09/28/1981 | 1250 | 30 | 29025 | 9400 | 8450 | 2750 |
JAMES | 12/03/1981 | 950 | 30 | 29025 | 9400 | 9400 | 950 |
Windows are dynamic in nature, and are always relative to the current row. In our example above, the window is always 60 days prior to the current hire date.
However, RANGE … PRECEDING is not the only way we can employ windows. Indeed the available syntax options can seem as jumbled as a tangle of spaghetti as this image from the Oracle Docs site shows.
However, here’s what you need to know.
- Your window can be based upon ROWS or a RANGE. If your window is based on ROWS, this means that it is determined by the rows in your query’s partition; RANGE means it is based on a logical offset. In our example, the range was 60 days.
- UNBOUNDED PRECEDING says your window starts with the very first row of your partition. UNBOUNDED FOLLOWING says it goes all the way to the end of the partition.
- CURRENT ROW means … well, it’s pretty obvious, right? I won’t insult your intelligence by telling you.
Having said all of that, what’s actually interesting is the value_expr. Outside of UNBOUNDED PRECEDING|FOLLOWING and CURRENT ROW, this is the main way we express the size of our window.
If your window is based on ROWS, the value_expr must be a positive number; for RANGE, it can also be an interval literal. When using a RANGE, as in our example, you can only have one expression in your ORDER BY as your value_expr will reference it. (In our example, RANGE 60 PRECEDING knew our window was 60 days wide because our partition was ordered by a date, hiredate.)
Restrictions
There are a few things that analytic functions cannot do.
- Analytic functions are computed after the main query’s FROM, WHERE and GROUP BY clauses. This means, regrettably, that they cannot be part of your WHERE clause, and must live in the SELECT. Obviously, there is nothing to stop you from plugging them into your WHERE clause as part of a subquery or a WITH statement, but that’s the only way around this restriction.
- You cannot nest analytic functions within another analytic function. So stop trying to be too clever.
- You cannot use column aliases or column positions in an analytic function’s order by clause.
Conclusion
Could you survive without analytic functions? Yes, totally. It’s the same way you could survive without chocolate or wine. You could – but why would you want to?
As I said at the start of this article, you could achieve every result that an analytic function will give you with the clever use of self-joins and subqueries. But it’ll be cumbersome, difficult and inefficient.
In the next part of this series, we will talk about the various analytic functions and how to use them.
Load comments