In parts 11 and 12 of this series I described the “rowsource execution statistics” that we can collect as Oracle runs a query, then described a strategy for generating and accessing these statistics in a way that was particularly convenient if you could use your own session to run the SQL you wanted to analyze. In this article we’re going to look at a strategy for generating and accessing the statistics when it’s not your session that’s running the SQL.
Although you can enable rowsource execution statistics at the system level rather than the session level, it’s not really desirable to do so for two reasons. First, the basic overheads – every single execution of every single statement run by every single session would be doing work to collect the statistics; secondly, aggregation would hide interesting detail – each child cursor would accumulate the statistics for every execution so you wouldn’t be able to see if there were any special cases that might matter.
To get detail and precision we need to target something smaller, like a single statement, or a single session, and log every step of every execution; and we can do this with “extended SQL trace” in one of its many forms.
There are two popular strategies for enabling extended tracing – by session or by SQL_ID, the latter becoming possible only in the more recent versions of Oracle. There are numerous articles around the internet describing how to enable session tracing so I’m not going to go into detail about all the possibilities, I’ll just summarize one method which is probably the most approved method; I’ll also restrict myself to one simple example of how to enable statement tracing. For one of the most up-to-date notes on what SQL trace can do, you should bookmark Christian Antognini’s note on the full list of levels of the 10046 trace, and Tanel Poder’s note on many more options that you can use with tracing.
Historically you could either start tracing your own session with a call to “alter session”, or you could start tracing in another session with a call to oradebug; then Oracle introduced some tracing calls in the packages dbms_system, and dbms_session; but the “officially approved” mechanism is now through the dbms_monitior package (although, as you will see in Christian’s blog note, the package still doesn’t quite cover the full range of possible 10046 levels).
begin dbms_monitor.session_trace_enable( session_id => 99, serial_num => 23154, waits => true, binds => true, plan_stat => 'all_executions' ); end; / -- allow some time to pass begin dbms_monitor.session_trace_disable( session_id => 99, serial_num => 23154 ); end; /
The values for session_id and serial_num come from v$session.sid and v$session.serial# for an active session. Between the calls to enable and disable tracing all statements executed by the session will be traced. In my sample code above I’ve asked for all wait states and values of bind variables to be dumped into the trace file as well as the execution plan with rowsource execution stats; and I’ve asked for the plan and stats for every single execution of every statement (by default the plan is dumped only on the first execution after tracing starts). I’ll postpone talking about the content of the trace file until after the example of enabling tracing for a statement.
Although session-level tracing can be very helpful – especially when testing – what do you do when you can see a frequently executed statement in a production AWR report that seems to be using a lot of resources, but always seems to be very efficient whenever you test it from your own session? Somewhere, somehow, sometimes, it must be doing something expensive, but you can never see it happening. You could try enabling SQL trace on a session that you know executes this statement – but (a) you could still be unlucky and find that that session doesn’t have any problems with the statement and (b) you could be tracing a lot of stuff you’re not interested in at the same time.
The solution is in the new 11g syntax for enabling tracing. (For more detail on the power and range of the 11g options, a good starting point is Tanel Poder’s note about oradebug.) Here’s an example of a call I used recently on a client system demonstrating one use of the method:
alter system set events 'sql_trace[SQL:1104btkdvqx5k] plan_stat=all_executions,wait=true,bind=true' ; -- waited 60 seconds alter system set events 'sql_trace[SQL:1104btkdvqx5k] off';
In the short wait time between enabling and disabling tracing every execution of the statement with SQL_ID = ‘1104btkdvqx5k’ made the session that was executing it dump the bind variables, plan, and wait states for that execution to disc in the session’s trace file.
The reason I had done this on the client site – a production system – was that the performance of various jobs run by the users showed extreme variation in run-time, particularly when the system was under heavy load. This SQL_ID identified a critical, frequently executed, statement that was one of the most important in the system, and by enabling tracing for this brief interval I got lucky and ended up with trace files from 9 separate sessions which, after processing through tkprof, gave us the results from 16 executions of the statement.
The statement was an anonymous PL/SQL block, and the effect of enabling tracing on the block was that every SQL statement called by the block was also traced. This turned out to be tremendously helpful as it showed us two things – first that a large fraction of the performance problem related to excessive calls (a couple of thousand) to some security predicates being used for row-level security (a problem that we had already identified by other means and were able to address); secondly that most of the remaining time (and variation) was caused by literally hundreds of calls to the following SQL statement:
SELECT TO_CHAR(TO_DATE(:B3 ,:B2 ,'NLS_CALENDAR=''GREGORIAN'''),:B1 )FROM DUAL
Since I had enabled bind variable tracing I went back to the raw trace files and found that the query was always called with one of two possible date values (either 01-01-1000, or 31-12-4172), and checking the output from tkprof I found an SQL statement with the following text in its where clause:
DATE_COL BETWEEN TO_DATE(UTILS.NORMALIZE_DATE('01-01-1000', 'DD-MM-YYYY'), UTILS.GET_NLS_DATE_FORMAT)) AND TO_DATE(UTILS.NORMALIZE_DATE('31-12-4712', 'DD-MM-YYYY'), UTILS.GET_NLS_DATE_FORMAT))
Predicates using PL/SQL functions that call SQL are dangerous things – but it’s often easy to minimize the number of times the SQL is called, typically by declaring the function to be deterministic, or by embedding the call into a scalar subquery viz: (select function_call() from dual). A couple of minutes of highly targeted tracing made it very easy to find a couple of highly relevant issues that could be fixed fairly easily.
Examining the output
Whether you enable session tracing or statement tracing, the way you then use the resulting trace files is the same – first you use tkprof to summarize the content of the trace files, possibly after using the trcsess utility to extract and combine data from several trace file; then you look for clues about where the most time was spent or most work was done or the greatest variability appeared; then (optionally) you go back to the trace files to extract and examine the detail of the issue you’ve identified.
Here’s an example of the tkprof output that you might get from a single session after enabling the system-level tracing of a single statement in 18.104.22.168:
select id from t1 where access_code = :b1 and filter_code = :b2 call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 0 0.00 0.00 0 0 0 0 Execute 6 0.00 0.00 0 0 0 0 Fetch 10 0.01 0.01 48 1710 0 6 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 16 0.01 0.02 48 1710 0 6 Misses in library cache during parse: 0 Misses in library cache during execute: 1 Optimizer mode: ALL_ROWS Parsing user id: 62 Number of plan statistics captured: 6 Rows (1st) Rows (avg) Rows (max) Row Source Operation ---------- ---------- ---------- --------------------------------------------------- 2 1 2 TABLE ACCESS BY INDEX ROWID T1 (cr=285 pr=8 pw=0 time=3246 us cost=6 size=12 card=1) 5 3336 10000 INDEX RANGE SCAN T1_I1 (cr=9 pr=8 pw=0 time=9056 us cost=1 size=0 card=5)(object id 188624) Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ SQL*Net message to client 8 0.00 0.00 db file sequential read 2 0.00 0.00 SQL*Net message from client 4 0.00 0.00 db file scattered read 3 0.00 0.00
One of the nice things that 11g introduced was the “three column” summary of rows supplied by each rowsource – from the 1st execution, the average of all executions captured, and the maximum across all executions. (Note that each value in the “Max” column could have come from a different set of stats).
In this case we can pick up three details very easily:
- From the summary table we had 6 executions
- From the following text notes we have captured 6 sets of plan statistics
- From the “Rows” details in the one plan shown we see a massive variation in rows produced
Closer inspection then shows us that the rowsource execution stats in the plan (particularly the cr= and pr= figures) are the averages across the 6 executions – a fact we deduce by dividing the disk and query columns in the Fetch line of the tabular data by 6. (It’s unfortunate – though irrelevant in this case – that the rowsource lines don’t show us the current (cu) block gets; this would be a useful reminder of the cost of updating indexes when we insert, update, or delete rows in a table).
Given the wide variation in stats, we should be prompted to look at the trace file itself. First we look for the statement so that we can find the cursor number that we need to track (#140131987865592 in this case). Be aware that the cursor could close allowing the cursor number to be re-used later in the trace file.
PARSING IN CURSOR #140131987865592 len=64 dep=0 uid=62 oct=3 lid=62 tim=1424880601505379 hv=3683358447 ad='74fef4d0' sqlid='a5u6333dsr5rg' select id from t1 where access_code = :b1 and filter_code = :b2 END OF STMT
Then we can look for the “Bind values” and “Execution stats” for each execution. Here’s the first one (with some lines deleted for simplicity):
BINDS #140131987865592: Bind#0 value=1216 Bind#1 value=5 ... STAT #140131987865592 id=1 cnt=2 pid=0 pos=1 obj=188623 op='TABLE ACCESS BY INDEX ROWID T1 (cr=8 pr=1 pw=0 time=405 us cost=6 size=12 card=1)' STAT #140131987865592 id=2 cnt=5 pid=1 pos=1 obj=188624 op='INDEX RANGE SCAN T1_I1 (cr=3 pr=1 pw=0 time=350 us cost=1 size=0 card=5)'
Although it’s cosmetically inelegant we can see that the first execution was fairly precise and efficient. Here’s a slightly cleaned version of the STAT lines – notice that the trace file had “id=” and “pid=” entries on each line to give us the parent/child indentation that we normally expect to see:
cnt=2 TABLE ACCESS BY INDEX ROWID T1 (cr=8 pr=1 pw=0 time=405 us cost=6 size=12 card=1) obj=188623 cnt=5 INDEX RANGE SCAN T1_I1 (cr=3 pr=1 pw=0 time=350 us cost=1 size=0 card=5) obj=188624
We can now track through the rest of the file checking for the other 5 sets of bind variables and rowsource execution stats to see if we can spot any reason why some of the executions were very expensive. My data set actually had an extreme value which introduced a problem – here are the details from one of the (two) bad executions:
BINDS #140131988046400: Bind#0 value=-1 Bind#1 value=-1 … STAT #140131988046400 id=1 cnt=0 pid=0 pos=1 obj=188623 op='TABLE ACCESS BY INDEX ROWID T1 (cr=841 pr=30 pw=0 time=9617 us cost=6 size=12 card=1)' STAT #140131988046400 id=2 cnt=10000 pid=1 pos=1 obj=188624 op='INDEX RANGE SCAN T1_I1 (cr=22 pr=30 pw=0 time=27520 us cost=1 size=0 card=5)'
Picking your way through the details in the STAT lines you can see that the index range scan took 22 consistent gets (and 30 physical reads!) to find 10,000 rowids; then the table access took a further 841 consistent gets (and, presumably, a lot of “buffer is pinned count”) to find no matching rows.
My data set has an access_code of -1 that identifies a lot of rows, but a filter_code of -1 that discards it. It looks as if I need to take some defensive action (e.g. adding the filter_code to the index on access_code) to minimize the impact of that extreme case. (In passing, this example is a model of a client issue, and I used exactly this method to identify the problem – I wasn’t allowed to look at the data, but I was allowed to see some trace files I’d asked for.)
From 11g (particularly) it is very easy to target sessions and statements for precise tracing; and the tkprof output from a trace file can easily point you in the direction of the most interesting SQL statements – the 1st/avg/max columns for rows returned give you a very nice indicator of data skew.
Although the trace file itself can be a little difficult to read, if you’ve got a specific statement that the tkprof output has highlighted then it’s fairly straightforward to pick out the interesting lines from the trace file. I typically end up using vi and searching for the cursor number, but if you’re proficient with Unix utilities it’s easy to use awk to get the bind values and grep to get the rowsource execution stat lines.