Which of these things is not like the other?

Have you ever played the game where you try to spot the differences between two side-by-side, seemingly identical pictures? Sometimes answering the question “why does my query perform well in development, but it drags in test and production?” is the same sort of game: you must identify where your environments differ, whether obviously or subtly, and then weigh those differences to determine the root cause or causes of the performance problem. After all, computers are deterministic machines (no matter how much we sometimes refuse to believe it!). Given the same inputs, the same conditions, the same resources, the same algorithm, and so on, we should expect the same results in the same amount of time, or at least something comparable, given that it is impossible to control for every variable. If two machines yield significantly different results, it must be because of some difference, and that difference is our quarry!

Recently I was working on a query to take a self-related hierarchical table and work out, from each record, all its ancestor records. This query involved both a cartesian join and a start with…connect by clause:

```
select     B1.BAR_ID BASE_BAR_ID,
           B2.BAR_ID,
           B2.SHORT_NAME,
           B2.DESCRIPTION,
           B2.SOME_TEXT_DATA,
           B2.PARENT_BAR_ID,
           B2.SOME_MORE_TEXT_DATA
from       FOO.BAR B1 cross join FOO.BAR B2
start with B1.BAR_ID = B2.BAR_ID
connect by B1.BAR_ID = prior B1.BAR_ID
and        B2.BAR_ID = prior B2.PARENT_BAR_ID;
```

I know what you’re thinking: “are you mad?” Before having me committed, however, consider the following. The table in question has approximately 700 records, so the Cartesian product yields at most 490,000 narrow records. Any given starting point has at most 4 or 5 ancestors, and most records have fewer. The final result set ends up being approximately 2,400 records. Finally, this table is not modified by application users; only a handful of records are added with any given release of the application, so between releases the query’s average performance can be measured and should remain consistent.

In development, the query ran in less than a second. Rather pleased that I had devised such a clever, elegant query, I copied and pasted it into a session on the test database, kicked it off, and waited… and waited… and waited… 45 seconds to get a result set identical to what I had in development. Since development and test had been synchronized only a few days before, I had the same results, but I didn’t expect such wildly divergent performance. Time to look for the differences!

The first place I looked was at the data to verify that it was identical in both environments–after all, I’m only one developer on the team, and one of my teammates could have modified the data unbeknownst to me. The data was exactly the same, column for column, row for row, so that wasn’t the problem.
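Eyeballing the data doesn’t scale; a pair of MINUS queries over a database link makes the comparison mechanical. This is a sketch: DEV_LINK is a hypothetical link name, and assumes your DBA has a link to the development database set up.

```
-- Rows in this database's copy of FOO.BAR that are missing from development
-- (DEV_LINK is a hypothetical database link name):
select * from FOO.BAR
minus
select * from FOO.BAR@DEV_LINK;

-- ...and rows in development that are missing here:
select * from FOO.BAR@DEV_LINK
minus
select * from FOO.BAR;
```

If both queries return zero rows, the two tables match row for row, column for column.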

My next step was to look at statistics and do an explain plan in both environments. Statistics were up to date in both databases, and explain plan yielded exactly identical execution plans, step by step, cost by cost, for each environment. I also ruled out the possibility that one database had fragmentation in its table due to inserts and deletes; both tables were packed as tightly as they would go, with no superfluous blocks.
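For anyone who wants to repeat the plan comparison, EXPLAIN PLAN plus DBMS_XPLAN captures and displays the plan in each environment. The statement below is abbreviated to a few columns for brevity:

```
explain plan for
select     B1.BAR_ID BASE_BAR_ID, B2.BAR_ID, B2.SHORT_NAME
from       FOO.BAR B1 cross join FOO.BAR B2
start with B1.BAR_ID = B2.BAR_ID
connect by B1.BAR_ID = prior B1.BAR_ID
and        B2.BAR_ID = prior B2.PARENT_BAR_ID;

-- Display the plan just captured, including its cost and cardinality estimates:
select * from table(dbms_xplan.display);
```

Running this in each environment and diffing the output, step by step and cost by cost, is exactly the comparison described above.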

Maybe there’s something wrong with the test server itself; maybe the query will run okay on the production server. The DBA and I ran the query on the production box, and got results in 3 minutes, 30 seconds. The faster server, with more memory and disk, yielded worse performance! The production server had data identical to the data in development and test, its statistics were up to date, and its explain plan for my query was identical, step for step, to the plan development and test were using.

So, if the data is the same, statistics are the same, plans are the same, what would be a logical next place to look?

In my case, it was the database parameter files. I had discovered long ago (and subsequently forgotten) that, for whatever reason, the development, test, and production versions of this particular database aren’t configured identically. The next step was to query the V$PARAMETER view and look for differences.

Among the three databases, there were at least a dozen settings which were different, but one parameter stood out immediately: WORKAREA_SIZE_POLICY. When set to AUTO, the database will work out an appropriate amount of memory to use for various operations like sorting and hashing; when set to MANUAL, the database uses other parameters like SORT_AREA_SIZE to determine whether to sort a result set in memory or flip it out to disk and perform the sort there. According to Oracle, “Specifying MANUAL may result in sub-optimal performance and poor PGA memory utilization.” Indeed!
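A query along these lines against V$PARAMETER, run in each environment, surfaces the settings in question (the parameter list here is trimmed to the ones that mattered in this story):

```
-- Compare the work area sizing configuration across environments:
select   name, value, isdefault
from     v$parameter
where    name in ('workarea_size_policy',
                  'sort_area_size',
                  'pga_aggregate_target')
order by name;
```

The ISDEFAULT column is a handy bonus: it flags whether a value was explicitly set or merely inherited from Oracle’s defaults.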

In my case, the development database was configured to use automatic work area sizing; test and production, however, were configured to use manual work area sizing, and SORT_AREA_SIZE was set to Oracle’s default value of 64K. The execution plan was estimating a result set twice that size. Eureka! I bet my result set is going out to disk for sorting instead of staying in memory! I’ve found a likely candidate!

Armed with my list of parameter file differences and ready to make my case for adjusting database parameters, I met with the DBA the next morning to test my theory. He adjusted the WORKAREA_SIZE_POLICY parameter (along with the appropriate PGA target parameters) in the test system and we ran the query. Less than 1 second for the entire result set to be returned! This was precisely the problem on the production server as well; once these parameters were appropriately modified on the production server, the query performed just as well there, too. Though I don’t have the “before” metrics to prove it, I’m sure the application users also benefited from that modification!
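The fix itself amounts to a couple of ALTER SYSTEM statements. The PGA target value below is purely illustrative (sizing it properly is the DBA’s call), and SCOPE = BOTH assumes the instance uses an spfile:

```
-- Switch to automatic work area sizing; the PGA target shown is
-- illustrative only -- size it for your own system:
alter system set workarea_size_policy = auto scope = both;
alter system set pga_aggregate_target = 512M scope = both;
```

With WORKAREA_SIZE_POLICY set to AUTO, SORT_AREA_SIZE is ignored and the database sizes sort and hash work areas itself, within the PGA target.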

Finding the problem required a systematic approach: identifying points of similarity and difference between data, statistics, and database parameter settings. Usually query performance differences are the result of different execution plans, which in turn are usually caused by differences in data cardinalities, staleness (or even non-existence!) of statistics, density of data in table blocks, and the like. In cases like this one, however, once you’ve eliminated differences in data and execution plans, you may need to enlist your DBA to help you work through database settings. There may even be cases where you’ll need to ask your systems support folks to look at the physical machine (it’s entirely possible that a fragmented, nearly full hard drive or a memory card on its last legs is your culprit!). Parameter settings and hardware problems are far less common culprits than data differences or workload, but don’t forget to investigate them once you’ve ruled out the most likely possibilities. Think through your problem logically and methodically, and sooner or later you’ll arrive at your own eureka!