Optimizing my client and server with ANTS Performance Profiler

It might seem a hopeless task to reduce the time an application takes to complete a job from three minutes to a second. However, after getting hold of a tool that showed him precisely where the time was going, Jonathan Evans gained a renewed fascination with performance that led him to keep shaving what he terms the "banana seconds" off the execution time until he was down to that last second.

I recently found that an application I had been working on was running several orders of magnitude too slowly. As you can imagine, this was an absolute show stopper – we needed to be able to display results in around 1 second, and instead it was taking three or four minutes. I wasn’t totally despondent, as I hadn’t attempted any real optimisation yet and thought there would be quite a bit of scope for improvement – but if we couldn’t sort it out, I was anticipating career-limiting consequences.

The application was a C# server and a command-line client. The server wrapped up a MATLAB DLL (created using MATLAB Builder NE) that did all of the heavy lifting and which, as far as we were concerned, could not be changed. On the client side, an XML document was being used to define the processing to be performed on the server, and the results were being captured in an in-memory SQLite database before being manipulated and written out to disc.

A First Look

Using some fairly crude methods in the code, I could see that both the client and server were taking about the same length of time to do their work, and the network time was insignificant. So, although there were lots of opportunities to optimise things on both the client and the server, the first task was to find out where to start, and for that I’d need a profiler.

A quick look around showed that Red Gate’s ANTS Performance Profiler had received good reviews and, more importantly, was within my fairly limited budget. My biggest concern was that the processing time on the server was going to come mostly from the MATLAB code, and there would be little or nothing that I could do about it. To get this potential horror out of the way as soon as possible, I started by profiling that.

Speeding up the Server

Some experimenting with ANTS allowed me to connect to the server and the client, and then I used the line-by-line option so I could look at timings for individual lines of code. The first run had lots of overheads that had to be discounted, so for testing I connected and ran a single job against the server, then waited a few seconds and ran a batch of 25 jobs. This gave me some nice clear results, and I was able to select just the range of data I was interested in (some tools can only show you the totals for the whole run).

The results weren’t immediately encouraging – I listed all of the results (all procedures, not just my code) and sorted by % of time spent in code, and the top 75% all came from MATLAB – the bit I couldn’t change. “But hold on,” I told myself, “most of it seemed to be to do with array manipulation and disposal”. Using the tree navigation thingy (officially known as the Call Graph), I was able to see that the code causing trouble was in fact being called by my code, and I could drill down and see that most of the time was actually being spent marshalling data between the MATLAB runtime and the .NET wrapper.

Figure 1. Examining MATLAB.dll running on the server.

A bit of searching around led me to the conclusion that I could get all of the data out of MATLAB using just a single call that created a complex structure of nested arrays. I spent a bit of time coding this (it’s not good for code readability, but using constants to identify the fields in the arrays helps mitigate that) and tested again, with fantastic results – immediately, I had chopped off about 50% of the processing time.
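To give a flavour of the pattern, here is a minimal sketch; the component name, method, field layout and types below are purely illustrative (the real wrapper returned MATLAB array types specific to our component), but the idea of one big call plus named index constants is the same:

```csharp
// Illustrative only: named constants document which slot in the nested
// result arrays holds which piece of data, which mitigates the readability
// cost of passing everything back in one structure.
const int FieldTimestamps = 0;
const int FieldValues     = 1;
const int FieldErrorCodes = 2;

// One call that returns everything as a nested structure of arrays,
// rather than many small calls that each pay the marshalling cost.
object[] resultSet = (object[])matlabComponent.GetAllResults(jobId);

double[] timestamps = (double[])resultSet[FieldTimestamps];
double[,] values    = (double[,])resultSet[FieldValues];
int[]    errorCodes = (int[])resultSet[FieldErrorCodes];
```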

However, all of the dispose code was still being called, and I still wasn’t sure what I could do about that – I hit Google again, but without luck this time. Then I had a brainwave – I could send the results back to the client first, and then do all of the cleanup work. I changed things around and saw the benefit immediately. It’s worth noting at this point that, although the results were being sent back to the client sooner, the dispose code itself was essentially unchanged, and things still looked the same in ANTS – all the work still had to be done, just in a different order.
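In outline, the change was just a reordering; the method names here are placeholders rather than the real code:

```csharp
// Reply first, clean up second. The dispose work costs exactly the same,
// it just no longer sits between the client and its results.
ResultSet results = ExtractResults(mwOutput);
SendResultsToClient(results);    // client gets its answer straight away
DisposeMatlabArrays(mwOutput);   // MATLAB array cleanup happens afterwards
```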

That was the server sorted then; from 1-2 minutes down to under a second – now for the client. This promised to be “interesting” from the outset, as there was loads of string manipulation and nested for-loops manipulating arrays and collections.

Cutting Down the Client

The first batch of tests showed that a group of low-level .NET string splitting functions were by far the biggest offenders, which was a slight surprise. Once again, the ANTS call graph came to the rescue; it showed that a bit of premature optimisation which I had slipped in earlier was causing the problem. I had used a string builder to concatenate all of my SQLite update statements into one giant semicolon-delimited list. This seemed like a good idea at the time, but what actually happened was that the ADO.NET wrapper then had to split that huge string back into individual statements before executing them. So, naturally, I changed the code to make individual calls, which actually made it simpler and more readable.
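Roughly, the change looked like this (simplified, and assuming the System.Data.SQLite ADO.NET provider; the table, column and variable names are made up for the example):

```csharp
// Before (simplified): one giant semicolon-delimited batch built with a
// StringBuilder, which the ADO.NET layer then had to split back into
// individual statements before executing them.
var batch = new StringBuilder();
foreach (var row in rows)
    batch.Append("UPDATE results SET value = ").Append(row.Value)
         .Append(" WHERE id = ").Append(row.Id).Append(';');
using (var batchCmd = new SQLiteCommand(batch.ToString(), connection))
    batchCmd.ExecuteNonQuery();

// After (simplified): individual parameterised commands, which turned out
// to be both faster here and more readable.
using (var cmd = new SQLiteCommand(
    "UPDATE results SET value = @value WHERE id = @id", connection))
{
    foreach (var row in rows)
    {
        cmd.Parameters.Clear();
        cmd.Parameters.AddWithValue("@value", row.Value);
        cmd.Parameters.AddWithValue("@id", row.Id);
        cmd.ExecuteNonQuery();
    }
}
```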

That made a fairly substantial improvement, and now I could see that the code which generated column names was taking up about 30% of the time. The solution this time was to pre-calculate the names and stick them into an array to be reused, which chopped another 29% off the run time. Now the SQLite updates re-emerged as a time consumer. Luckily, the restructuring that had allowed me to cache the column names also meant that I could combine lots of little updates to small groups of columns into fewer, bigger updates that affected the whole record – another run time reduction of about 50%.
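The caching itself was nothing clever – something along these lines, with the naming scheme invented for the example:

```csharp
// Pre-compute the column names once, instead of re-formatting them for
// every row inside the hot loop.
string[] columnNames = new string[channelCount];
for (int i = 0; i < channelCount; i++)
    columnNames[i] = "channel_" + i.ToString("D3");

// Inside the per-row loop, columnNames[i] is now a cheap array lookup
// rather than a fresh string concatenation each time.
```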

The client was now down to about 10 seconds, and I was beginning to get addicted. My focus switched to another section of code which was taking the raw XML data that defined the jobs to be done, and grouping those jobs to minimise the number of calls to the server. This involved a lot of nested loops and lists of results, and the first thing I noticed was a lot of time being spent in Contains() and Item[] methods. Of course, a list isn’t structured in a way that helps with lookups – every test is an O(n) scan – so what I needed was something with cheap, hash-based lookups (O(1) on average), such as a dictionary. So I changed the code to use a dictionary, and we were immediately down to 2 seconds, which was pretty close to the required time.
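The before-and-after is the classic one; this sketch uses string keys, and JobGroup stands in for whatever grouped-job object was actually being looked up:

```csharp
// Before: every membership test against a List is an O(n) linear scan.
var seenKeys = new List<string>();
if (!seenKeys.Contains(jobKey))
    seenKeys.Add(jobKey);

// After: a hash-based Dictionary (or HashSet) makes the same test O(1)
// on average, which matters when it sits inside nested loops.
var jobsByKey = new Dictionary<string, JobGroup>();
if (!jobsByKey.ContainsKey(jobKey))
    jobsByKey.Add(jobKey, group);
```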

Shaving off Seconds

We were now in the region of what a former colleague calls “banana seconds” (a phrase coined by a peer who thought I was “bananas” in my obsession to save valuable clock cycles). Looking at the profiling results, I could see that my greatest costs were now associated with really small things, like getting a date from a string and formatting it for saving into the database. To find the best way of doing this, I ended up with a series of different approaches in the code, allowing me to compare timings. The winner was to convert the source string into an array of chars, and then copy the individual digits of the date into another array that already contained the separators, which could then be used to construct a new string.

Before:
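The original listing isn’t reproduced here; this is a representative sketch, assuming the date arrived as a fixed-width yyyyMMddHHmmss string and went through DateTime on its way to the database:

```csharp
// Parse the raw digits into a DateTime, then format it back out for the
// database – convenient, but comparatively expensive on a hot path.
string raw = "20111024153012";   // assumed yyyyMMddHHmmss input
DateTime parsed = DateTime.ParseExact(
    raw, "yyyyMMddHHmmss", System.Globalization.CultureInfo.InvariantCulture);
string forDb = parsed.ToString("yyyy-MM-dd HH:mm:ss");
```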

After:
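And a sketch of the char-copying version described above, under the same assumption about the input format:

```csharp
// Copy the digits straight into a char array that already contains the
// separators, then build the result string from it in one go.
static string FormatDate(string raw)            // raw = "yyyyMMddHHmmss"
{
    char[] src = raw.ToCharArray();
    char[] dst = "0000-00-00 00:00:00".ToCharArray();   // separators pre-placed

    dst[0] = src[0]; dst[1] = src[1]; dst[2] = src[2]; dst[3] = src[3];  // year
    dst[5] = src[4]; dst[6] = src[5];                                    // month
    dst[8] = src[6]; dst[9] = src[7];                                    // day
    dst[11] = src[8];  dst[12] = src[9];                                 // hour
    dst[14] = src[10]; dst[15] = src[11];                                // minute
    dst[17] = src[12]; dst[18] = src[13];                                // second

    return new string(dst);
}
```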

My attention now moved from the time-in-code column to the total-iterations column. Sorting by that column showed that some lines were being hit millions of times, so even a small change could make a significant improvement. Some more tweaks (it turns out that using the single-char overload of the Append method on a string builder is quicker than appending a single-character string!) and we were finally down to one second. Good as that sounds, it still meant a total time of two seconds, because the server was also taking a second.
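For example:

```csharp
var sb = new StringBuilder();

sb.Append(",");   // slower on a hot path: goes through the string overload
sb.Append(',');   // quicker: the char overload appends the character directly
```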

The Final Countdown

Finally, I started getting hardcore: I found I could use pointers instead of normal array indexing, which meant that, when working with arrays, I could skip all of the bounds checking. This turned out to save about 30% of the time spent dealing with arrays, which added up to a significant amount in the places where I had lots of nested loops over arrays. It was not without risks, obviously, but I concluded that, for code driven entirely by the arrays’ own dimensions (rather than by values passed in from outside), it was a risk I was willing to take. I also had some very heavily called code on the server for creating strings from arrays of chars, and it turns out that string has a constructor which takes a pointer to an array of char. Applying these tweaks to the server as well, I could finally get the total time to just over a second. My job was safe!
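For the record, here is a minimal sketch of both tricks, assuming a project compiled with unsafe code enabled; the real code was specific to our array layouts:

```csharp
// Element access through a pointer inside a fixed block skips the CLR's
// array bounds checks; reasonably safe here only because the loop bound
// comes from the array itself rather than from outside input.
static unsafe double Sum(double[] values)
{
    double total = 0;
    fixed (double* p = values)
    {
        for (int i = 0; i < values.Length; i++)
            total += p[i];
    }
    return total;
}

// string has a constructor that takes a char pointer, so a string can be
// built straight from a pinned char buffer without an intermediate copy.
static unsafe string FromChars(char[] buffer, int start, int length)
{
    fixed (char* p = buffer)
    {
        return new string(p, start, length);
    }
}
```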

ANTS Performance Profiler is available to download from the Red Gate website.