Profiling the Profilers – Part 1: My Timeline has a Memory Leak!

During the final stages of testing for a recent release, a member of our team noticed that memory usage in ANTS Performance Profiler increased over time as they selected different parts of the timeline when the “Database calls” view was open: each time they selected a different region, memory usage rose by a few tens of MB, and never seemed to fall again. We knew we had a memory leak – now we had to find it and fix it. Unfortunately, ANTS Performance Profiler is a hugely complex piece of software, and trying to find this leak by just reading the code was simply not an option – there are just too many things it could be. Fortunately, we have another .NET profiler – ANTS Memory Profiler – whose whole purpose is to help find the cause of memory leaks like this.

A quick primer on memory problems

In traditional, unmanaged code, memory leaks are generally caused by memory being allocated but not deallocated – for example, you create an array of pointers-to-objects, create some objects and put their pointers in the array. Later, you delete the array, but the objects remain in memory as they have to be explicitly deleted – you’ve got a memory leak. In managed code, this situation is handled for you by the garbage collector: once the array is no longer referenced, the collector notices that the objects in it are not referenced by anything else either, and cleans them up for you.

In managed code, therefore, memory leaks typically happen when objects accidentally hold onto objects they no longer need – sometimes by simply holding a reference to an object directly, but often because of an indirect and complex chain of references, event handlers, and delegates.
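To make the event-handler case concrete, here is a minimal sketch in Python, used purely as a garbage-collected stand-in for .NET. The `Publisher` and `Subscriber` classes are hypothetical and do not come from either profiler. A long-lived publisher keeps a list of handlers, and each handler quietly keeps its owning object reachable until it is unsubscribed:

```python
import gc
import weakref


class Publisher:
    """Long-lived object that keeps a list of event handlers."""

    def __init__(self):
        self.handlers = []

    def subscribe(self, handler):
        self.handlers.append(handler)

    def unsubscribe(self, handler):
        self.handlers.remove(handler)


class Subscriber:
    """Short-lived object that we expect to be collected once dropped."""

    def on_event(self):
        pass


def is_alive_after_drop(unsubscribe_first):
    """Return True if the subscriber is still in memory after we drop it."""
    publisher = Publisher()
    subscriber = Subscriber()
    probe = weakref.ref(subscriber)  # observes the object without keeping it alive
    # The bound method holds a reference back to the subscriber.
    publisher.subscribe(subscriber.on_event)
    if unsubscribe_first:
        publisher.unsubscribe(subscriber.on_event)
    del subscriber  # drop our own reference
    gc.collect()
    return probe() is not None
```

Calling `is_alive_after_drop(False)` reports that the subscriber is still alive, because the publisher's handler list still refers to it; with `unsubscribe_first=True` it is collected as expected.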

ANTS Memory Profiler helps us identify the objects that are hanging around in memory unexpectedly by letting us take snapshots of a process’s memory and compare those snapshots to find out which objects have survived from one snapshot to another; having identified the problem objects, it then lets us drill down into the chain of references keeping those objects alive.
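The snapshot-and-compare idea can be sketched in a few lines. This is a toy Python analogue and says nothing about how ANTS Memory Profiler is implemented: `take_snapshot` and `compare_snapshots` are hypothetical helpers that count the live objects tracked by Python's own collector and report which classes gained or lost instances between two points in time.

```python
import gc
from collections import Counter


def take_snapshot():
    """Count live, GC-tracked objects by class name."""
    gc.collect()
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def compare_snapshots(baseline, current):
    """Return {class name: change in instance count}, largest growth first."""
    names = set(baseline) | set(current)
    # Counter returns 0 for class names missing from one of the snapshots.
    diff = {name: current[name] - baseline[name] for name in names}
    changed = [(name, delta) for name, delta in diff.items() if delta != 0]
    changed.sort(key=lambda item: item[1], reverse=True)
    return dict(changed)
```

Taking a baseline, exercising the application, and then comparing a second snapshot against the baseline puts the classes with the largest growth at the front of the result, which is the same question the class list view answers.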

Finding the problem

So, back to my problem with ANTS Performance Profiler. To begin investigating, I launched the Performance Profiler from ANTS Memory Profiler, loaded some previous profiling results into it and switched to the “Database calls” view. Once this initial setup was complete, I took a memory snapshot to act as a baseline. I then clicked around in the timeline in ANTS Performance Profiler, and took a second snapshot. Once I had done this, ANTS Memory Profiler presented me with this:

image1-p1.png

There are several things to note here: firstly, I can see from the timeline at the top of the window that both the private bytes and working set of the process increased over time as I was clicking around. This is confirmed by the “Total size of live objects” table, which tells me that the second snapshot has objects using 97.25MB more memory than the first. Finally, the “Largest classes” table at the bottom right shows me that StackTrace objects are my biggest consumer of memory, and that the memory consumed by them has increased dramatically between the two snapshots. This gives me a hint that these are the objects being leaked.

Next, switching to the class list view and sorting the table by “Size diff (bytes +/-)”, I can see that, yes, the biggest increase in memory use comes from those StackTrace objects, and that the second snapshot has nearly a million more instances than the first!

image2-p1.png

I know from the code that StackTraces are tied to profile range selections – when the user selects a profile range in the timeline, the profiler reads the data for that period from disc and constructs a ProfileRange object that holds the stack traces recorded in that time period. When the profile range is changed, the old range should be destroyed, and the StackTrace objects along with it – but it looks like, for some reason, this is not happening here. I can confirm this by taking more snapshots (changing the timeline selection between each) and noting the number of live instances climbing each time.

But why?

Identifying the leaking object is only the first part of the story, though – I still need to understand why all those objects are hanging around. There are a large number of relatively small objects being leaked (as opposed to a small number of large objects), and ANTS Memory Profiler can help me by grouping those objects according to the reference chains holding onto them, and finding the most common.
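Grouping by reference chain can be pictured with a toy model. The Python sketch below is an illustration only, not the profiler's algorithm: it assumes each object has exactly one referrer (a real heap is a graph), and the type names `Registry`, `Entry`, `Window` and `Item` are invented for the example.

```python
from collections import Counter


def chain_to_root(obj, referrer_of, type_of):
    """Follow referrers from obj up to a root; return type names, root first."""
    chain = []
    seen = set()
    while obj is not None and obj not in seen:
        seen.add(obj)  # guards against cycles in the referrer map
        chain.append(type_of[obj])
        obj = referrer_of.get(obj)
    return tuple(reversed(chain))


def categorize(instances, referrer_of, type_of):
    """Group instances by their reference chain, most common chain first."""
    counts = Counter(chain_to_root(i, referrer_of, type_of) for i in instances)
    return counts.most_common()


# Toy heap: object ids are strings, and each object has a single referrer.
type_of = {"registry": "Registry", "dict": "Dictionary",
           "entry": "Entry", "window": "Window"}
referrer_of = {"dict": "registry", "entry": "dict"}
for n in range(5):
    type_of[f"item{n}"] = "Item"
    # Three items hang off the dictionary entry, two off the window.
    referrer_of[f"item{n}"] = "entry" if n < 3 else "window"

groups = categorize([f"item{n}" for n in range(5)], referrer_of, type_of)
```

Here `groups[0]` is the chain `Registry, Dictionary, Entry, Item` with a count of 3, so the most common chain is the first place to look.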

Opening the “Instance Categorizer” view with the StackTrace row selected shows me this:

image3-p1.png

The profiler has identified a single chain that represents at least 60% of the objects that I’m interested in. This view in ANTS Memory Profiler shows me the chain of references that hold this set of StackTrace objects in memory – the object I’m looking at is on the right of the chain, and reading the chain from right to left shows me the objects that are keeping it alive. If I scroll back a little way, I see:

image4-p1.png

These StackTrace objects are being kept alive by SqlTableQueryRows, which fits with what I saw earlier: memory usage rose when I clicked around the timeline with the “Database calls” table open. Each row holds a reference to a ProfileRange object, which in turn has a reference to a large number of StackTraces containing the profiling results data. I know from working with the ANTS Performance Profiler codebase that the table rows are rebuilt from scratch each time a new region of the timeline is selected – old rows should be thrown away and new ones constructed and added to the table – but it looks like the old rows are still hanging around. Continuing further back along the chain, I get to:

image5-p1.png

And this is my smoking gun. The SqlTable is the UI component that shows the SQL queries, and it has a dictionary which maps ITableRows to IconLocations; it’s used for working out which row of the table a user has clicked on. I know that SqlTableQueryRow implements ITableRow, and that the SqlTable object persists over the whole session. When the table changes, it clears out its rows and adds a new set, so it seems this dictionary is holding onto rows for longer than it should. We would expect this dictionary to be cleared out whenever the table itself is changed but, looking at the code for the SetTable() method (which is responsible for updating the table with new data), this doesn’t appear to be happening:
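The shape of the problem can be modelled in a few lines. What follows is a Python sketch of the pattern, not the profiler's C# source: `SqlTableModel`, `set_table` and `icon_locations` are hypothetical stand-ins for SqlTable, SetTable() and its row-to-IconLocation dictionary, and the row and location values are made up. The point is only that replacing the row list while leaving the dictionary alone makes the dictionary grow with every selection:

```python
class Row:
    """Toy stand-in for a table row; hashable by identity."""


class SqlTableModel:
    """Toy stand-in for a long-lived table control (hypothetical names)."""

    def __init__(self):
        self.rows = []
        self.icon_locations = {}  # row -> icon location, used for hit-testing

    def set_table(self, new_rows):
        # The visible rows are replaced...
        self.rows = list(new_rows)
        # ...but entries for the previous rows are never removed from the
        # dictionary, so every old row stays reachable through the table.
        for index, row in enumerate(self.rows):
            self.icon_locations[row] = (0, index)


def dictionary_sizes(selections, rows_per_selection):
    """Simulate repeated timeline selections and record the dictionary size."""
    table = SqlTableModel()
    sizes = []
    for _ in range(selections):
        table.set_table([Row() for _ in range(rows_per_selection)])
        sizes.append(len(table.icon_locations))
    return sizes
```

With three selections of ten rows each, `dictionary_sizes(3, 10)` returns `[10, 20, 30]`: the table only ever displays ten rows, yet the dictionary still holds every row it has ever shown, along with everything those rows reference.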

So, if I’m right, then adding code to clear out this dictionary at the same time should fix the problem:
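As a rough illustration of that fix, here is a Python model of the pattern rather than the actual C# change. The names `SqlTableModel`, `set_table` and `icon_locations` are hypothetical stand-ins for SqlTable, SetTable() and its dictionary, and the `clear_dictionary` switch exists only so the two behaviours can be compared side by side. A weak reference checks whether a row from an earlier selection can really be collected:

```python
import gc
import weakref


class Row:
    """Toy stand-in for a table row; hashable by identity."""


class SqlTableModel:
    """Toy stand-in for a long-lived table control (hypothetical names)."""

    def __init__(self, clear_dictionary):
        self.clear_dictionary = clear_dictionary
        self.rows = []
        self.icon_locations = {}  # row -> icon location

    def set_table(self, new_rows):
        if self.clear_dictionary:
            # The fix: forget the old rows before registering the new ones.
            self.icon_locations.clear()
        self.rows = list(new_rows)
        for index, row in enumerate(self.rows):
            self.icon_locations[row] = (0, index)


def old_row_survives(clear_dictionary):
    """Return True if a row from the first selection outlives the second."""
    table = SqlTableModel(clear_dictionary)
    first = [Row() for _ in range(5)]
    probe = weakref.ref(first[0])  # observes the row without keeping it alive
    table.set_table(first)
    del first  # the caller drops its own references
    table.set_table([Row() for _ in range(5)])
    gc.collect()
    return probe() is not None
```

In this model `old_row_survives(False)` is true, because the stale dictionary entry keeps the old row reachable, while `old_row_survives(True)` is false once the dictionary is cleared alongside the rows.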

Having made this change, I rebuilt ANTS Performance Profiler and repeated the same memory profiling exercise from earlier (being careful to make sure I did the same thing both times), and what I now see is:

image6-p1.png

There’s still a rise in memory – but it’s much smaller than before (~60MB vs ~100MB), and more importantly, the increase in memory use by the StackTrace objects has almost halved. So it looks like clearing out that dictionary has definitely helped – but there’s still a bit of a leak. I’ll talk about that next time.