.NET Performance Cribsheet

Robyn and Phil tackle the topic of how to make .NET applications perform well. As usual, they try to take a terse, practical approach to the mysteries of JIT, CLR and GC. After giving many performance tips, they come to the conclusion that rules and tips are less useful than rolling up your sleeves and measuring or profiling everything possible, to see what's really happening with your application.

.NET Performance - The Crib Sheet

For things you need to know rather than the things you want to know

Overview

If performance is important to an application you are developing, then a ‘performance culture’ should permeate the whole development process, right through from initial design to acceptance testing.

This may seem obvious, but a recent high-profile public IT project failed when, just prior to release, it was discovered that each business transaction was taking 27 seconds. It is hard to imagine the sort of developmental chaos that could produce this sort of outcome. Obviously, performance wasn’t in the functional spec, and wasn’t considered when picking the application architecture.

There are several “rules” to developing .NET applications that perform well. You should:

  • Design first, then code. The application’s architecture is the largest contributing factor to its performance
  • Have the right tools available to measure performance, and use them to understand how long the processes take to execute, and why.
  • Write code that is clear and is easy to maintain and understand. By far the greatest optimisations come through changing the algorithm, so the clearer the code, the easier it is to subsequently optimise. Never indulge in the habit, well-known amongst Perl programmers, of writing obscurantist code from the start, in the mistaken belief that it will compile into faster code.
  • Gain an understanding of the underlying tasks that the framework performs, such as memory management and JIT compilation.
  • Set performance goals as early as possible. You must decide what represents a ‘good’ performance: perhaps by measuring the speed of the previous version of the application, or of other similar applications. Make sure that clear, documented, performance objectives on all parts of the project are agreed ‘upfront’, so that you know when to start and stop optimising, and where the priorities are. Never micro-optimise. You should revise performance goals at each milestone.
  • Only optimise when necessary. Because detailed optimisation is time-consuming, it should be carefully targeted where it will have the most effect. The worst thing a programmer can do is to plough through his code indiscriminately, changing it to optimise its execution. It takes time, makes the code more difficult to understand and maintain, and usually has very little effect.
  • Avoid optimising too early. Detailed optimisation of code should not be done until the best algorithm is in place and checked for throughput and scalability.
  • Not delay optimisation too far. The most performance-critical code is often that which is referenced from the most other places in the application: if the fix requires a large change to the way things operate, you can end up having to rewrite or refactor a huge portion of your application!
  • Assume that poor performance is caused by human error rather than the platform. It is always tempting to blame the technical platform, but the CLR is intrinsically fast, and is written from the ground up with performance in mind. It is easy to prevent it doing its job by doing things that the design didn’t allow for. Even the fastest car slows down if you insist on using the wrong gear.
  • Employ an iterative routine of measuring, investigating, and refining/correcting from the beginning to the end of the product cycle.

Measuring and identifying

There are many aspects to code performance. The most important of these for the developer are throughput, scalability, memory usage, and the startup time of the application. All of these must be measured.

The first thing a programmer should do is to find out which part of the code is causing the problems. Usually, the starting place is to use Performance Counters and the CLR Profiler, using a variety of workloads. Don’t make any assumptions about the performance of the platform or of the APIs you are using. You need to measure everything!

This phase of testing and measuring always brings surprises. Other than the Performance Counters and the CLR Profiler, you will want to use a conventional third-party profiler to find out which methods in your application are taking the most time and are being called the most often.

Performance Counters

Performance Counters may be accessed either programmatically or through the Windows Performance Monitor Application (PerfMon). There are Windows performance counters for almost every aspect of the CLR and .NET Framework. They are designed to identify a performance problem by measuring, and displaying, the performance characteristics of a managed application. They are particularly useful for investigating memory management and exception handling.

For example, performance counters will provide information about:

  • The garbage collector.
  • The exceptions thrown by the application.
  • The application’s interaction with COM components, COM+ services, and type libraries.
  • The code that has been compiled by the JIT compiler.
  • Assemblies, classes, and AppDomains that have been loaded.
  • Managed locks and threads used.
  • Data sent and received over the network.
  • Remoted objects used by the application.
  • The security checks the CLR performs for your application.

These performance counters are always available and aren’t invasive; they have low overhead and will not affect the performance characteristics of your application significantly, unless you are monitoring a lot of performance counters.
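
Counters can also be read programmatically through the System.Diagnostics namespace. Here is a minimal sketch (the category, counter and instance names below are the standard ones for the CLR memory counters; error handling is omitted):

    using System;
    using System.Diagnostics;

    class GcCounterSample
    {
        static void Main()
        {
            string instance = Process.GetCurrentProcess().ProcessName;

            // "% Time in GC" from the ".NET CLR Memory" category for this process.
            using (PerformanceCounter gcTime =
                new PerformanceCounter(".NET CLR Memory", "% Time in GC", instance))
            {
                // Some counters return 0 on the first sample, so sample twice.
                gcTime.NextValue();
                System.Threading.Thread.Sleep(1000);
                Console.WriteLine("% Time in GC: {0}", gcTime.NextValue());
            }
        }
    }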

CLR Profiler

By using the CLR Profiler, you can identify code that allocates too much memory, causes too many garbage collections, and holds on to memory for too long.

The CLR Profiler is designed to show who allocates what on the managed heap, which objects survive on the managed heap, who is holding on to objects, and what the garbage collector does over the lifetime of your application. It is provided as both a command-line application and as a Windows application. With it, you can see much of what is happening to the managed heap of a process whilst a .NET application is running, and can study the behavior of the garbage collector. It allows you to track, and graph, such aspects as object creation, memory consumed by objects, heap size, the type of objects created, and so forth.

The CLR Profiler lets you see which methods allocate which types of objects, and to observe the life-cycle of objects and what keeps objects alive. It logs when methods, classes, and modules get pulled in and called, and by whom.

The CLR Profiler starts the application it will be profiling. It slows the application down as it writes out log data, and will quickly fill a large amount of disk space. It is an ‘intrusive’ tool, in that it causes your application to run between 10 and 100 times slower. If you need timings, you should use the performance counters instead.

Writing optimizer-friendly code

At the heart of the .NET Framework is the Common Language Runtime (CLR). The CLR has much in common with the Java Virtual Machine. Java is compiled to Java byte code which then runs in the Java Virtual Machine (JVM). C# code is compiled to an Intermediate Language (IL) which then runs in the Common Language Runtime (CLR). On the Java platform, byte code is compiled ‘Just In Time’ (JIT) before being run on the CPU. On the .NET platform, the same happens with the IL code.

Like the JVM, the CLR provides all the runtime services for your code; Just-In-Time compilation, Memory Management and Security are just a few of these services.

When an application is run, the JIT compiler does a number of optimizations such as common sub-expression elimination, de-virtualization, use of intrinsic methods, constant and copy propagation, constant folding, loop unrolling, loop invariant hoisting, Enregistration, Method Inlining, and Range check elimination. Besides the obvious advice of keeping methods short, there is little to do to enhance any of these optimizations except, perhaps, the last three: Enregistration, Method Inlining, and Range check elimination. Although you can improve the effectiveness of compilation, and the speed of the compiled code, you would only consider changing a program in this way where you have already identified a critical section, by profiling. You can certainly prevent efficient optimization by doing code obfuscation because obfuscation can generate code that the optimizer can’t deal with, thereby causing a noticeable runtime penalty.

In order to investigate the JIT optimizations, you must look at the assembly code produced by the JIT at run time, for the release build of your application. If you try to look at the assembly code from within a Visual Studio debug session, you will not see optimized code, because JIT optimizations are disabled by default when the process is launched under the debugger. You’ll therefore need to attach Visual Studio to a running process, or run the process from within CorDbg with the JitOptimizations flag on, by issuing the “mode JitOptimizations 1” command from within the CorDbg command prompt.

Enregistration

The JIT compiler tracks the lifetime of a set number of locals and formal arguments. With this information, it tries to use CPU registers where possible to store locals and method arguments (32 bit integers, object references, 8 and 16 bit integers etc). As there is only a small number of registers, the fewer variables you have, the better the chance they will get enregistered, so it makes sense to re-use a variable when possible rather than add a new one.

An enum is normally a 32-bit int for code generation purposes. Variables which are more than 32 bits in size are never enregistered.

Method Inlining

Inlining removes the overhead of a method call, so ‘in-lined’ methods are faster to execute. The compiler can opt to ‘in-line’ a method by inserting the child method into the parent method in its entirety. Simple, small methods such as Property get and set accessors, which are not virtual, do not initialize private data members, and do not have exception handling blocks, are good candidates for ‘in-lining’.

An ideal candidate method is one whose compiled code is 32 bytes or less, because no space is wasted; the method pointer occupies 32 or 64 bits anyway and performance is boosted because the CPU doesn’t need to “jump” as often across to different memory locations. Inline methods should not contain any complex branching logic, or any exception handling-related mechanisms, or structs as formal arguments or local variables, or 32-bit floating point arguments or return values.

You cannot ‘inline’ a virtual method call, because the actual method to be called is unknown at compile-time and can change every time the method is called.
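
By way of illustration, here is a sketch of the distinction (the JIT makes the final decision, so the property below is a candidate, not a guarantee):

    public class Point
    {
        private int x;

        public Point(int x) { this.x = x; }

        // Small, non-virtual and free of exception handling:
        // a typical candidate for inlining.
        public int X
        {
            get { return x; }
        }

        // Virtual: the real target is unknown until run time,
        // so this call can never be inlined.
        public virtual int DoubleX()
        {
            return x * 2;
        }
    }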

Range Check Elimination

The compiler will generally do range checking of arrays when you are iterating through arrays in a loop using the for(i=0; i<A.length; i++) pattern. In certain cases, this is slow and unnecessary. In simple, tight loops that test the loop variable directly against array.Length, rather than against a cached copy of the length, and that iterate over a local reference to the array rather than a static field, the compiler can recognize the pattern and eliminate its own separate check. This can be effective when the loop is scanning large jagged arrays, for instance, as it removes implicit range-checks in both the inner and outer loops.
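
A sketch of the pattern in question:

    static int Sum(int[] a)
    {
        int sum = 0;

        // Comparing i against a.Length directly (not against a cached
        // copy of the length) lets the JIT prove that a[i] is always in
        // bounds, so it can eliminate the per-access range check.
        for (int i = 0; i < a.Length; i++)
        {
            sum += a[i];
        }
        return sum;
    }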

Coding for performance

Value Types and Reference types

The CLR uses either reference types or value types. Reference types are always allocated on the managed heap and are passed by reference (as the name implies). Value types are allocated on the stack, or inline as part of an object on the heap.

Your code should use value types where a reference type isn’t needed, because value types are more economical to use, especially if they are kept small and simple. The value is copied every time you pass it to a method. Be aware, though, that if you have large and complex value types, this copying can get expensive, so some programmers avoid value types altogether unless they are really simple. The cost of performing a garbage collection is a function of the number of live reference types your application has allocated. Value types can also be treated as objects, but they have to be ‘Boxed’. Boxing involves the costly operation of allocating a new object on the managed heap, and copying the old value into the new object. If you use a method that inherits from System.Object on a value type, then the value is likely to be boxed into a reference type.
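
A minimal illustration of the boxing cost just described:

    int i = 123;
    object o = i;      // boxing: a new object is allocated on the managed heap
    int j = (int)o;    // unboxing: the value is copied back out of that object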

Collections

If you create a collection of values, rather than using an array, you will find that every item is boxed when added to the collection and unboxed when the value is retrieved from the collection. Iterators such as the foreach statement may be expensive, as they are unrolled into a sequence of virtual calls. General-purpose collections such as ArrayList are not always a good choice, and it is generally better to use custom typed collections instead.
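
For example (a sketch; ArrayList lives in System.Collections):

    ArrayList list = new ArrayList();
    list.Add(42);                       // the int is boxed into a heap object
    int first = (int)list[0];           // and unboxed again on retrieval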

A collection should be pre-sized where possible. If you don’t specify how many entries your collection is expected to contain, the original default capacity is small and when this value is exceeded, the collection gets resized. Depending on the type of the storage, a new storage object may be allocated, normally at double the capacity, and the contents of the original collection get copied to the newly allocated storage. This takes time and resources, so should be avoided where possible.
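
If an estimate of the final size is available, it can be passed to the constructor. A minimal sketch (the figure of 10,000 is arbitrary):

    // Pre-sizing avoids the repeated allocate-double-and-copy cycle
    // as the collection grows.
    ArrayList items = new ArrayList(10000);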

Generics

Generic collections provide a way to avoid the boxing and unboxing overhead that comes with using value types in collections. Using generics isn’t a free lunch, because they have an effect on the size of the JITed code, particularly if there are a large number of closed constructed types/methods in each generic type/method definition. For best performance, there is an advantage in writing your own optimized collection classes.
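
A minimal sketch of the difference (List<T> lives in System.Collections.Generic):

    List<int> numbers = new List<int>();
    numbers.Add(42);                    // stored directly: no boxing
    int n = numbers[0];                 // no cast, no unboxing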

Strings

Use the StringBuilder class for complex string concatenation and manipulation.

Strings are immutable (that is, the referenced text is read-only after the initial allocation); a string cannot be modified in place. This provides many performance benefits but is awkward for anyone who is accustomed to C/C++ string manipulation techniques. For example, repeated concatenation through String.Concat() (which is what the + operator compiles to) is slow and causes unnecessary copies.
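
A sketch of the contrast (the loop bound of 1,000 is arbitrary; StringBuilder lives in System.Text):

    // Each += copies the whole string built so far, so the cost
    // grows quadratically with the number of iterations.
    string slow = "";
    for (int i = 0; i < 1000; i++)
    {
        slow += i;
    }

    // StringBuilder appends into an internal buffer instead.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 1000; i++)
    {
        sb.Append(i);
    }
    string fast = sb.ToString();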

If possible, use for rather than foreach() to enumerate the characters in a string.

Minimising start-up times

Much of the work of improving start-up times involves the relatively simple task of loading fewer modules at startup and making sure that nothing is loaded unnecessarily. It also pays to delay initialisation by, for example, doing initialisation ‘on-demand’ when the feature is first used. There is a common misconception that everything must be loaded and initialised before the application starts, so it is generally easy to improve the application startup times.

The application’s configuration settings

Registry settings and INI files present a very quick way of loading application settings. Conversely, using an XML-based config file often isn’t a good way to do this, particularly where the application loads more settings than are initially required. XML-based config files are very useful where the configuration is complex and hierarchical, but if it isn’t, then it represents an unnecessary overhead.

pre-JITing

Managed Assemblies contain Intermediate Language (IL). The CLR JIT-compiles the IL into optimized CPU instructions. The JIT compilation takes a certain amount of time, but this startup cost can be removed by compiling at install time (pre-JITing) using NGEN.exe. This produces a native binary image for the application, thereby eliminating the JIT overhead, but at the expense of portability. When an NGEN-generated image is run in an incompatible environment, the .NET Framework automatically reverts to using JIT. Once NGEN is run against an assembly, the resulting native image is placed into the Native Image Cache for use by all other .NET assemblies. A pre-JITed assembly is a persisted form of JIT-compiled MSIL, together with the class/v-table layouts. It is always worth checking performance before and after using this technique, as it can actually have an adverse effect.
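
With the .NET Framework 2.0 version of the tool, run from an SDK command prompt, a typical invocation looks like this (MyApp.exe is a placeholder):

    ngen install MyApp.exe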

If a pre-JITed assembly is not installed in the GAC (Global Assembly Cache), then Fusion (the .NET technology used to locate and load .NET assemblies) needs to verify that the native image and the MSIL assembly have the same version, to ensure that the cached native image relates to the current version of the code during binding. In order to do that, the CLR needs to access pages in the MSIL assembly, which can hurt cold startup time.

Using the Global Assembly Cache

As part of the .NET Framework installation, a central repository is created for storing .NET assemblies. This is called the Global Assembly Cache, or GAC. The GAC will have a centralized copy of the .NET Framework itself, and it can also be used to centralize your own assemblies.

If an assembly is installed in the Global Assembly Cache (GAC), it tells the CLR that changes have not occurred to it. This allows it to skip hash verification of the integrity of the assembly since the check would already have been done in order to place it in the GAC. Otherwise, if an assembly is strongly named, the CLR will check the integrity of the assembly binary on loading, by verifying that the cryptographic hash of the assembly matches the one in the assembly manifest.

It is worth avoiding the hash verification because it is CPU-intensive and involves touching every page in the assembly. The extent of the impact depends on the size of the assembly being verified.
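
For reference (a sketch; gacutil.exe ships with the .NET Framework SDK, the assembly must be strongly named, and MyLibrary.dll is a placeholder), an assembly is installed into the GAC like this:

    gacutil /i MyLibrary.dll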

Using Memory Sensibly

Objects in .NET are allocated from the ‘managed heap’. The managed heap is so-called because, after you ask it for memory, the garbage collector takes care of its cleanup once it is no longer required.

Garbage collection in any managed environment begins by assuming all objects are unnecessary until proven otherwise. Essentially, an object proves that it is necessary by being reachable, directly or indirectly, from a ‘root’ such as a local variable, a CPU register, or a static field. If the object is no longer reachable, it is flagged to be discarded. The process of allocating, managing and disposing of memory comes at a cost. Memory has to be coalesced into contiguous blocks, it must be allocated to the instance of a type, it must be managed over the lifetime of the instance, and it must be freed when it is no longer needed.

Memory Allocation

Managed heap allocations do not generally slow code down much. The allocator is better optimized than the traditional malloc routine: instead of scanning a linked list of memory blocks to find, or coalesce, the first block of sufficient size, it simply advances a pointer into a contiguous region of memory. Code only slows down when memory has to be freed up to satisfy an allocation. If the garbage collector is confronted with many ‘pinned’ objects, its task will be slower, because it cannot move the memory of a pinned object whose address has been passed to a native API.

Memory Management

Processing power must be used once memory is allocated to an instance, and there is a cost associated with managing memory over the lifetime of an object. The CLR garbage collector is ‘generational’ in that the managed heap contains ‘young’ new objects, longer-lived objects, and ‘old’ objects in separate ‘generations’. The Garbage Collector does not always pass over the entire heap before it can recycle memory, so it must check whether objects that have memory assigned to them still hold references to objects in younger ‘generations’. The Garbage collector will always collect the smallest section of the heap possible in order to free enough memory for the application to continue.

A scan that just does ‘young’ new objects (Generation 0) is very rapid, whereas a scan that takes in old objects (Generation 2) is likely to affect your application’s performance.

Memory assigned to a live object can be moved around in order to meet the demands for a new memory block. If an object is too large for this sort of treatment, it is allocated to a special area of the heap called the Large Object Heap, where objects are not relocated. The Large Object Heap is only collected as part of a full (Generation 2) collection. The rapid creation and disposal of large (roughly 85 KB or more) object instances is therefore likely to slow an application down.

The more objects that are created, the more allocations take place. The greater the number of allocations, the more overhead is involved in Garbage collection. For purely performance reasons, rather than logical design reasons, it is best to stick to two types of objects:

  1. Short-term objects that can contain object references to other short-term objects, or
  2. Long term objects with references only to other long-term objects.

Large short-term objects slow applications down, as do long-term objects that refer to young objects.

It is all a matter of compromise, but this may need attention if the ‘.NET CLR Memory: % Time in GC’ performance counter goes above 30%. The ratio between the ‘.NET CLR Memory: # Gen 0 Collections’ and ‘.NET CLR Memory: # Gen 2 Collections’ performance counters will give the best indication as to whether your ‘allocation profile’ is likely to cause the Garbage collection to slow the application down.

In the CLR, there are two different Garbage Collectors, a Workstation GC and a Server GC. The Workstation GC is optimized for low latency. It works on one thread only and therefore on one CPU only, whereas the Server GC (used by ASP.NET, for example) scales better for larger, multiprocessor applications.

Finalizing

Finalizable objects have to be maintained in the system longer than objects without finalizers. Finalization is the process whereby a dead object’s native resources, such as database connections or operating system handles, are freed up if it has not been disposed. This happens before its memory is returned to the heap. Because it can happen in the background, the object is queued up (on the Finalization Queue). The finalization process only starts when the garbage collector comes across the object: the object is moved to a different queue (the FReachable Queue) and promoted to the next generation, and the finalization itself is then carried out by a different thread. Only after the object has been finalized can the garbage collector free up and reuse its memory.

If your object does not require finalization, do not implement a finalizer, because there is an overhead to calling the Finalize method. If your object needs finalization, then implement the Dispose pattern. Finalization is usually only needed when an object that requires cleanup is not deliberately killed. Don’t make the parent class ‘finalizable’: only the wrapper class around the unmanaged objects that need cleanup needs to be finalizable. Make these ‘finalizable’ objects as small and simple as possible, and never let them block.

Disposing

If you no longer need an object, then it can be disposed of. “Disposing” of the object is relatively easy and well optimized in the CLR. In both VB.NET and C# you’d generally use a using block. You need to have your object implement the IDisposable interface and provide an implementation for the Dispose method. In the Dispose method, you call the same cleanup code that is in the Finalizer and inform the GC that it no longer needs to finalize the object, by calling the GC.SuppressFinalize method.

If you have a class with a Finalizer, it makes sense to use a Dispose method to cut down the overhead of the finalization. You should call just one common finalization, or cleanup, function from both the Dispose method and the Finalizer, to save on the maintenance overhead. Where a Close method would be more logical than a Dispose method, then the Close method that you write can simply call the Dispose method.
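
A minimal sketch of the pattern (the handle field and the body of Cleanup are placeholders):

    public class ResourceHolder : IDisposable
    {
        private IntPtr handle;            // hypothetical unmanaged handle

        public void Dispose()
        {
            Cleanup();
            GC.SuppressFinalize(this);    // the finalizer need not run now
        }

        ~ResourceHolder()                 // finalizer: a safety net only
        {
            Cleanup();
        }

        private void Cleanup()
        {
            // release the unmanaged resource here
        }
    }

A caller would then wrap the object in a using block, which guarantees that Dispose is called even if an exception is thrown:

    using (ResourceHolder r = new ResourceHolder())
    {
        // use the resource
    }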

Weak References

If an object is reachable via a ‘strong reference’ (from the stack, a register, another object, or one of the other GC roots), then it will not be recycled by the Garbage Collector, because it assumes the object is still required. If you want to keep a reference to an object without preventing the Garbage Collector from reclaiming it, use a Weak Reference instead.
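
A sketch of the idiom (BuildExpensiveCache() is a hypothetical factory for a large, recreatable object):

    WeakReference weak = new WeakReference(BuildExpensiveCache());

    // Later: the Target property returns null once the GC has collected it.
    object cache = weak.Target;
    if (cache == null)
    {
        cache = BuildExpensiveCache();    // recreate on demand
        weak.Target = cache;
    }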

Common Language Runtime issues

Exception Handling

Exceptions represent a very efficient way of handling errors; far better than the old VB On Error Goto. However, they should only be used in exceptional or unexpected circumstances. Do not use exceptions for normal flow control: It is a common mistake in Java, VB.NET and C#, to use exception handling instead of checking for all the conditions that might cause an error. The reason for avoiding use of exceptions in normal flow control is the processing required once an exception is thrown. An expensive stack walk is required to find an appropriate exception handler for the thrown exception.
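
As an illustration (text is an assumed input string; int.TryParse is available from .NET 2.0 onwards):

    // Expensive: an exception is used for ordinary flow control
    // whenever the input is not a number.
    int value;
    try
    {
        value = int.Parse(text);
    }
    catch (FormatException)
    {
        value = 0;
    }

    // Cheaper: test first, and no exception is ever thrown.
    if (!int.TryParse(text, out value))
    {
        value = 0;
    }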

Security

If code is Fully Trusted, and the security policy remains at the default, then security has a negligible additional impact on the throughput and startup time of a .NET application. If code is Partially Trusted, or MyComputer Grant Set is narrowed, then the effect on the application will increase.

Changes to policy, such as lowering grants to the MyComputer zone or exclusive code groups, can affect performance and startup time.

When the CLR finds an Authenticode signature as it loads a strongly named assembly into a process in the .NET Framework 2.0, it generates Publisher Evidence from it. This means that it has to fully validate the certificate by contacting the issuing authority across the internet to ensure the certificate has not been revoked, and applications can experience very long startup times in these cases. If there is no internet connection, the process times out after a long delay. In such circumstances, you may need to turn off auto-generation of publisher evidence, or use Authenticode certificates that don’t have a Certificate Revocation List Distribution Point (CDP) embedded in them. Sometimes the application developer is simply unaware that he is using a third-party component that requires a certificate to be validated at startup.
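
From .NET Framework 2.0 SP1 onwards (worth checking against the version you are targeting), auto-generation of publisher evidence can be switched off in the application’s .config file:

    <configuration>
      <runtime>
        <generatePublisherEvidence enabled="false"/>
      </runtime>
    </configuration>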

Late-Binding

Late-binding is generally the slowest reflection technique. Occasionally it is necessary to invoke the members of a class dynamically, or to dynamically emit, JIT and execute a method. Although the generated code itself can be very fast, the facility usually brings with it a performance tradeoff. You may be using reflection, such as late-binding, without realizing it: an API that you use might, in turn, make transitive use of the reflection APIs.

If, in VB.NET or JScript.NET, you use a variable without explicitly declaring it, then Reflection is used to convert the object to the correct type at runtime. Unfortunately, a late-bound call is a great deal slower than a direct call. If you are using VB.NET and you don’t require late binding, you can make the compiler enforce the declaration by strongly typing your variables and turning off implicit casting. Simply include Option Explicit On and Option Strict On at the top of your source files.

COM Interop and Platform Invoke

COM Interop and Platform Invoke are easy to use but they are significantly slower than regular managed calls. There is a fixed delay of around fifty instructions to make the transition between native and managed code, and a variable delay caused by the task of marshalling any arguments and return values. This delay depends entirely on the nature of the transformation. In some cases, the conversion of data types, such as the conversion of a string between CLR Unicode and Win32 ANSI, can be a surprisingly slow, iterative process, whereas primitive types, and arrays of primitive types, are almost free.

Where there is no alternative to Interop or P/Invoke, the number of transformations should be kept to a minimum. Code that is continuously passing data back and forth will inevitably be slow. As much work as possible should be performed inside each Interop call, so as to avoid multiple, frequent invocations. It is also sensible to copy data just once into the unmanaged world, and maintain that as a static resource.

Initialize managed threads to the appropriate COM threading model. Avoid implementing IDispatch for managed COM servers, and calling unmanaged COM servers through IDispatch.
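
A sketch of both techniques (WorkerProc is a hypothetical placeholder; Thread.SetApartmentState is available from .NET 2.0):

    using System.Threading;

    static class ComInteropThreads
    {
        [STAThread]   // the main thread joins a single-threaded apartment
        static void Main()
        {
            // Give a worker thread an explicit model before it starts.
            Thread worker = new Thread(WorkerProc);
            worker.SetApartmentState(ApartmentState.MTA);
            worker.Start();
        }

        static void WorkerProc()
        {
            // call unmanaged COM servers from here
        }
    }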

Threading and Synchronization

Threading and synchronisation are handled by the CLR. It may seem obvious to use threading to increase the performance of an application and, certainly, multithreading can significantly increase the perceived (and sometimes the real) performance of a GUI application. However, it can occasionally have the reverse effect, because of the overhead involved. It generally pays to use the thread pool and to minimize the creation of threads, as in the sketch below. Avoid fine-grained locks, and avoid using shared reader/writer locks (RWLock) for general locking.
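
A minimal sketch of queuing work to the pool:

    using System;
    using System.Threading;

    class PoolExample
    {
        static void Main()
        {
            // Borrow a pool thread rather than paying to create a new one.
            ThreadPool.QueueUserWorkItem(DoWork);

            Console.ReadLine();   // keep the process alive for the demo
        }

        static void DoWork(object state)
        {
            // the background work happens here
        }
    }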

Use asynchronous calls for improved scalability on multiprocessor machines and for improved perceived performance in Windows Forms applications.

Reflection

Reflection, which is a way of fetching type information at runtime, relies on getting at the metadata that is embedded in managed assemblies. Many reflection APIs require searching and parsing of the metadata, which takes time.

Type comparison operations take less time than Member enumerations. The latter allow you to inspect the methods, properties, fields, events, constructors and so on of a class. At design time, they are essential, but at runtime, they can cause delays, if done indiscriminately, due to the overheads associated with reflection.

Type comparisons (e.g. typeof()), member enumerations (e.g. Type.GetFields()) and member access (e.g. Type.InvokeMember()) all use reflection, and so should be avoided in performance-critical code. It is not always obvious when reflection is being used: some APIs use reflection as a side effect (e.g. Object.ToString()).
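
For instance, here is the same property read done early-bound and late-bound (a sketch; the late-bound form is typically orders of magnitude slower):

    string s = "hello";

    // Early-bound: resolved at compile time.
    int direct = s.Length;

    // Late-bound, through reflection.
    object lateBound = typeof(string)
        .GetProperty("Length")
        .GetValue(s, null);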

Conclusions

If you read this cribsheet in search of ‘golden’ performance rules, then prepare to be disappointed. The most important golden rule is that there aren’t many of them. Tips that make sense in one version of .NET are often irrelevant in another. The .NET framework is being subjected to an enormous effort to improve its performance, due to competitive pressure from Java. Many of the snippets of advice in this cribsheet may not be relevant or true for your particular application. You need to be suspicious of advice and should, instead, measure, profile, test, retest under a number of different test loads and settings.

Optimisation is time-consuming and unexciting. By continuously profiling and measuring, you can target your effort only where it is needed, avoid a lot of unnecessary refactoring, and keep your code easier for others to read.

Essential tools

  • Perfmon
  • CLR Profiler
  • Reflector.net. Lutz Roeder’s Reflector is a class browser, explorer, analyzer and documentation viewer for .NET. Reflector allows you to easily view, navigate, search, decompile and analyze .NET assemblies in C#, Visual Basic and IL. It is best thought of as a sort of ILDASM++. There is an excellent “Call Tree” that can be used to drill down into IL methods to see what other methods they call.
  • ANTS Profiler. This code and memory profiler identifies performance bottlenecks in any .NET application. It provides you with quantitative data, such as line-level code timings and hit counts, allowing you to go straight to the line of code responsible for performance inefficiencies, and optimize efficiently. (It’s so much easier to make an algorithm run more efficiently when you know exactly where to focus your performance-boosting work.)
  • SQL Data Generator. This is ideal for checking code scalability during unit-testing. See ‘Using SQL Data Generator with your Unit Tests’ for step-by-step instructions.

Handy References:

Newsgroup

  • microsoft.public.dotnet.framework.performance