Performance Considerations: Binary Serialization Efficiencies

We were meeting with a client recently who was experiencing major performance issues with one of their key applications. Their staff had identified disk I/O as the culprit – there was simply a ton of data to read from and write to disk, and it was creating a huge bottleneck. Understand that there was no database involved – the analysis process consisted of a series of small command-line applications, each of which would take a CSV file as input, massage the data, and output a CSV file that would then serve as the input for the next command-line application. Although each individual app was relatively small, the entire system was quite large when you put them all together.

Most of our discussions revolved around bringing in higher-speed single drives or implementing a RAID array to help with I/O performance, because they were looking for a quicker fix than rebuilding the application – but I started to think about how to increase application performance without a major rewrite, because that's way more fun to think about than a hardware solution (to me, at least). Since their data was coming in as CSV files, I figured they were reading the numbers as strings, converting those numbers to integers, and storing them in some sort of data structure. So I began wondering what kind of performance gain they could get if the data was read in once and then serialized back and forth using .NET binary serialization.
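
To make that concrete, the two read paths might look something like this – a minimal sketch, not the client's actual code; the class and method names are placeholders, and I'm assuming the values are plain ints:

```csharp
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

static class ReadPaths
{
    // What I suspected the client's apps were doing on every run:
    // read the CSV as text, split it, and parse every field back into a number.
    public static int[] ReadCsv(string path)
    {
        string[] fields = File.ReadAllText(path).Split(',');
        int[] values = new int[fields.Length];
        for (int i = 0; i < fields.Length; i++)
            values[i] = int.Parse(fields[i]);
        return values;
    }

    // The alternative: deserialize the array straight from a file written
    // earlier with BinaryFormatter, with no splitting or parsing involved.
    public static int[] ReadBinary(string path)
    {
        using (FileStream stream = File.OpenRead(path))
            return (int[])new BinaryFormatter().Deserialize(stream);
    }
}
```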

My test consisted of generating byte arrays of various lengths and writing each one out twice: once as a CSV file and once as a binary file produced by the .NET binary serializer. I then read each file back into memory 25,000 times and calculated the operations per second for each method, and I also recorded the file size (in bytes). My first run only included reading the text into a string array – it did not include parsing the strings into numbers. A sketch of the test harness and the findings follow.
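
If you want to reproduce this, a harness along the following lines captures the idea. It is a sketch rather than the exact code: the random seed, file names, and helper names are placeholders, and only the 25,000-read loop and the byte-array data come from the description above.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Runtime.Serialization.Formatters.Binary;

class SerializationBenchmark
{
    const int Reads = 25000;

    static void Main()
    {
        Console.WriteLine("Items\tSize (Text)\tSize (Binary)\tOps/Sec (Text)\tOps/Sec (Binary)");

        foreach (int items in new[] { 100, 1000, 10000 })
        {
            // Generate the test data: a byte array of the requested length.
            byte[] data = new byte[items];
            new Random(42).NextBytes(data);

            // Write the same data out as comma-separated text and as a binary-serialized file.
            File.WriteAllText("data.csv", string.Join(",", data.Select(b => b.ToString()).ToArray()));
            using (FileStream stream = File.Create("data.bin"))
                new BinaryFormatter().Serialize(stream, data);

            // Read each file back repeatedly and convert the elapsed time into operations per second.
            double textOps = Reads / Time(() => File.ReadAllText("data.csv").Split(','));
            double binaryOps = Reads / Time(() =>
            {
                using (FileStream stream = File.OpenRead("data.bin"))
                    return (byte[])new BinaryFormatter().Deserialize(stream);
            });

            Console.WriteLine("{0}\t{1}\t{2}\t{3:F2}\t{4:F2}", items,
                new FileInfo("data.csv").Length, new FileInfo("data.bin").Length,
                textOps, binaryOps);
        }
    }

    // Runs the supplied read operation Reads times and returns the total elapsed seconds.
    static double Time(Func<object> read)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < Reads; i++)
            read();
        return watch.Elapsed.TotalSeconds;
    }
}
```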

Items Size (Text) Size (Binary) Ops/Sec (Text) Ops/Sec (Binary)
100 359 128 14.03 9.07
200 710 228 12.54 11.07
300 1071 328 10.37 10.92
400 1421 428 9.69 11.06
500 1773 528 8.62 10.90
600 2129 628 7.64 10.93
700 2484 728 7.01 11.17
800 2882 828 6.51 11.13
900 3218 928 6.15 10.91
1000 3592 1028 6.04 10.85
2000 7154 2028 3.61 10.83
3000 10719 3028 2.54 10.65
4000 14305 4028 1.99 10.49
5000 17793 5028 1.55 9.53
6000 21353 6028 1.31 8.16
7000 25046 7028 1.14 8.25
8000 28557 8028 0.97 9.24
9000 32083 9028 0.89 8.04
10000 35709 10028 0.71 9.28
11000 39221 11028 0.66 9.12
12000 42789 12028 0.61 9.19
13000 46389 13028 0.57 8.85
14000 49953 14028 0.53 9.07
15000 53482 15028 0.50 8.99

As you can see, the text import quickly degrades – I graphed it out, and the throughput falls off steeply, roughly in inverse proportion to the item count at the larger sizes. File sizes follow a linear trend, with the binary file about 30% of the size of the text file: the binary representation is one byte per item plus a fixed header of about 28 bytes, while the text representation averages roughly 3.6 bytes per item. For smaller item counts (under 300), it looks like the text import works faster than the binary import.

Next, I tried to see what the performance would look like when each text element is parsed into a byte. The only change on the text side is the parse step, sketched below; the results follow.
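
Again, this is just a sketch with placeholder names, not the exact code:

```csharp
using System.IO;

static class CsvReader
{
    // Second run: same read and split as before, but each field is now parsed into a byte.
    public static byte[] ReadAndParse(string path)
    {
        string[] fields = File.ReadAllText(path).Split(',');
        byte[] values = new byte[fields.Length];
        for (int i = 0; i < fields.Length; i++)
            values[i] = byte.Parse(fields[i]);
        return values;
    }
}
```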

Items Size (Text) Size (Binary) Ops/Sec (Text) Ops/Sec (Binary)
100 359 128 11.20 11.06
200 719 228 8.55 10.94
300 1065 328 6.67 11.02
400 1434 428 5.60 11.03
500 1798 528 4.84 11.00
600 2153 628 4.21 11.07
700 2512 728 3.76 11.00
800 2864 828 3.41 10.96
900 3210 928 3.12 10.95
1000 3554 1028 2.89 10.95
2000 7176 2028 1.61 10.78
3000 10761 3028 1.11 10.64
4000 14253 4028 0.86 10.14
5000 17820 5028 0.68 9.61
6000 21340 6028 0.57 7.87
7000 24936 7028 0.50 9.45
8000 28528 8028 0.43 8.08
9000 32084 9028 0.38 8.03
10000 35638 10028 0.33 8.09
11000 39234 11028 0.30 9.31
12000 42886 12028 0.28 9.35
13000 46377 13028 0.26 9.24
14000 49991 14028 0.24 9.14
15000 53499 15028 0.23 8.99

When the parsing step is included, the text-based approach is only faster at around 100 items, and it tails off even more dramatically. I would venture a guess that if I were deserializing a more complex object instead of a primitive type, the cost of conversion would be even greater. The binary approach, on the other hand, holds fairly steady regardless of the number of items it is dealing with (across both tests, even).

Just thought it was kind of interesting if you were ever wondering about binary serialization performance statistics (albeit only for byte arrays).