We were meeting recently with a client who was experiencing major performance problems in one of their key applications. Their staff had identified it as a disk performance issue: there was simply a ton of data to read from and write to disk, and it was causing a huge bottleneck. Understand that there was no database involved – the analysis process consisted of a series of small command-line applications that would take a CSV file as input, massage the data, and output a CSV file that would then be used as the input for another command-line application. Although each individual app was relatively small, the entire system was quite large when you put them all together.
Most of our discussions revolved around bringing in higher-speed single drives or implementing a RAID array to help with I/O performance, because they were looking for a quicker fix than rebuilding the application – but I started to think about how to increase application performance without a major rewrite, because that is way more fun to think about than a hardware solution (to me, at least). Since their data was coming in as CSV files, I figured they were reading the numbers as strings, converting those numbers to integers, and storing them in a data structure of some sort. So I began wondering what kind of performance gain they could get if the data was read in once and then serialized back and forth using .NET binary serialization.
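The original system was .NET; as a rough sketch of the idea in Python (with the `struct` module standing in for the .NET binary serializer, and the function names being purely illustrative), the two round-trips look like this:

```python
import struct

def to_csv_bytes(values):
    # Text form: numbers rendered as comma-separated decimal strings.
    return ",".join(str(v) for v in values).encode("ascii")

def from_csv_bytes(data):
    # Text parse: split the string, then convert each field to an int.
    return [int(s) for s in data.decode("ascii").split(",")]

def to_binary_bytes(values):
    # Binary form: each value packed as a fixed-width unsigned byte.
    return struct.pack(f"{len(values)}B", *values)

def from_binary_bytes(data):
    # Binary parse: one bulk unpack, no per-field string conversion.
    return list(struct.unpack(f"{len(data)}B", data))

values = [7, 42, 255, 0]
assert from_csv_bytes(to_csv_bytes(values)) == values
assert from_binary_bytes(to_binary_bytes(values)) == values
```

The binary form is both smaller (one byte per value instead of up to four characters) and cheaper to parse, which is the tradeoff the benchmarks below measure.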
My test consisted of generating byte arrays of various lengths and writing each one out twice: once as a CSV file and once as its binary representation via the .NET binary serializer. Then I read each file back into memory 25,000 times and calculated the operations per second for each method. I also recorded the file sizes (in bytes). My first run only included reading the text into a string array – it did not include any casting from string to int. The findings are as follows:
Items | Size (Text) | Size (Binary) | Ops/Sec (Text) | Ops/Sec (Binary) |
100 | 359 | 128 | 14.03 | 9.07 |
200 | 710 | 228 | 12.54 | 11.07 |
300 | 1071 | 328 | 10.37 | 10.92 |
400 | 1421 | 428 | 9.69 | 11.06 |
500 | 1773 | 528 | 8.62 | 10.90 |
600 | 2129 | 628 | 7.64 | 10.93 |
700 | 2484 | 728 | 7.01 | 11.17 |
800 | 2882 | 828 | 6.51 | 11.13 |
900 | 3218 | 928 | 6.15 | 10.91 |
1000 | 3592 | 1028 | 6.04 | 10.85 |
2000 | 7154 | 2028 | 3.61 | 10.83 |
3000 | 10719 | 3028 | 2.54 | 10.65 |
4000 | 14305 | 4028 | 1.99 | 10.49 |
5000 | 17793 | 5028 | 1.55 | 9.53 |
6000 | 21353 | 6028 | 1.31 | 8.16 |
7000 | 25046 | 7028 | 1.14 | 8.25 |
8000 | 28557 | 8028 | 0.97 | 9.24 |
9000 | 32083 | 9028 | 0.89 | 8.04 |
10000 | 35709 | 10028 | 0.71 | 9.28 |
11000 | 39221 | 11028 | 0.66 | 9.12 |
12000 | 42789 | 12028 | 0.61 | 9.19 |
13000 | 46389 | 13028 | 0.57 | 8.85 |
14000 | 49953 | 14028 | 0.53 | 9.07 |
15000 | 53482 | 15028 | 0.50 | 8.99 |
As you can see, text import throughput degrades quickly – when I graphed it out, it falls off roughly in inverse proportion to the number of items. File sizes follow a linear trend, with the binary file about 30% the size of the text file at larger item counts (the binary size is just the raw payload plus a fixed 28 bytes of serialization overhead). For smaller numbers of items (under 300), the text import actually works faster than the binary import.
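The benchmark harness itself wasn't shown; a minimal Python analogue of the methodology above (write each representation to a file once, then time repeated reads and report operations per second – the file names and iteration count here are placeholders, not the original 25,000-read .NET setup) might look like:

```python
import os
import struct
import tempfile
import time

def benchmark_reads(path, read_fn, iterations=1000):
    # Time `iterations` full read-and-parse passes over the file
    # and return the resulting operations per second.
    start = time.perf_counter()
    for _ in range(iterations):
        with open(path, "rb") as f:
            read_fn(f.read())
    elapsed = time.perf_counter() - start
    return iterations / elapsed

# Generate the test payload: a sequence of byte-sized values.
items = 1000
values = [i % 256 for i in range(items)]

with tempfile.TemporaryDirectory() as d:
    text_path = os.path.join(d, "data.csv")
    bin_path = os.path.join(d, "data.bin")

    # Write the same data once as text and once as packed binary.
    with open(text_path, "wb") as f:
        f.write(",".join(str(v) for v in values).encode("ascii"))
    with open(bin_path, "wb") as f:
        f.write(struct.pack(f"{items}B", *values))

    text_ops = benchmark_reads(text_path, lambda b: b.decode().split(","))
    bin_ops = benchmark_reads(bin_path, lambda b: struct.unpack(f"{items}B", b))
    print(f"text: {text_ops:.1f} ops/sec, binary: {bin_ops:.1f} ops/sec")
```

The absolute numbers will differ wildly from the tables above (different language, machine, and era), but the shape of the comparison is the same.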
Next, I tried to see what the performance would look like when parsing each text element into a byte:
Items | Size (Text) | Size (Binary) | Ops/Sec (Text) | Ops/Sec (Binary) |
100 | 359 | 128 | 11.20 | 11.06 |
200 | 719 | 228 | 8.55 | 10.94 |
300 | 1065 | 328 | 6.67 | 11.02 |
400 | 1434 | 428 | 5.60 | 11.03 |
500 | 1798 | 528 | 4.84 | 11.00 |
600 | 2153 | 628 | 4.21 | 11.07 |
700 | 2512 | 728 | 3.76 | 11.00 |
800 | 2864 | 828 | 3.41 | 10.96 |
900 | 3210 | 928 | 3.12 | 10.95 |
1000 | 3554 | 1028 | 2.89 | 10.95 |
2000 | 7176 | 2028 | 1.61 | 10.78 |
3000 | 10761 | 3028 | 1.11 | 10.64 |
4000 | 14253 | 4028 | 0.86 | 10.14 |
5000 | 17820 | 5028 | 0.68 | 9.61 |
6000 | 21340 | 6028 | 0.57 | 7.87 |
7000 | 24936 | 7028 | 0.50 | 9.45 |
8000 | 28528 | 8028 | 0.43 | 8.08 |
9000 | 32084 | 9028 | 0.38 | 8.03 |
10000 | 35638 | 10028 | 0.33 | 8.09 |
11000 | 39234 | 11028 | 0.30 | 9.31 |
12000 | 42886 | 12028 | 0.28 | 9.35 |
13000 | 46377 | 13028 | 0.26 | 9.24 |
14000 | 49991 | 14028 | 0.24 | 9.14 |
15000 | 53499 | 15028 | 0.23 | 8.99 |
When type conversion is included, the text-based approach is only faster at around 100 items, and it tails off even more dramatically. I would venture a guess that if I were deserializing a more complex object instead of a primitive type, the cost of the conversion would be even greater. The binary approach, on the other hand, holds fairly steady regardless of the number of items it is dealing with (through both tests, even).
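To see where the text path loses its time once conversion is involved, one can isolate just the in-memory parsing step. This Python sketch (item count and repetition count are illustrative, not the original .NET parameters) times per-element string-to-int conversion against a single bulk binary unpack:

```python
import struct
import timeit

items = 5000
values = [i % 256 for i in range(items)]
csv_data = ",".join(str(v) for v in values)
bin_data = struct.pack(f"{items}B", *values)

# Text path: one string-to-int conversion per element.
text_time = timeit.timeit(
    lambda: [int(s) for s in csv_data.split(",")], number=200)

# Binary path: one bulk unpack for the whole array.
bin_time = timeit.timeit(
    lambda: struct.unpack(f"{items}B", bin_data), number=200)

print(f"text parse: {text_time:.3f}s, binary unpack: {bin_time:.3f}s")
```

The per-element conversions dominate the text path, which is consistent with the second table tailing off faster than the first.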
Just thought it was kind of interesting, in case you were ever wondering about binary serialization performance statistics (albeit only for integer arrays).