Optimizing the loading and parsing of files is often given relatively little attention compared to other parts of a program. A common justification I have seen for neglecting loading speed is that hard drives are slow relative to the processing power of a CPU, so it is not worthwhile to design efficient file loading code. The assumption is that the CPU will be constantly stalled, polling the operating system for new data. In this article I will show that hard drives are faster than they are often given credit for, and that significant performance improvements are possible even when reading a very simple file format.
First, let's compare the relative performance between hard drives and CPUs. Mechanical hard drives have sustained sequential read rates that range from 170 MB/s at the outer edges of the platters to 80 MB/s on the inner edges of the platters (Anandtech review). Solid state hard drives are even faster, being able to sustain over 500 MB/s (Anandtech review). Meanwhile a high-end Core i7 processor has a sustained clock speed of 3.5 GHz (Anandtech review), and my own processor is overclocked to 4.1 GHz. At best, we have 51 clock cycles per byte when reading from a slow mechanical disk, and, at worst, 8 clock cycles per byte when reading from a fast SSD before the CPU becomes the bottleneck.
To understand how efficient even the best-case budget of 51 cycles per byte requires us to be, we need some context on the cost of typical operations. According to one source, a branch mis-prediction on a Core i7 costs 18 cycles, and every $for$ loop will have at least one mis-prediction when either entering or exiting the loop. The latencies of memory accesses vary wildly depending on cache locality. A table from the Intel developer forums lists the latencies for a Core i7 Xeon 5500: the best case of an L1 cache hit takes 4 cycles, while going all the way to local DRAM takes around 60 ns, which is about 246 cycles at 4.1 GHz. The remote L3 and DRAM entries apply to multi-processor systems, whereas my computer has a single multi-core processor. Note that DRAM latency can be multiples of 60 ns when a request is not first in the queue.
These data show that, in order to reach peak performance, our parser requires tight loops that primarily access local memory that is in L1 or L2 cache. I will focus on the relatively simple task of loading a PPM image, which essentially consists of a long list of integers, preceded by a simple header. I will modify an initial working version of the image reader in a sequence of stages, where each version remains fully functional. I also provide all of the source code and test images that I used to measure performance here. Despite the simplicity of the format, I will achieve a 45X increase in speed between the first version of the program and the fastest version. Relatively complex file formats like XML will be even more CPU limited and could potentially benefit more from algorithmic improvements.
Our goal is to load an image stored on disk into the $ImageRGB$ data structure. Each pixel is a triple of red, green, and blue (RGB) values, which represent a color. An RGB image is then defined by the width and height of the image as well as the pixel colors stored in an array.
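In code, the data structure can be as simple as the sketch below; the exact field names and types in the version shipped with the article may differ.

```cpp
#include <vector>

// A single pixel: one byte each for red, green, and blue.
// (Sketch of the structure described above; field names are illustrative.)
struct RGB
{
    unsigned char r, g, b;
};

// An image is its dimensions plus a flat array of pixels, stored row by row.
struct ImageRGB
{
    int w, h;
    std::vector<RGB> pix;
};
```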
There are many image file formats of varying complexity, and the PPM format is among the simplest. Because the PPM file format is extremely simple, it should be easy to understand my modifications in the context of the whole program. The full format specification is described in the Netpbm documentation. An important consideration is that an image with the same ".ppm" extension can be one of two types, called P3 and P6. Both types have an ASCII header, but P6 stores the pixel colors in binary, whereas P3 stores them in ASCII. The binary pixels are extremely easy to read, so I will focus on optimized parsing of the P3 format.
A slightly annoying feature of the PPM format is that comments can be placed anywhere within the header. A comment is denoted by a hash '#' character, after which the remainder of the line should be ignored. I use the helper function $eat\_comment$ shown below to ignore comments as they appear. The peek function is used to check if the next character starts a comment without actually consuming the character, as that character may be the leading digit of a number like the width or height of the image.
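A stream-based version of the helper might look like the sketch below, assuming the $std::istream$ interface used in this first version of the reader; the code shipped with the article is the definitive version.

```cpp
#include <fstream>

// Skip any comments directly in front of the read position.
// peek() lets us test for '#' without consuming a character that might
// be the first digit of a number such as the image width.
// (A sketch; details may differ from the article's code.)
void eat_comment(std::istream &f)
{
    char linebuf[1024];
    for (;;)
    {
        int c = f.peek();
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
        {
            f.get();                               // consume white space
            continue;
        }
        if (c == '#')
        {
            f.getline(linebuf, sizeof(linebuf));   // discard the comment line
            continue;
        }
        break;                                     // next character is real data
    }
}
```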
The structure of the loading function is to open the file, read the header, check the information read from the header for errors, load the pixel colors as either binary or ASCII, and then close the file. The logic within the loading function shown below is fairly simple and uses the standard C++ library. The header is tiny compared to the rest of the image data, so almost all of the processing time is spent looping over the pixels and reading their values using the $>>$ operator. Surely the standard C++ library is efficient, right?
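A condensed sketch of that structure is shown below, reusing the $ImageRGB$ and $eat\_comment$ sketches from above; error handling is abbreviated compared to the real code, and the exact details are my assumptions.

```cpp
#include <fstream>
#include <string>

// Load a PPM image in either P3 (ASCII) or P6 (binary) form.
// Returns false on failure.  (Sketch of the loader described above.)
bool load_ppm(ImageRGB &img, const std::string &name)
{
    std::ifstream f(name.c_str(), std::ios::binary);
    if (!f.is_open())
        return false;

    // Read the header: magic number, width, height, maximum color value.
    std::string magic;
    int max_col = 0;
    eat_comment(f);
    f >> magic;
    eat_comment(f);
    f >> img.w;
    eat_comment(f);
    f >> img.h;
    eat_comment(f);
    f >> max_col;

    // Basic sanity checks on the header.
    if ((magic != "P3" && magic != "P6") ||
        img.w <= 0 || img.h <= 0 || max_col <= 0 || max_col > 255)
        return false;

    img.pix.resize(img.w * img.h);

    if (magic == "P6")
    {
        // Binary pixels: skip the single white space after the header and
        // read the raw bytes straight into the array (RGB is 3 packed bytes).
        f.get();
        f.read(reinterpret_cast<char *>(&img.pix[0]), img.pix.size() * 3);
    }
    else
    {
        // ASCII pixels: one integer per color channel, read with operator>>.
        for (size_t i = 0; i < img.pix.size(); i++)
        {
            int r, g, b;
            f >> r >> g >> b;
            img.pix[i].r = r;
            img.pix[i].g = g;
            img.pix[i].b = b;
        }
    }
    f.close();
    return true;
}
```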
After verifying that the code is correct by loading a couple of test images, some people would call it a day. I will press on though, first stopping to measure the loading times for different versions of the handsome visage shown below. There are three sizes of the image, stored in both the P3 and P6 formats. The sizes on disk are P3 (56.4 MB, 6.9 MB, and 26.8 KB) and P6 (16.1 MB, 2.0 MB, and 7.8 KB). The images are nearly constant in color, so they compress well, and I have included them with the source code.
My measured loading times in seconds for the optimized release build in MSVC are shown below. My processor is an Intel Core i7 2600K that I have overclocked to run at 4.1 GHz. I should also clarify that for all of the tests I read the file once before timing subsequent reads, so it is likely that the files are not being read from disk, but that the OS (Windows 8.1) is caching the files in system memory. As I have no direct control over file caching, I try to at least be consistent between runs.
|         | P3 Big | P3 Medium | P3 Small | P6 Big  | P6 Medium | P6 Small   |
| Stage 1 | 6.337  | 0.783     | 0.00314  | 0.00927 | 0.000811  | 3.358e-005 |
The relative times are somewhat difficult to interpret because the file sizes are very different between images. However, taking 6.3 seconds to load a 56 MB file works out to less than 9 MB/s, which is clearly slower than a hard drive. The average read speed of a disk drive is around 100 MB/s, which would read 56 MB in 0.56 seconds. This back-of-the-envelope calculation suggests that the code needs to be approximately 11X faster to keep up with a disk drive, and over 50X faster to match the rate of an SSD that reads 500 MB/s.
For performance measures in the remainder of the article I will report times in cycles per byte to give a consistent measure across file sizes. For each test I read the big file 5 times, the medium file 10 times, and the small file 30 times. After the mean number of cycles per byte, I report the 95% confidence interval. From these times, it is clear that there is a lot of room for improvement, as our target is 8 cycles per byte!
|         | P3 Big          | P3 Medium       | P3 Small        | P6 Big        | P6 Medium     | P6 Small       |
| Stage 1 | 461.609 ± 0.120 | 462.346 ± 0.262 | 479.017 ± 0.186 | 2.440 ± 0.006 | 1.717 ± 0.053 | 18.279 ± 0.216 |
My first observation is that the standard C++ library's input and output classes are slow (and inconvenient to use). I will therefore replace all of the C++ file input operations with their C equivalents. The code translation is pretty direct. The only functionality I used that is missing in C is peeking a character ahead in the stream, which is easily emulated by getting and then ungetting a character.
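The emulation can be as small as the helper below; this is my own illustrative sketch, not necessarily the helper in the article's code.

```cpp
#include <stdio.h>

// C has no peek(), but the same effect is obtained by reading one
// character and immediately pushing it back onto the stream.
int peek(FILE *f)
{
    int c = fgetc(f);
    ungetc(c, f);
    return c;
}
```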
Essentially I replace the $ifstream$ constructor with $fopen$, $f.close$ with $fclose$, $f.getline$ with $fgets$, and $>>$ with $fscanf$. The only difference worth noting is that $fscanf$ on a string is less safe than reading a string with $ifstream$, because I now read into a fixed-size buffer, but I am not overly concerned with security for this optimization project. The C++ version also used a fixed buffer when reading the comment line, which could likewise fail to consume the whole line and confuse the parser.
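To give a feel for the translation, here is a rough sketch of the C-flavored comment eater and pixel loop, reusing the hypothetical $peek$ helper above; error checking is omitted, and the code shipped with the article is the definitive version.

```cpp
#include <stdio.h>

// C version of the comment eater: skip white space, then discard any
// lines that start with '#'.
void eat_comment(FILE *f)
{
    char linebuf[1024];
    for (;;)
    {
        int c = peek(f);
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r') { fgetc(f); continue; }
        if (c == '#') { fgets(linebuf, sizeof(linebuf), f); continue; }
        break;
    }
}

// The P3 pixel loop translates almost one-to-one: operator>> becomes fscanf.
// (Return values of fscanf are ignored for brevity.)
void read_pixels_p3(FILE *f, ImageRGB &img)
{
    for (size_t i = 0; i < img.pix.size(); i++)
    {
        int r, g, b;
        fscanf(f, "%d %d %d", &r, &g, &b);
        img.pix[i].r = r;
        img.pix[i].g = g;
        img.pix[i].b = b;
    }
}
```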
The question is whether all that work of switching standard libraries actually affected the speed of our program. I'm not sure if I should be happy or sad to report that, yes, the C++ standard library really is bad compared to the C library. Hooray for progress: we are now down from 461 cycles per byte to 133 cycles per byte for the big image stored in P3 ASCII. The times for reading the big binary chunks of data in the P6 format are also much lower than before.
|         | P3 Big          | P3 Medium       | P3 Small        | P6 Big        | P6 Medium     | P6 Small       |
| Stage 1 | 461.609 ± 0.120 | 462.346 ± 0.262 | 479.017 ± 0.186 | 2.440 ± 0.006 | 1.717 ± 0.053 | 18.279 ± 0.216 |
| Stage 2 | 133.137 ± 0.111 | 133.296 ± 0.113 | 138.546 ± 0.088 | 1.001 ± 0.011 | 0.330 ± 0.033 | 16.497 ± 0.441 |
My concern at this point is that our performance depends heavily on $fscanf$ being efficient. It is unlikely that $fscanf$ is fast, because it is written to be general purpose. Before $fscanf$ can even attempt to parse the input file stream, it needs to parse the format string passed to it, which is wasteful. My approach will be to load the entire file into memory, and then replace $fscanf$ with parsing code that is specialized to our particular file format.
For the most part, I will treat parsing of a file loaded into memory in the same way that I treated files that were streamed from disk. There is a reading head that I call $ptr$ that advances a byte at a time, and a pointer to the end of the file, $end$, which I must not advance past. For convenience I define the data type of the reading head as a pointer to a byte.
Using the file loading function from earlier we can now define the starting and stopping points for our file loaded into memory by the code below.
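The sketch below shows what this can look like; the helper name $read\_whole\_file$ and the exact signatures are my assumptions, but the idea of reading the whole file with $fread$ and padding it with one trailing white space character (needed later by $atoi$) matches the description in the article.

```cpp
#include <stdio.h>
#include <vector>

typedef unsigned char byte;   // the reading head is a pointer to bytes

// Read the whole file into memory, padded with one trailing white space
// character so that atoi() is guaranteed to stop inside the buffer.
// (Illustrative helper; name and details are assumptions.)
bool read_whole_file(std::vector<byte> &data, const char *name)
{
    FILE *f = fopen(name, "rb");
    if (!f)
        return false;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    data.resize(size + 1);
    size_t got = fread(&data[0], 1, size, f);
    fclose(f);
    if (got != (size_t)size)
        return false;
    data[size] = ' ';             // padding for atoi()
    return true;
}

// The starting and stopping points for the in-memory parser.
bool load_ppm_mem(ImageRGB &img, const char *name)
{
    std::vector<byte> data;
    if (!read_whole_file(data, name))
        return false;

    const byte *ptr = &data[0];
    const byte *end = ptr + data.size();

    // ... parse the header and pixels from the range [ptr, end) ...
    return true;
}
```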
Writing a custom parser sounds daunting, but is not so bad if we consider the individual parts of the file we need to process. In our case, the file can be considered to consist of three types of contiguous data: white space, integers, and comments. The format descriptor of P3 or P6 can be thought of as a special case of an integer. White space is the simplest of the contiguous regions. It is simply the stuff that separates tokens in the file. The $eat\_white$ function will advance until a non-white space character is read. In typical ASCII files there are four white space characters: new line, carriage return, tab, and space. Every parsing routine will also terminate when $ptr$ reaches $end$.
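A minimal sketch of $eat\_white$, using the byte-pointer convention from above (the pass-by-reference signature is my choice):

```cpp
// Advance past white space: space, tab, new line, and carriage return.
// Like every parsing routine, it also stops when ptr reaches end.
void eat_white(const byte *&ptr, const byte *end)
{
    while (ptr < end &&
           (*ptr == ' ' || *ptr == '\t' || *ptr == '\n' || *ptr == '\r'))
        ptr++;
}
```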
Comments are similar to white space in that they need to be consumed without affecting the rest of the file processing. We will remove any comments directly in front of the reading head with the $eat\_comment$ function, which terminates as soon as it finds a token that is not a comment. If a token starts with #, then the remainder of the line should be considered a 'comment token' and is removed by the $eat\_line$ helper function.
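Sketches of $eat\_line$ and the in-memory $eat\_comment$ in the same style:

```cpp
// Discard the rest of the current line (the 'comment token').
void eat_line(const byte *&ptr, const byte *end)
{
    while (ptr < end && *ptr != '\n')
        ptr++;
}

// Remove any comments directly in front of the reading head.  Terminates
// as soon as the next token does not start with '#'.
void eat_comment(const byte *&ptr, const byte *end)
{
    for (;;)
    {
        eat_white(ptr, end);
        if (ptr < end && *ptr == '#')
            eat_line(ptr, end);
        else
            break;
    }
}
```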
Finally, we have the $get\_int$ function, which removes any white space before the integer token, then passes the character pointer to the $atoi$ function in the standard library. The $atoi$ function has a useful property that it stops interpreting the C string as an integer as soon as it sees white space or a null character, so we don't need to extract out the token first and pass that to $atoi$. Instead, we will more efficiently parse the integer in place. This trick does require that we pad the file we load into memory with a terminating white space character though, because the $atoi$ function is not guaranteed to terminate at the end of the file otherwise, which could cause a crash (segfault). Once the token is read by $atoi$ I call the helper function $eat\_token$ to remove the integer we just read. Discarding a token is essentially the exact inverse of eating white space.
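Sketches of $eat\_token$ and the $atoi$-based $get\_int$ follow; again, the signatures are my assumptions.

```cpp
#include <stdlib.h>

// Skip the token in front of the reading head: the inverse of eat_white.
void eat_token(const byte *&ptr, const byte *end)
{
    while (ptr < end &&
           *ptr != ' ' && *ptr != '\t' && *ptr != '\n' && *ptr != '\r')
        ptr++;
}

// Parse one integer in place with atoi(), then discard the token.
// Relies on the trailing white space padding so atoi() cannot run past
// the end of the buffer.
int get_int(const byte *&ptr, const byte *end)
{
    eat_white(ptr, end);
    int val = atoi(reinterpret_cast<const char *>(ptr));
    eat_token(ptr, end);
    return val;
}
```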
It is interesting to see that although we are now using much lower-level parsing routines, the code that calls those routines is no more complex, and may in fact be marginally easier to read because we have removed some cumbersome temporary variables.
The question is now, of course, whether writing our own fancy parsing code gave any benefit over the tried and true parsing code of $fscanf$ in the standard library. The answer is a resounding YES! Parsing P3 files is 3.6X faster relative to our previous optimized version of the code. It is slightly annoying that our P6 code now takes an extra cycle per byte to call $memcpy$ compared to reading the binary data straight into our data structure, but at those speeds the hard drive really is the bottleneck regardless.
|         | P3 Big          | P3 Medium       | P3 Small        | P6 Big        | P6 Medium     | P6 Small       |
| Stage 1 | 461.609 ± 0.120 | 462.346 ± 0.262 | 479.017 ± 0.186 | 2.440 ± 0.006 | 1.717 ± 0.053 | 18.279 ± 0.216 |
| Stage 2 | 133.137 ± 0.111 | 133.296 ± 0.113 | 138.546 ± 0.088 | 1.001 ± 0.011 | 0.330 ± 0.033 | 16.497 ± 0.441 |
| Stage 3 | 37.103 ± 0.036  | 37.131 ± 0.076  | 41.093 ± 0.075  | 2.651 ± 0.010 | 1.660 ± 0.046 | 17.600 ± 0.353 |
There is only one function from the C standard library remaining in our hot code path. If we have learned anything up to this point, it is that standard library functions can often be made faster when specialized to a restricted problem. In our case, we do not even need to parse negative integers, which means the full integer parsing code takes only a few lines.
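A sketch of such a specialized parser is below; it replaces the $atoi$-based $get\_int$ from the previous stage and handles only the non-negative integers that appear in PPM files.

```cpp
// Parse a non-negative integer directly from the buffer, one byte at a
// time, with explicit bounds checking.  No trailing padding is needed.
int get_int(const byte *&ptr, const byte *end)
{
    eat_white(ptr, end);
    int val = 0;
    while (ptr < end && *ptr >= '0' && *ptr <= '9')
    {
        val = val * 10 + (*ptr - '0');
        ptr++;
    }
    return val;
}
```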
It is interesting to note that regardless of how efficient $atoi$ is, the new code will be more efficient, because before, the memory of a token was touched twice: once by $atoi$ and again by $eat\_token$. Now the memory is touched only once. The integer conversion code is also extremely simple, and certainly simpler than $atoi$. A more subtle observation is that the new code does proper bounds checking that was not present with $atoi$, so $get\_int$ cannot read past the end of the file. This means we no longer need to pad the end of the file with a white space character, which will soon become important to our optimizations.
Again, the eternal question: how fast is the new code? The data shows that by replacing $atoi$ the loading of a P3 format file is a bit over 3X faster than our previous fastest result. We have completely replaced the file and string parsing facilities of the standard library, and gotten our code 40X faster than what we started with. Not too shabby!
|         | P3 Big          | P3 Medium       | P3 Small        | P6 Big        | P6 Medium     | P6 Small       |
| Stage 1 | 461.609 ± 0.120 | 462.346 ± 0.262 | 479.017 ± 0.186 | 2.440 ± 0.006 | 1.717 ± 0.053 | 18.279 ± 0.216 |
| Stage 2 | 133.137 ± 0.111 | 133.296 ± 0.113 | 138.546 ± 0.088 | 1.001 ± 0.011 | 0.330 ± 0.033 | 16.497 ± 0.441 |
| Stage 3 | 37.103 ± 0.036  | 37.131 ± 0.076  | 41.093 ± 0.075  | 2.651 ± 0.010 | 1.660 ± 0.046 | 17.600 ± 0.353 |
| Stage 4 | 11.777 ± 0.015  | 11.816 ± 0.007  | 15.640 ± 0.071  | 2.677 ± 0.052 | 1.651 ± 0.041 | 17.937 ± 0.431 |
We actually haven't completely replaced the standard library yet, which I will now rectify by using operating system specific facilities to directly map the file into the address space of our process. Essentially, I will ask the operating system to treat the file of the PPM image as an additional read-only page file, which allows me to skip the step of calling $fread$ to read the file into memory. Whenever I try to read a piece of the file that has not actually been read from disk yet, the CPU will generate a page fault, signalling the operating system to load the corresponding part of the file from disk as a page-sized chunk. A page is typically 4096 bytes, but we can hope that the operating system will detect our sequential access pattern and proactively read several pages ahead as a batch, because large sequential reads from a disk are important for throughput.
Mapping a file into memory rather than reading it as a single block ahead of time is important, because it potentially allows us to parse and read the file from disk in parallel. The operating system can prefetch pages from the file in a separate thread while we parse the part of the file that is already loaded in memory. If we tried doing a similar thing manually by running a separate thread that reads the file from disk, we would have to add extra logic to break the file into chunks, correctly parse across chunk boundaries, and communicate between threads. With memory mapping, the CPU itself checks every memory read against its page table and automatically generates an interrupt for the operating system when more of the file needs to be read. This is a tidy solution that works completely transparently to our code.
I wrote a couple of functions to simplify asking the operating system for a memory-mapped file, so that the original code using $fread$ (shown here as my rough sketch from earlier)
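```cpp
// Stage 3/4: read the whole file into a heap buffer first.
// (Restates the earlier illustrative sketch; details are assumptions.)
std::vector<byte> data;
if (!read_whole_file(data, name))
    return false;
const byte *ptr = &data[0];
const byte *end = ptr + data.size();
```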
is replaced by something like
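```cpp
// Stage 5: map the file into our address space instead.
// (The MMapFile fields and call signatures are my guesses; see the
// implementation sketch below.)
MMapFile mm;
if (!mmap_file_read(mm, name))
    return false;
const byte *ptr = (const byte *)mm.data;
const byte *end = ptr + mm.size;
```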
After parsing the file we now also need to unmap it:
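```cpp
// Release the mapped view and the associated handles.
mmap_file_close(mm);
```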
Because our file parsing code already treated the file as being contained entirely in memory, we do not need to change the actual parsing code at all. For completeness, I show my implementation of $MMapFile$, $mmap\_file\_read$, and $mmap\_file\_close$ below.
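A minimal sketch of such a wrapper using the Windows API follows; the field names, error handling, and other details here are illustrative assumptions rather than the exact code from the download.

```cpp
#include <windows.h>

// A small memory-mapped file wrapper for Windows (illustrative sketch).
struct MMapFile
{
    HANDLE file;       // handle from CreateFile
    HANDLE mapping;    // handle from CreateFileMapping
    const void *data;  // start of the mapped, read-only view
    size_t size;       // size of the file in bytes
};

bool mmap_file_read(MMapFile &m, const char *name)
{
    m.file = CreateFileA(name, GENERIC_READ, FILE_SHARE_READ, NULL,
                         OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (m.file == INVALID_HANDLE_VALUE)
        return false;

    LARGE_INTEGER size;
    GetFileSizeEx(m.file, &size);
    m.size = (size_t)size.QuadPart;

    m.mapping = CreateFileMappingA(m.file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (m.mapping == NULL)
    {
        CloseHandle(m.file);
        return false;
    }

    m.data = MapViewOfFile(m.mapping, FILE_MAP_READ, 0, 0, 0);
    if (m.data == NULL)
    {
        CloseHandle(m.mapping);
        CloseHandle(m.file);
        return false;
    }
    return true;
}

void mmap_file_close(MMapFile &m)
{
    UnmapViewOfFile(m.data);
    CloseHandle(m.mapping);
    CloseHandle(m.file);
}
```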
If we look at our throughput, we managed to shave off about one clock cycle per byte across the board, including for the binary P6 format. In theory, our code can now read an ASCII encoded file format at 409 MB/s, which nearly matches the read speed of a high end SSD and is several times faster than any common disk drive. Using memory-mapped IO did not show the gain from the OS parallelizing disk reads that I had posited earlier; instead, we removed a memory copy, which takes about one cycle per byte and is also visible in the difference of the P6 times between Stage 2 and Stage 3. In order to properly evaluate the effectiveness of memory mapping, we need to read from disk.
|         | P3 Big          | P3 Medium       | P3 Small        | P6 Big        | P6 Medium     | P6 Small       |
| Stage 1 | 461.609 ± 0.120 | 462.346 ± 0.262 | 479.017 ± 0.186 | 2.440 ± 0.006 | 1.717 ± 0.053 | 18.279 ± 0.216 |
| Stage 2 | 133.137 ± 0.111 | 133.296 ± 0.113 | 138.546 ± 0.088 | 1.001 ± 0.011 | 0.330 ± 0.033 | 16.497 ± 0.441 |
| Stage 3 | 37.103 ± 0.036  | 37.131 ± 0.076  | 41.093 ± 0.075  | 2.651 ± 0.010 | 1.660 ± 0.046 | 17.600 ± 0.353 |
| Stage 4 | 11.777 ± 0.015  | 11.816 ± 0.007  | 15.640 ± 0.071  | 2.677 ± 0.052 | 1.651 ± 0.041 | 17.937 ± 0.431 |
| Stage 5 | 10.213 ± 0.021  | 10.325 ± 0.022  | 15.982 ± 0.097  | 1.270 ± 0.006 | 1.032 ± 0.005 | 18.178 ± 0.286 |
Up to now, the files that I have read have been cached in RAM, so I have been testing the CPU utilization of the parser rather than the practical disk throughput. In this section, I look at the effect that reading from a disk has on the different algorithms I have written. I used the method described in a post on StackOverflow to invalidate the file cache before loading a PPM image. Up through Windows 8, if a file is opened with buffering disabled, its cached data is discarded; after the file is closed, reopening it by any other method requires that it be read from disk again.
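A sketch of how this cache invalidation can be done with the Win32 API is below; the helper name is mine, and it relies only on the $FILE\_FLAG\_NO\_BUFFERING$ behaviour described above.

```cpp
#include <windows.h>

// Drop the OS file cache for one file by opening it with buffering
// disabled and closing it again (the StackOverflow trick mentioned
// above; behaviour verified only up through Windows 8).
void invalidate_file_cache(const char *name)
{
    HANDLE f = CreateFileA(name, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
    if (f != INVALID_HANDLE_VALUE)
        CloseHandle(f);
}
```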
We can now compare the algorithms from before with caching enabled and disabled. The tables below show cycles per byte under the same test setup that I used before, except that I read each file 30 times. I use more iterations because reading a disk has more variability in timing than reading from memory. I also added a new RAW row that shows the throughput of reading the entire file into a buffer with no other processing, which bounds the best attainable speed. The first table repeats the cached timings with the new RAW row added; the second table shows the timings when the file cache has been invalidated, so that the file is actually read from disk. In the uncached case, timings of the small and medium sized files are nearly useless because there is too much variation. My guess is that the initial seek latency dominates reading time for small files, and that the latency is amortized for large files.
|         | P3 Big          | P3 Medium       | P3 Small        | P6 Big        | P6 Medium     | P6 Small       |
| Stage 1 | 461.609 ± 0.120 | 462.346 ± 0.262 | 479.017 ± 0.186 | 2.440 ± 0.006 | 1.717 ± 0.053 | 18.279 ± 0.216 |
| Stage 2 | 133.137 ± 0.111 | 133.296 ± 0.113 | 138.546 ± 0.088 | 1.001 ± 0.011 | 0.330 ± 0.033 | 16.497 ± 0.441 |
| Stage 3 | 37.103 ± 0.036  | 37.131 ± 0.076  | 41.093 ± 0.075  | 2.651 ± 0.010 | 1.660 ± 0.046 | 17.600 ± 0.353 |
| Stage 4 | 11.777 ± 0.015  | 11.816 ± 0.007  | 15.640 ± 0.071  | 2.677 ± 0.052 | 1.651 ± 0.041 | 17.937 ± 0.431 |
| Stage 5 | 10.213 ± 0.021  | 10.325 ± 0.022  | 15.982 ± 0.097  | 1.270 ± 0.006 | 1.032 ± 0.005 | 18.178 ± 0.286 |
| RAW     | 1.993 ± 0.007   | 1.916 ± 0.016   | 5.274 ± 0.136   | 2.012 ± 0.007 | 1.178 ± 0.012 | 17.043 ± 0.276 |
|         | P3 Big        | P3 Medium     | P3 Small        | P6 Big       | P6 Medium     | P6 Small          |
| Stage 1 | 464.65 ± 0.61 | 464.17 ± 0.97 | 558.58 ± 68.12  | 26.64 ± 1.07 | 17.48 ± 3.44  | 622.26 ± 411.17   |
| Stage 2 | 137.33 ± 1.13 | 141.89 ± 2.70 | 254.20 ± 106.32 | 24.63 ± 1.70 | 14.29 ± 3.63  | 476.52 ± 357.36   |
| Stage 3 | 71.26 ± 0.56  | 51.37 ± 4.19  | 170.18 ± 100.19 | 26.33 ± 1.22 | 20.79 ± 5.12  | 1075.14 ± 1099.87 |
| Stage 4 | 45.87 ± 0.48  | 25.57 ± 4.26  | 150.09 ± 102.66 | 26.59 ± 1.33 | 29.63 ± 15.84 | 670.38 ± 455.00   |
| Stage 5 | 36.04 ± 0.87  | 23.76 ± 2.42  | 96.35 ± 81.43   | 23.77 ± 2.14 | 24.00 ± 10.09 | 706.50 ± 1051.06  |
| RAW     | 36.01 ± 0.39  | 16.51 ± 2.71  | 134.13 ± 95.82  | 23.01 ± 1.61 | 22.80 ± 9.08  | 649.85 ± 382.02   |
Reading the binary P6 format has a raw disk throughput of approximately 23 cycles per byte, which translates to about 170 MB/s. Stages 1 and 2 use the C++ and C standard libraries to read directly into the image buffer, and C is a little faster than C++. Stages 3 and 4 first read the file into a temporary buffer, which is then copied into the image buffer; copying the data adds a few cycles per byte regardless of caching. Stage 5 uses memory mapping to remove the extra memory copy and is the same speed as reading the binary data in C. Despite small differences, all of the methods achieve nearly optimal disk throughput for the binary format.
The ASCII P3 format has slightly more complicated timings. If reading the disk and parsing the data were serial operations, the expected rates would be the cached rate plus the raw 36 cycles per byte from reading the disk. Stages 1 and 2, which use the standard libraries, only slow down by a few cycles per byte, which indicates that the libraries perform IO in parallel with parsing. Stages 3 and 4 are serial, because they read the entire file before processing it, which is reflected in their timings. Finally, it appears that the OS parallelizes memory-mapped disk reads, because the parser of Stage 5 operates at the maximum achievable speed of 108 MB/s.
People often claim that disk speed is the limiting factor when reading files. Through testing and optimization, I conclude that disk speed can be the limiting factor when parsing a text file, but that in the usual case, the CPU overhead of parsing dominates. In my test, I had raw read throughputs of 108 MB/s and 170 MB/s. Using C++ to read integers from the file was 13 to 20 times slower than the disk, and even C was 4 to 6 times slower. Only through extensive optimization could I parse files 2 to 3 times faster than the disk so that the disk was actually the limiting factor.
The throughput of disks is greater than people tend to imagine, and code has to be very sparing with cycles to match disk performance. Disks are not even the highest bandwidth devices attached to a computer. Gigabit ethernet is extremely common and matches disk speeds, and datacenters now attach 10 gigabit ethernet cards to servers, which leaves very little processing time per byte sent or received. The example I used in this article of parsing natural numbers is simple, but parsing floating point numbers is more CPU intensive, and parsing complex formats like XML, JSON, or HTTP requests is more challenging still. The next time you write a parser, give serious thought to how fast it can read data.