Geolocation is sort of neat. Whenever a connection over IP is made, you know the IP address of the client. Based on that IP, it is possible to find the location of the connecting computer fairly accurately. This is because computers are issued IP addresses by an ISP, and when the ISP buys a block of IP addresses, they have to register the city and country they are based in. I found a free database, provided by MaxMind, that maps IP addresses to geographic information. The business model of MaxMind appears to be that they provide some information for free to entice you into licensing the full database. The free database has plenty of info for me, listing City, Country, Language, Latitude, and Longitude.
The fun optimization challenge comes from this benchmark page. As far as I can determine, MaxMind provides the fastest location-query API of any provider: they can answer over 2.6 million city queries per second. There are many other geolocation companies, but they tend to require you to make HTTP requests to query a database on their server, which is incredibly inefficient. MaxMind is the only company I have seen that advertises a C library coupled with a binary format that is optimized for speed. Timing info is a bit hard to find, but some other guy seemed pretty proud of getting SQL to answer 20,000 queries per second.
It is hard to judge what speed I will be able to attain. 2.6 million queries per second sounds fairly fast, but that is still taking about 1000 cycles on average for their 2.53 GHz Core i5. I think I have a good chance of beating them.
Annoyingly, the provided library does not compile on Windows, and they don't provide precompiled binaries. I should say, rather, that there is a Visual Studio 2013 project that may possibly compile on Windows, but I don't have the most recent version of Visual Studio, so that is useless to me. The source code and build system look rather specific to Unix operating systems, and just tossing the source code at Visual Studio 2012 gave tons of errors. I don't want to try to decipher how the complicated garbage of configure translates to Windows. This means I will have to port my code to Linux and do my testing on my laptop (a Surface Pro) that has Linux on it.
The MaxMind C library is open-source, so I looked a little at their code and read their helpful description of the algorithm they use. It turns out that they memory map the binary file, so their file format is exactly their internal data structure. This indicates they are taking performance seriously.
The gist of their algorithm is that they have a binary tree stored as a contiguous array at the head of the file. Each node contains an offset into the file for the left and right subtrees. This offset serves a dual purpose. If the offset is less than the number of bytes used for internal nodes in the tree, the offset is interpreted as a pointer to another internal node. If the offset is past the array of internal nodes, the pointer is interpreted as a pointer to a data record. The data records have variable size and do some tricks with unions. The tree is traversed by reading the bits in the IP address from most to least significant, and taking branches based on the value of the bit.
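To make that concrete, a search over such a structure might look roughly like the following sketch. This is my own simplification, not MaxMind's code: I assume two fixed 32-bit offsets per node, whereas the real format packs records of 24, 28, or 32 bits.

```c
#include <stdint.h>

/* Sketch of the traversal described above (not MaxMind's actual code).
   Each internal node holds two offsets; offsets below nodeCount point
   at other internal nodes, while larger offsets designate data records. */
typedef struct {
    uint32_t left;   /* branch taken when the current bit is 0 */
    uint32_t right;  /* branch taken when the current bit is 1 */
} TreeNode;

uint32_t findRecord(const TreeNode *nodes, uint32_t nodeCount, uint32_t ip)
{
    uint32_t offset = 0;
    /* Read the IP's bits from most to least significant. */
    for (int bit = 31; bit >= 0; bit--) {
        const TreeNode *n = &nodes[offset];
        offset = ((ip >> bit) & 1) ? n->right : n->left;
        if (offset >= nodeCount)
            return offset - nodeCount;   /* reached a data record */
    }
    return offset;
}
```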
The format is clever. I like how they use specialized leaf and internal nodes in the tree, and the type is implicit. Their data structure is agnostic to the number of bits in an address, so they store IPv4 and IPv6 addresses in the same tree. They have some comments in the code that make me speculate that they may also optimize the data structure to combine some nodes so that it is actually a directed graph.
Clever as their data structure and code are, I see potential for improvement. My main critique of their search algorithm is that I think the data access pattern is going to be fairly random, and the time is going to be dominated by cache misses. Their binary file is 32MB, which is small enough that frequently used data should mostly fit in L3 cache (my desktop machine has 8MB of L3). Mostly fitting in cache will help if you just hammer at the geolocation database, but if geolocation is done less frequently so that other code pollutes the cache, I predict that performance will drop significantly. Supporting both IPv6 and IPv4 simultaneously is convenient, but doesn't seem like a critical feature to me. Nobody uses IPv6, as demonstrated by there being over 30x more IPv4 geolocations than there are IPv6 geolocations.
Let's get a feel for the data we will need to query. I will parse the text-based CSV files that they provide and convert their data into my own format. Split between two files, there are 2,968,938 entries in the list of IP blocks broken down to city level and 86,531 location entries. The IP blocks store IDs that appear to be unique hashes and that reference the list of locations. There are 34 times as many IP blocks as locations, which is unfortunate.
Looking at a neighboring pair of representative entries in the list of IP blocks, a lot of the data is missing. This was presumably removed from the free data set intentionally. The bit-oriented nature of their algorithm is apparent from the use of the X.X.X.X/Y slash notation.
Based on the location ID, I can find the relevant entry in the location database.
The CSV format has a lot of redundant information. Sequential blocks of IPs could be merged into a single range, which will reduce the ratio of blocks to locations. The country ID, postal code, latitude, and longitude that are stored in the IP block should really be moved into the location entry instead. Actually, the country ID is essentially in the location entry already and should be removed.
The parsing code was not terribly difficult, but took longer than I would like to admit. I generally prefer to parse files in a way similar to what I described in my article on parsing a PPM image. I didn't care too much about parsing speed, so I used a framework that is a bit more general and a bit slower. The only tricky bit was that the CSV file separates entries by commas, but some entries are quoted strings with commas inside them.
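To illustrate that wrinkle, here is a minimal quote-aware splitter. It is a toy, not my actual framework, and it ignores escaped quotes inside quoted fields:

```c
#include <string.h>

/* Split one CSV line in place; quoted fields may contain commas.
   Returns the number of fields written into fields[]. */
int splitCsvLine(char *line, char **fields, int maxFields)
{
    char *p = line;
    int count = 0;
    while (count < maxFields) {
        if (*p == '"') {
            fields[count++] = ++p;      /* skip the opening quote   */
            while (*p && *p != '"')
                p++;                    /* scan to the closing one  */
            if (*p)
                *p++ = '\0';
        } else {
            fields[count++] = p;
        }
        p = strchr(p, ',');             /* commas inside quotes     */
        if (!p)                         /* were already consumed    */
            break;
        *p++ = '\0';
    }
    return count;
}
```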
My final results are the following. There are 2,968,938 entries in the file, and 1,787,362 entries were left after merging neighboring blocks, so merging saved about 40% of the memory for IP blocks. Not all blocks have a city or country geoname ID. If the city wasn't specified, I used the country instead, and there were 244 blocks that had neither a city nor a country. I opted to merge those unknown blocks with whatever preceded them. After merging, every combination of 32 bits had some location associated with it. I also pruned unused locations; it turns out that 114 of the 86,531 location entries are never referenced by any IP block.
My data structure is quite different from the one used by MaxMind. First I will explain the basic idea of how my algorithm works; the actual implementation comes later. Imagine that blocks are represented by the first IP address in the block and that these IPs are stored in a sorted list. The IPs associated with a block are all the IPs from the block identifier up to the next IP stored in the list. Given an IP query, my job is to find the largest IP in the list that is less than or equal to my query.
Only IPs are stored in the list, but I record the index $i$ of the entry that I find in the list. Imagine now that I have a list of locations that are arranged in the same order as the blocks. All the location information about block $i$ is stored in the separate list of locations at index $i$.
That setup works, but is inefficient in its use of memory. There are far more blocks than actual locations, so duplicating the full location information is wasteful. I therefore don't store a full location for each block, but add a layer of indirection by using three lists. Rather than the list of locations I previously described, I store an index into a shorter list of locations that doesn't have duplicate entries. In code, the arrays are:
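```c
/* A sketch of the three arrays (reconstructed; the exact field names
   and sizes in my real code may differ). */
#include <stdint.h>

typedef struct {
    float latitude;
    float longitude;
    char  names[120];   /* country, state, and city strings, each
                           null terminated, packed back to back    */
} Location;             /* 128 bytes, i.e. exactly two cache lines */

uint32_t *blockStartIp;   /* sorted first IP of each block; block i
                             covers blockStartIp[i] up to (but not
                             including) blockStartIp[i+1]          */
uint32_t *blockLocation;  /* per block, an index into locations[]  */
Location *locations;      /* the deduplicated location records     */
```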
Querying data then looks similar to this:
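```c
/* Again a sketch: find the block containing ip, then follow the
   indirection to its location record. */
const Location *lookup(uint32_t ip)
{
    uint32_t block = findBlockIndex(ip);   /* largest start <= ip */
    return &locations[blockLocation[block]];
}
```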
The trick is to make the search $findBlockIndex$ as fast as possible. In a search, there are essentially two costs: comparisons and memory fetches. In our case, comparisons are extremely cheap because a pair of 32-bit integers can be compared in a single clock cycle. With 1,787,362 IP blocks, we will want a hierarchical search, like a binary search or a tree of some sort. These searches all share the characteristic of creating a chain of dependent memory fetches, so memory latency will dominate our search time. The table below, from the Intel developer forums, lists approximate latencies for the different levels of the memory hierarchy. Our data is large and our accesses scattered, so we will have lots of fetches from L3 (~40 cycles) and from DRAM (~60 ns).
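Reproducing the commonly quoted Core i7 numbers from that thread approximately:

Memory level | Approximate latency
L1 cache hit | ~4 cycles
L2 cache hit | ~10 cycles
L3 cache hit | ~40 cycles
Local DRAM | ~60 ns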
We want as many of our reads as possible to come from L1 cache, so it is important to understand how caches work. Although the CPU can address individual bytes, the memory system works at the coarser granularity of cache lines. For current Intel processors, a cache line is 64 bytes long, and each line starts at an integer multiple of 64 bytes. If you request memory that is not in L1 cache, the memory system will look in L2, then L3, then finally in DRAM. Once the cache line is found, it populates all levels of cache up to L1. On every cache miss, 64 bytes will be read whether you use them or not, so it is best to use them all.
I organize the IPs into a tree with a large branching factor, related to the idea of a B-tree. An IPv4 address is 4 bytes long and a cache line is 64 bytes, so each node can store 16 IPs in a single cache line, which gives a branching factor of 17. Unlike a B-tree, I only ever read the data, so I pack the tree as densely and uniformly as possible. In fact, I flatten the tree into an array, as is typically done for a heap. This means I don't need to store any ancillary information within nodes, like the number of IPs in the node or pointers to children. Because of the large branching factor, I expect to locate a block in about $log_{17}1,787,362 \approx 5$ cache-line fetches vs the $log_{2}2,968,938 \approx 21$ of MaxMind, and so expect to be about 4x faster.
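Here is a sketch of how such a search can work. The details are illustrative assumptions rather than a verbatim copy of my implementation: the bottom level is the sorted array of block-start IPs, padded to a multiple of 16 with 0xFFFFFFFF; each higher level stores 16 separator keys per 64-byte node; and child $b$ of node $n$ lives at position $17n + b$ on the level below.

```c
#include <stdint.h>

/* levels[0] is the padded, sorted array of block-start IPs; the levels
   above it hold separator keys; levels[levelCount - 1] is the root. */
uint32_t findBlockIndex(uint32_t *const *levels, int levelCount, uint32_t ip)
{
    uint32_t node = 0;
    for (int level = levelCount - 1; level > 0; level--) {
        /* One 64-byte node means one cache-line fetch per level. */
        const uint32_t *keys = &levels[level][node * 16];
        uint32_t branch = 0;
        while (branch < 16 && keys[branch] <= ip)
            branch++;                 /* compares are single cycle  */
        node = node * 17 + branch;    /* implicit pointer to child  */
    }
    const uint32_t *keys = &levels[0][node * 16];
    uint32_t pos = 0;
    while (pos < 16 && keys[pos] <= ip)
        pos++;
    return node * 16 + pos - 1;       /* largest block start <= ip  */
}
```

A real implementation would count keys branchlessly, and the 0xFFFFFFFF padding means a query of 255.255.255.255 needs a special case, but the memory access pattern is the point here.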
In terms of indices, 86,531 locations are too many to index using 16 bits. I decided to use a 32 bit integer rather than being fancy and using 24 bits because I doubt it matters all that much in terms of performance. This layer of indirection doesn't exist in the MaxMind format, which is unfortunate for me, but I should still have fewer cache misses.
Now we get to the location data, where I cheat a little bit, because I don't care about and therefore don't store some of the information that MaxMind has. In terms of pure lookup performance, this shouldn't matter. If you return a pointer to data, you don't actually pay the cost of reading the data until you read through the pointer. Regardless, I ran some statistics on the database and found that the longest strings for the names of countries, states (subdivisions), and cities were 44, 72, and 64 bytes respectively, totaling 180 bytes. However, the largest sum of lengths within a record was just 91 bytes. Null terminating the strings brings that up to 94 bytes. That means that I can easily fit latitude, longitude, and the names within two cache lines with room to spare.
If I were more pressed for space, I could compress latitude and longitude into 16-bit values with a worst-case error of half a mile. That would give me 28 spare bytes for extra information, like two-letter country codes or area codes. I make sure the records are cache aligned to avoid unnecessary cache misses: by disregarding alignment, even a 2-byte record could straddle two 64-byte cache lines, and a 66-byte record could span three.
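The 16-bit packing mentioned above would be straightforward; a sketch of the idea (function names hypothetical, since I don't actually do this):

```c
#include <stdint.h>

/* Quantize latitude in [-90, 90] to 16 bits; the step is 180/65535,
   about 0.0027 degrees. Longitude is analogous with a 360-degree
   range, about 0.0055 degrees per step. */
static uint16_t packLat(float lat)
{
    return (uint16_t)((lat + 90.0f) * (65535.0f / 180.0f) + 0.5f);
}

static float unpackLat(uint16_t q)
{
    return q * (180.0f / 65535.0f) - 90.0f;
}
```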
I decided to use the same approach as MaxMind did for file I/O. I memory map a binary file that has the data already arranged in memory as I will use it. All I do to load the database is calculate a couple of pointer offsets. When I write the binary file I make sure to align all arrays to multiples of 64 bytes. A convenient property of memory mapped files is that the file contents are guaranteed to align to a page boundary, which is typically 4096 bytes, and therefore means the data is also cache aligned. Without error handling, the entire loading code is as follows.
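```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* A reconstruction of the loading path; the exact header layout is
   illustrative. The file stores the block count in a small header,
   then the three arrays, each starting at a multiple of 64 bytes. */
static size_t align64(size_t n) { return (n + 63) & ~(size_t)63; }

void loadDatabase(const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    /* The mapping begins on a 4096-byte page boundary, so offsets
       that are multiples of 64 remain cache-line aligned. */
    char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);

    uint32_t blockCount = *(const uint32_t *)base;
    size_t offset = 64;                   /* arrays follow the header */

    blockStartIp  = (uint32_t *)(base + offset);
    offset += align64(blockCount * sizeof(uint32_t));
    blockLocation = (uint32_t *)(base + offset);
    offset += align64(blockCount * sizeof(uint32_t));
    locations     = (Location *)(base + offset);
}
```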
I spent a while trying to get the C library working. First I tried using the libGeoIP that is available through apt-get. I used the example code and made a minimal program. Great, or... not. The GeoIP library turns out not to work with the new database files on the website. I then downloaded libmaxminddb version 1.0.3, and on Linux it was a snap to configure, make, make install, ldconfig. Now to actually look up some IPs.
This is where I go on a minor rant. I hate the MaxMind API. I'd like to think that the API is a perverse joke, like the language INTERCAL, but I don't think that it is. I spent several hours trying to figure out how to get information out of the database using their API and could not manage it. First, it is difficult to look up an IP address. There are two options: pass an address as an ASCII string, or pass a $struct sockaddr$ pointer. Converting an IP to a string, then parsing that string back into an IP, is clearly absurd; that is nonetheless what they use for their benchmark in the example program they provide. The endian order of sockaddr_in is reversed, but I don't know if it is supposed to be that way or not. Let me show my lookup code vs theirs to illustrate the overly complicated API.
Mine:
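```c
/* ip is a host-order uint32_t; lookup() is the sketch from earlier. */
const Location *loc = lookup(ip);
```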
Theirs:
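```c
/* Reconstructed from the libmaxminddb example code, following the
   1.0.3 API as best I understand it. mmdb is an MMDB_s opened
   earlier with MMDB_open(); ip is a host-order uint32_t. */
#include <arpa/inet.h>
#include <maxminddb.h>

struct sockaddr_in sa;
sa.sin_family = AF_INET;
sa.sin_addr.s_addr = htonl(ip);   /* note the byte-order shuffling */

int mmdb_error;
MMDB_lookup_result_s result =
    MMDB_lookup_sockaddr(&mmdb, (struct sockaddr *)&sa, &mmdb_error);

MMDB_entry_data_list_s *edl = NULL;
if (result.found_entry)
    MMDB_get_entry_data_list(&result.entry, &edl);
/* ... use edl ... */
MMDB_free_entry_data_list(edl);   /* heap allocated, must be freed */
```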
I wrapped the lookup into $mmLookup$, because there is no way I am retyping that monstrosity each time I need to look something up. A wrinkle to notice is that, with the MaxMind API, the pointer $edl$ needs to be freed. That is because the data is actually copied into a list of nodes, each of which is allocated using $malloc$. Allocating and freeing memory has terrible performance repercussions. My code just gives a pointer to static data in a memory mapped file, so nothing needs to be freed. Also, I guarantee that a valid location will be returned for any IP address, so no null pointer check is necessary.
Now we get to the point of actually trying to access data from a location entry. In my code this is very simple.
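```c
/* Field access is an ordinary struct read (again using my sketched
   layout; names holds country, state, and city back to back, so
   this prints the country name first). */
printf("%f %f %s\n", loc->latitude, loc->longitude, loc->names);
```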
In libmaxminddb, I never figured out how to get the data into my program. The example program provided along with the library does not show how to do so, and combing through the headers and source left me confused. The function
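```c
/* Presumably this one, quoted from maxminddb.h (my best guess at
   the function in question): */
extern int MMDB_get_value(MMDB_entry_s *const start,
                          MMDB_entry_data_s *const entry_data, ...);
```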
looks promising, but all of the arguments are constant. This means the only output is the integer that is returned. However, that just gives the status of whether the path to the data in the record was found. The only way of extracting data seems to be the one given in the example code, which is to dump JSON formatted data to a file stream.
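For what it's worth, that dump-to-stream escape hatch looks something like the following sketch, using POSIX $open\_memstream$ in place of an actual pipe for brevity:

```c
#include <stdio.h>
#include <stdlib.h>
#include <maxminddb.h>

/* Dump the record as text into memory; edl comes from
   MMDB_get_entry_data_list() as above. */
void dumpRecord(MMDB_entry_data_list_s *edl)
{
    char *buf = NULL;
    size_t len = 0;
    FILE *stream = open_memstream(&buf, &len);
    MMDB_dump_entry_data_list(stream, edl, 0); /* writes JSON-ish text */
    fclose(stream);
    /* buf now holds the record as text; parse it with a JSON library
       and walk the parse tree to find the field you actually wanted. */
    puts(buf);
    free(buf);
}
```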
So, for example, if you want to get the name of the country of someone that is visiting your website, you open two file streams that are piped together then pass one end of the pipe to $MMDB\_dump\_entry\_data\_list$. That will walk through the linked list and write a JSON record. You then read that record, parse it (presumably with some off-the-shelf JSON library), and extract the relevant information from the parse tree. If this is truly the case, it is ridiculous. I can at least confirm that the locations they give and that I give are the same. Both say I am in Sammamish, WA, which is a city that neighbors the city I live in. Here is the lovely JSON output.
I did all testing on my first generation Surface Pro running 64-bit Ubuntu. According to /proc/cpuinfo, my processor is an Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz with a cache size of 3072 KB. I compiled with whatever the default release settings are for QtCreator, which used GCC 4.8.2. I used the microsecond-resolution $gettimeofday$ function to measure time deltas. I stored 10 million uniformly random IP addresses generated with the Mersenne Twister pseudorandom number generator. Storing the IPs in an array both lets me give the same inputs to both algorithms and reduces the overhead of creating IPs in the timing loop. Because I couldn't figure out how to extract info from the record pointer provided by MaxMind, I only timed how long each algorithm takes to return a record pointer. To ensure that the lookup code was not optimized out, I calculate a bitwise OR of the returned pointers and print it at the end. The timing loop is sketched below (array and function names are assumed):
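```c
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

#define QUERY_COUNT 10000000

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* ips[] holds the pregenerated random addresses. */
void benchmark(const uint32_t *ips)
{
    uintptr_t sink = 0;              /* defeats dead-code elimination */
    double t0 = seconds();
    for (int i = 0; i < QUERY_COUNT; i++)
        sink |= (uintptr_t)lookup(ips[i]);
    double t1 = seconds();
    printf("%.4f s (sink %lx)\n", t1 - t0, (unsigned long)sink);
}
```

The results were: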
 | Josiah | MaxMind
Time for 10M queries | 0.7861 s | 71.94 s
Rate of queries | 12.72 M/s | 0.1390 M/s
I feel a bit disappointed that MaxMind's library is 92x slower than mine. It is also significantly slower than reported on their benchmark page, which I believe gives timings for their previous, now discontinued, library. There, they report a rate of 2.67 M/s on an Intel Core i5 2.53 GHz machine. My rate is about 4.75x faster than that, which is in the ballpark of what I expected from using a B-tree instead of a binary search. I'm not sure whether it is fair to adjust for clock speed, because memory latency rather than CPU speed is likely the bottleneck. If speed were proportional to CPU clock, I would be 7x faster than the old library.
It feels a little cheap to compare times taken to return a pointer, because my method never touches the memory holding the location information, which saves me two cache misses. In contrast, MaxMind touches that memory and copies it into new memory that it allocates and frees. That said, I made the test as fair as I could, and it is the same benchmark provided in the MaxMind example code. I would have had both methods do something with the location data if the MaxMind API had allowed me to. From what I have seen so far, it may be for the best that MaxMind was saved from the embarrassment of additional tests.