Databases holy war. Yet another chapter.

Near a year ago we looked for a geolocation solution to complement the functionality of one of our projects. We expected to deal with hundreds of thousands of records at the beginning and to grow up to 10-15 million records. Finally, we’ve picked up PostgreSQL and its PostGIS extension.

We’ve decided to set amount of records to 100M, so to make it worth of telling about here. The key goal is to compare both DBs spatial indexes possibilities and speed.

Possible solutions

You’ll probably end up with some popular solutions like:
– PostgreSQL & PostGIS
– MySQL
– solr
– MongoDB

Each of them has its own pros and cons, but for a several reasons we decided to choose finally between PostgreSQL and MySQL.

Server configuration

Our benchmark ran on virtual server having power of all cores of physical CPU (Intel Xeon E31275 @ 3.40GHz) and with 4 Gb of RAM. Hard disks maintained by LSI MegaSAS 9260 controller.

Software versions & configuration details

Software versions we used in our benchmarking:
PostgreSQL: 9.1.2
PostGIS: 1.5.3
MySQL: 5.5.17

MySQL configuration adjustments are quite brief:
key_buffer = 2GB sort_buffer_size = 7M read_buffer_size = 7M read_rnd_buffer_size = 7M myisam_sort_buffer_size = 8MB myisam_max_sort_file_size = 128MB thread_concurrency = 16

PostgreSQL changes are compact too:
shared_buffers = 2GB temp_buffers = 8MB work_mem = 7MB effective_io_concurrency = 8 effective_cache_size = 2GB

DBs tables setup and notes

MySQL’s create table statement:
CREATE TABLE locations ( id serial, name varchar(255), latitude decimal(12, 8), longitude decimal(12, 8), location Point NOT NULL ) ENGINE=MyISAM;

As you may see engine type here is MyISAM. That’s because only this type of storage engine supports spatial indexing.

Spatial index creating statement:
CREATE SPATIAL INDEX idx_locations ON locations(location);

PostgreSQL create table statement:
CREATE TABLE locations ( id serial UNIQUE, name varchar(255), latitude decimal(12, 8), longitude decimal(12, 8), location GEOGRAPHY(POINT,4326) );

Spatial index creating statement:
CREATE INDEX idx_location ON locations USING GIST(location);

Note: some table columns were skipped

Benchmarking methodology

Our business needs included searching for locations within some radius having coordinates of the center. While benchmarking, we’re taking 1000 locations and searching for other locations using coordinates of each selected row within 1000 meters around it. This gives us 1000 lookups. Every hundred of lookups we’ve printed time to see a progress.

Also, we measured performance producing 20 parallel requests, each with 30 points to look around.

Our set consists of 100M records placed randomly within -20..20 degrees both longitude and latitude.

In order to have benefits from spatial indexing in MySQL you’ll have to issue several queries (they may be packed in stored procedure):
set @lng1 = ${lng} - 1000/abs(cos(radians($lat)) * 111133.00000000001); set @lng2 = ${lng} + 1000/abs(cos(radians($lat)) * 111133.00000000001); set @lat1 = ${lat} - (1000/111133.00000000001); set @lat2 = ${lat} + (1000/111133.00000000001); set @search_area = PolyFromText(CONCAT('Polygon((', @lng1, ' ', @lat1, ',', @lng1, ' ', @lat2, ',', @lng2, ' ', @lat2, ',', @lng2, ' ', @lat1, ',', @lng1, ' ', @lat1, '))')); SELECT id, name, latitude, longitude FROM businesses WHERE MBRWithin(location, @search_area) ORDER BY id ASC;

Those statements will give us a square with the given coordinates in the point of its diagonals intersection. The distance from this “center” point to all vertexes will be 1000 meters (the distance we’re searching within in our benchmark).

PostGIS query is much more compact than MySQL one:
SELECT id,name, latitude, longitude from businesses where ST_DWithin(location, ST_PointFromText('POINT($lng $lat)', 4326), 1000) ORDER BY id ASC;

Shell variables $lng and $lat are substituted with coordinates of the points we’re going to search around during queries generation.

Benchmarking results

Note: while MySQL spatial extension is built-in and did not require any special setup, PostGIS is a separate one and its setup documentation suggests performing index clustering (cluster index statement), without this step performance of this extension is simply awful (3-4 times slower than results below you’ll see).

Before taking in count any results, we’ve ran several warm-up cycles. After that we’ve taken 3 runs for each DB. Each run used new set of 1000 points to search. This was done manually by changing offset for initial select query.

So finally for 100M of records next benchmarking results were received:

Consecutive mode

MySQL	PostgreSQL/PostGIS
run #1
19.427407 secs (100 points done)	16.545106 secs (100 points done)
40.096223 secs (200 points done)	34.499818 secs (200 points done)
60.424349 secs (300 points done)	52.917147 secs (300 points done)
80.402063 secs (400 points done)	71.151048 secs (400 points done)
99.697443 secs (500 points done)	88.908134 secs (500 points done)
119.500017 secs (600 points done)	107.033202 secs (600 points done)
138.697637 secs (700 points done)	123.489839 secs (700 points done)
156.777121 secs (800 points done)	141.127636 secs (800 points done)
174.214299 secs (900 points done)	158.778954 secs (900 points done)
192.821354 secs (1000 points done)	175.250103 secs (1000 points done)
run #2
18.773068 secs (100 points done)	18.104028 secs (100 points done)
37.957617 secs (200 points done)	35.822432 secs (200 points done)
56.055244 secs (300 points done)	53.277242 secs (300 points done)
75.594432 secs (400 points done)	70.921150 secs (400 points done)
93.304769 secs (500 points done)	88.077090 secs (500 points done)
111.630544 secs (600 points done)	105.480567 secs (600 points done)
130.219551 secs (700 points done)	122.443156 secs (700 points done)
148.340222 secs (800 points done)	139.523156 secs (800 points done)
167.564858 secs (900 points done)	156.752994 secs (900 points done)
186.036480 secs (1000 points done)	173.929666 secs (1000 points done)
run #3
20.161445 secs (100 points done)	16.168425 secs (100 points done)
39.078161 secs (200 points done)	33.499001 secs (200 points done)
56.925266 secs (300 points done)	50.269011 secs (300 points done)
76.932744 secs (400 points done)	67.876175 secs (400 points done)
95.756943 secs (500 points done)	83.846680 secs (500 points done)
113.185249 secs (600 points done)	99.359155 secs (600 points done)
132.055828 secs (700 points done)	115.621416 secs (700 points done)
149.872986 secs (800 points done)	133.139235 secs (800 points done)
167.135775 secs (900 points done)	148.978048 secs (900 points done)
185.455375 secs (1000 points done)	165.818556 secs (1000 points done)

Parallel mode (20 processes x 30 searches)

MySQL	PostgreSQL/PostGIS
run #1
real 2m3.099s	real 1m21.131s
run #2
real 2m3.049s	real 1m22.288s
run #3
real 1m57.568s	real 1m20.559s

Averaging these results we have next ones:
Consecutive run (100M of records):

MySQL:	188.104403s or 188ms per point
PostgreSQL/PostGIS:	171.666108s or 172ms per point

Parallel run (100M of records):

MySQL:	121.239s or 202ms per point (600 points in total)
PostgreSQL/PostGIS:	81.326s or 136ms per point (600 points in total)

As a result, PostgreSQL/PostGIS appears to be 9.6% faster than MySQL in consecutive mode and 48.6% faster in parallel mode we’ve used for benchmarking.

Further more there’s one more benchmark to prove one important thing: spatial indexes doesn’t depend much on records amount in a table.

So let’s benchmark the same lookups but with 50M of records seeded with higher density (square with -10..10 for both latitude and longitude).

Consecutive mode

MySQL	PostgreSQL/PostGIS
run #1
25.699707 secs (100 points done)	17.625612 secs (100 points done)
47.867214 secs (200 points done)	34.638656 secs (200 points done)
69.305084 secs (300 points done)	52.237613 secs (300 points done)
92.859412 secs (400 points done)	69.341271 secs (400 points done)
115.454518 secs (500 points done)	86.262825 secs (500 points done)
139.218207 secs (600 points done)	103.739050 secs (600 points done)
160.899055 secs (700 points done)	121.125119 secs (700 points done)
181.347560 secs (800 points done)	138.102620 secs (800 points done)
204.091182 secs (900 points done)	157.784176 secs (900 points done)
226.642999 secs (1000 points done)	176.534620 secs (1000 points done)
run #2
23.127635 secs (100 points done)	17.707309 secs (100 points done)
43.896270 secs (200 points done)	35.165084 secs (200 points done)
67.718290 secs (300 points done)	52.980560 secs (300 points done)
89.345787 secs (400 points done)	70.480220 secs (400 points done)
110.066845 secs (500 points done)	88.414388 secs (500 points done)
130.674735 secs (600 points done)	105.008400 secs (600 points done)
151.385573 secs (700 points done)	121.884277 secs (700 points done)
173.925463 secs (800 points done)	139.641970 secs (800 points done)
195.087116 secs (900 points done)	157.289428 secs (900 points done)
217.820302 secs (1000 points done)	175.215180 secs (1000 points done)
run #3
22.361896 secs (100 points done)	17.439594 secs (100 points done)
44.386435 secs (200 points done)	35.320356 secs (200 points done)
67.019217 secs (300 points done)	52.470219 secs (300 points done)
88.447199 secs (400 points done)	69.500153 secs (400 points done)
109.669691 secs (500 points done)	87.756069 secs (500 points done)
130.935786 secs (600 points done)	104.964921 secs (600 points done)
151.307012 secs (700 points done)	123.235305 secs (700 points done)
172.589960 secs (800 points done)	140.651206 secs (800 points done)
192.807686 secs (900 points done)	158.650197 secs (900 points done)
217.317390 secs (1000 points done)	176.900490 secs (1000 points done)

Parallel mode (20 processes x 30 searches)

MySQL	PostgreSQL/PostGIS
run #1
real 2m5.448s	1m37.288s
run #2
real 1m58.536s	real 1m36.647s
run #3
real 1m59.127s	real 1m30.889s

Averaging these results we have next ones:
Consecutive run (50M of records):

MySQL:	220.593564s or 221ms per point
PostgreSQL/PostGIS:	176.216763s or 176ms per point

Parallel run (50M of records):

MySQL:	121.037s or 202ms per point (600 points in total)
PostgreSQL/PostGIS:	94.941s or 158ms per point (600 pois in total)

Comparing to less dense 100M set you may see that PostgreSQL/PostGIS performed almost at the same rate in consecutive mode, while MySQL was significantly slower increasing the gap up to 25.2% between DBs.

As for parallel mode MySQL retained its performance while PostgreSQL/PostGIS introduced some decrease in performance while still performing faster, now by 27.5%.

Slightly higher density of locations influenced timings comparing to less dense 100M set.

Conclusions

Although it was rather rough comparison because of the way both DBs produced result, I think, it was interesting to see how MySQL and PostgreSQL performed during this benchmark.

Spatial indexes proved to be fast no matter how large is your dataset. Both 50M and 100M datasets seemed to be the same deal for engines and indexes. Both DBs using R-Tree scheme for spatial indexes (PostgreSQL has a special type for it called GIST). What influenced results more is locations density, slowing down overall performance on a smaller dataset.

One more note not to be missed about PostGIS – it allows to have more accurate search results, because it has an option to calculate over spheroid surface rather than simple sphere. This effect will be more noticeable when, suppose, you want to have also a distance to the center of searching radius for all locations.

The lag will be up to several meters if some location nearly fits within 1000 meters radius lookup compared to calculations assuming Earth as an ideal sphere. The bigger radius, the more pronounced effect will be. MySQL doesn’t offers that option. Also, while using MySQL you’ll probably want to put calculating queries into stored procedure to have cleaner look of interaction with DB.

Both DBs will give good results, but it seems that PostgreSQL/PostGIS for the moment might be better, faster and more versatile solution, on the other hand, it has more dependencies to be installed and is a little bit tricky during table’s index creation.

PS

Script used for benchmarking can be found here: https://gist.github.com/2216469