MapD: Massive Throughput Database Queries with LLVM on GPUs

Originally published at: https://developer.nvidia.com/blog/mapd-massive-throughput-database-queries-llvm-gpus/

Note: this post was co-written by Alex Şuhan and Todd Mostak of MapD. At MapD our goal is to build the world’s fastest big data analytics and visualization platform that enables lag-free interactive exploration of multi-billion row datasets. MapD supports standard SQL queries as well as a visualization API that maps OpenGL primitives onto SQL…

Hello, very interesting post. Quick question, as I haven't used MapD before: does all the data need to be stored in GPU memory to get these speedups, or do these timings include the standard cost of transferring data from, say, hard disk to the compute unit?

Hi Rob,

One of the authors here. We don't include the cost of transferring data to GPU memory because our architecture focuses on caching the hot (most recently used) data on the GPUs, which can total up to 192GB of memory on an 8-K80 server (and likely much more with future generations of GPUs). Also, since we are a column store, we can compress the data significantly and cache only the queried columns on the GPU. Whatever data doesn't fit on the GPU can be cached in the likely larger CPU RAM. So in most of our typical use cases the data is read from SSD only once, at startup. The compressed working set typically fits in GPU RAM and queries are blazing fast. The result sets are usually quite small, so the overhead of moving results back to the CPU is typically negligible.
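As a rough illustration of that sizing argument, here is a back-of-the-envelope sketch. All figures (row count, column widths, compression ratio) are assumptions for the sake of example, not MapD internals:

```python
# Back-of-the-envelope estimate: does the compressed, column-pruned
# working set fit in aggregate GPU RAM? All figures are illustrative.

NUM_GPUS = 8                   # e.g. 8 Tesla K80 cards
GB_PER_GPU = 24                # 24 GB per K80 (2 x 12 GB devices)
gpu_ram_gb = NUM_GPUS * GB_PER_GPU    # 192 GB total

rows = 3_000_000_000           # assumed 3-billion-row table
queried_col_bytes = 4 + 4 + 2  # only the columns a query touches are cached
compression = 2.0              # assumed column-store compression ratio

working_set_gb = rows * queried_col_bytes / compression / 1e9
print(f"GPU RAM:     {gpu_ram_gb} GB")
print(f"Working set: {working_set_gb:.0f} GB "
      f"({'fits' if working_set_gb <= gpu_ram_gb else 'spills to CPU RAM'})")
```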

The numbers in figure 2 are pretty disingenuous. You are comparing >$40k in GPUs against what could be <$2k in CPU hardware.

A dual-socket Xeon server with decent clocks and at least 200GB of RAM will be at least $5-6K, and with a RAID array of fast SSDs let's say $8K. And we can run basically as fast (albeit with half the VRAM) on a server with eight $1K Titan X cards. The point here is not to be deceptive about price but to show the amazing performance that can be achieved on a single server with GPUs.

To match the performance you'd need at least a full rack of CPU servers (running fast software like MapD, not your usual in-memory databases), which carries lots of extra costs of its own (rack space, extra power), and you'd probably not be able to scale linearly due to the network overheads of being distributed. So we think that for customers who need to support multiple users running interactive analytics on relatively big datasets, our system with GPUs is both the most performant and the most cost-effective solution available.

How does your current system scale beyond one box?

Also, if one were to buy a setup like the one you mention on AWS:

A g2.8xlarge contains 16GB of VRAM and costs almost $1,900 per month. So the 192GB of VRAM you mention would cost almost $23,000 per month!

An r3.8xlarge system with 244GB of RAM and 32 cores costs only $2,000 per month.

How then does it make sense to run MapD in that case considering the enormous cost difference?
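For reference, the arithmetic behind those figures (a quick sketch using the prices quoted above; AWS prices vary by region and over time):

```python
# Reproducing the AWS cost arithmetic quoted above (mid-2015 prices as stated).
g2_vram_gb, g2_monthly_usd = 16, 1_900    # g2.8xlarge: 16 GB VRAM
target_vram_gb = 192                      # to match the 8-K80 box
instances = -(-target_vram_gb // g2_vram_gb)   # ceiling division -> 12
print(f"{instances} x g2.8xlarge: ~${instances * g2_monthly_usd:,}/month")  # ~$22,800

r3_monthly_usd = 2_000                    # r3.8xlarge: 244 GB RAM, 32 vCPUs
print(f"1 x r3.8xlarge:  ~${r3_monthly_usd:,}/month")
```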

If you buy on AWS there is no cost of rack space or power...

Interesting!

Recently in the PostgreSQL world there has been some talk of GPU acceleration of queries and of JIT compilation for something called schema-binding; the former is more or less possible with an extension called PG-Strom, and an approach for the latter was recently discussed at PGCon 2015.

Obviously, there is a long way to go...

I disagree about linear scaling. Scaling linearly is easy if you push down aggregation. I can run many nodes for the Star Schema Benchmark, for example, and at most 600 rows per server are ever sent between the machines (because they are already aggregated); it is a shared-nothing, embarrassingly parallel system.

For a 5B-row table and 20 nodes, for example, that is 250M rows per node, which easily fits in RAM, and I can aggregate it with, say, all 64 cores in the box, across all 20 boxes. Those nodes don't have to have much disk, and CPU and RAM are fairly cheap. Not sure how that competes with your solution.

It is called Shard-Query, and it meets/beats Redshift at the 200GB SSB (the largest tested).
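To make the aggregation pushdown concrete, here is a minimal sketch (toy data and function names assumed; not actual Shard-Query code): each shard runs a partial GROUP BY locally, and only those already-aggregated rows cross the network to a coordinator that merges them.

```python
# Minimal sketch of distributed GROUP-BY pushdown (toy data, illustrative only).
# Phase 1 runs on each shard; only the small partial results cross the network.
from collections import Counter

def partial_aggregate(shard_rows):
    """Per-shard GROUP BY key, SUM(value) -- runs locally on each node."""
    acc = Counter()
    for key, value in shard_rows:
        acc[key] += value
    return acc                  # a handful of rows, regardless of shard size

def merge(partials):
    """Coordinator merges the partial aggregates -- the only network transfer."""
    total = Counter()
    for p in partials:
        total.update(p)         # Counter.update adds counts per key
    return total

shards = [
    [("east", 10), ("west", 5), ("east", 7)],   # node 1
    [("west", 2), ("east", 1)],                 # node 2
]
print(merge(partial_aggregate(s) for s in shards))
# Counter({'east': 18, 'west': 7})
```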

Actually, those are baked in. Nothing in life is free.

There is also Alenka on GitHub, which is interesting.

Hi Justin, you have a good point. For low-cardinality group-bys the amount of data sent between nodes should be small, and sharding between nodes should work quite well as long as the network communication is efficient. (My experience is that existing distributed DBs often see an appreciable slowdown from network I/O even when the data sizes exchanged should be trivial.) As the cardinality goes up, though, or once you start looking at joins, it is more advantageous to be on a single super-node like ours. Of course we're planning distributed support as well, and for that we'll be bound by the same laws of network physics as everyone else. :)
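A rough way to see where that crossover lies (node count and aggregated-row width are assumed, purely illustrative): the bytes shipped to the coordinator grow with group-by cardinality, not with table size.

```python
# Network cost of shipping partial aggregates (illustrative model only).
nodes, agg_row_bytes = 20, 16      # assume 16 B per aggregated (key, sum) row
for groups in (600, 1_000_000, 100_000_000):
    gb = nodes * groups * agg_row_bytes / 1e9
    print(f"{groups:>11,} groups -> ~{gb:,.4f} GB over the network")
# 600 groups is trivial; 100M groups means ~32 GB crossing the wire.
```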

Shard-Query looks nice; how does it compare to something like MemSQL for analytics workloads?

Hi,

A few points:
Even if a query produces 1 million aggregate rows, that is still much more efficient than moving billions of rows over the network. The network physics of 10Gbit Ethernet hardly matter at those scales. I doubt there are GROUP BYs with cardinality in the billions.

Shard-Query duplicates non-sharded tables on all nodes, so you can use a star or snowflake schema. It does not support joining together tables sharded on different keys. Use Hadoop or Redshift for that; it is not my niche.

I do analytics with ICE (Infobright Community Edition), which is a compressing column store with an in-memory metadata system that replaces indexes. It is quite fast and, as I said, competes with Redshift; and Shard-Query is just as petabyte-scale as Redshift is.

I even got congratulations on my benchmark against Redshift from the manager of the Redshift team.

Shard-Query also implements window functions for MySQL, which does not support them natively.

SoftLayer has cloud systems available with NVIDIA K80 GPUs, FWIW. You can get a machine with two K80s = 24 GB of VRAM for under $2K per month.
Not the 192GB you guys are talking about, but enough to get started perhaps.

Actually 2 K80s = 48GB of VRAM.

Right on, and I should know, eh?!
One K80 = 24 GB, folks...

I'm just curious whether you have any thoughts about trying FPGAs for acceleration in the MapD project.

Can I run it using an NVIDIA GeForce 840M, which is based on the Maxwell architecture?