MapD: Massive Throughput Database Queries with LLVM on GPUs

Originally published at: https://developer.nvidia.com/blog/mapd-massive-throughput-database-queries-llvm-gpus/

Note: this post was co-written by Alex Şuhan and Todd Mostak of MapD. At MapD our goal is to build the world’s fastest big data analytics and visualization platform that enables lag-free interactive exploration of multi-billion row datasets. MapD supports standard SQL queries as well as a visualization API that maps OpenGL primitives onto SQL…

Hello, very interesting post. Quick question, as I haven't used MapD before: does all the data need to be stored in GPU memory to get these speedups, or do these timings include the standard cost of transferring data from, say, hard disk to the compute unit?

Hi Rob,

One of the authors here. We don't include the cost of transferring data to GPU memory because our architecture focuses on caching the hot (most recently used) data on the GPUs, which can total up to 192GB of memory on an 8-K80 server (and likely much more with future generations of GPUs). Also, since we are a column store, we can compress the data significantly and cache only the queried columns on the GPU. Whatever data doesn't fit on the GPU can be cached in the likely larger CPU RAM. So in most of our typical use cases the data is read from SSD only once, at startup. The compressed working set typically fits in GPU RAM and queries are blazing fast. The result sets are usually quite small, so the overhead of moving results back to the CPU is typically negligible.
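As a rough illustration of that sizing argument, here is a back-of-the-envelope sketch. All figures (row count, column widths, compression ratio) are assumptions for the sake of example, not MapD internals:

```python
# Back-of-the-envelope estimate: does the compressed, column-pruned
# working set fit in aggregate GPU RAM? All figures are illustrative.

NUM_GPUS = 8                   # e.g. 8 Tesla K80 cards
GB_PER_GPU = 24                # 24 GB per K80 (2 x 12 GB devices)
gpu_ram_gb = NUM_GPUS * GB_PER_GPU    # 192 GB total

rows = 3_000_000_000           # assumed 3-billion-row table
queried_col_bytes = 4 + 4 + 2  # only the columns a query touches are cached
compression = 2.0              # assumed column-store compression ratio

working_set_gb = rows * queried_col_bytes / compression / 1e9
print(f"GPU RAM:     {gpu_ram_gb} GB")
print(f"Working set: {working_set_gb:.0f} GB "
      f"({'fits' if working_set_gb <= gpu_ram_gb else 'spills to CPU RAM'})")
```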

The numbers in figure 2 are pretty disingenuous. You are comparing >$40k in GPUs against what could be <$2k in CPU hardware.

A dual-socket Xeon server with decent clocks and at least 200GB of RAM will be at least $5-6K, and with a RAID array of fast SSDs let's say $8K. And we can run basically as fast (albeit with half the VRAM) on a server with eight $1K Titan X cards. The point here is not to be deceptive about price but to show the amazing performance that can be achieved on a single server with GPUs.

To match the performance you'd need at least a full rack of CPU servers (running fast software like MapD, not your usual in-memory databases), which carries lots of extra costs of its own (rack space, extra power), and you'd probably not be able to scale linearly due to the network overheads of being distributed. So we think that for customers who need to support multiple users running interactive analytics on relatively big datasets, our system with GPUs is both the most performant and the most cost-effective solution available.

How does your current system scale beyond one box?

Also, if one were to buy a setup like the one you mention on AWS:

A g2.8xlarge contains 16GB of VRAM and costs almost $1,900 per month. So the 192GB of VRAM you mention would cost almost $23,000 per month!

An r3.8xlarge system with 244GB of RAM and 32 cores costs only $2,000 per month.

How then does it make sense to run MapD in that case considering the enormous cost difference?
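For reference, the arithmetic behind those figures (a quick sketch using the prices quoted above; AWS prices vary by region and over time):

```python
# Reproducing the AWS cost arithmetic quoted above (mid-2015 prices as stated).
g2_vram_gb, g2_monthly_usd = 16, 1_900    # g2.8xlarge: 16 GB VRAM
target_vram_gb = 192                      # to match the 8-K80 box
instances = -(-target_vram_gb // g2_vram_gb)   # ceiling division -> 12
print(f"{instances} x g2.8xlarge: ~${instances * g2_monthly_usd:,}/month")  # ~$22,800

r3_monthly_usd = 2_000                    # r3.8xlarge: 244 GB RAM, 32 vCPUs
print(f"1 x r3.8xlarge:  ~${r3_monthly_usd:,}/month")
```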

If you buy on AWS there is no cost of rack space or power...

Interesting!

Recently in the PostgreSQL world there has been some talk of GPU acceleration of queries and of JIT compilation for something called schema-binding; the former is more or less possible with an extension called PG-Strom, and an approach for the latter was recently discussed at PGCon 2015.

Obviously, there is a long way to go...

I disagree about linear scaling. Scaling linearly is easy if you push down aggregation. I can run many nodes for the Star Schema Benchmark, for example, and at most 600 rows per server are ever sent between the machines (because they are already aggregated); it is a shared-nothing, embarrassingly parallel system.

For a 5B-row table and 20 nodes, for example, that is 250M rows per node, which easily fits in RAM, and I can aggregate it with, say, all 64 cores in the box, across all 20 boxes. Those nodes don't have to have much disk, and CPU and RAM are fairly cheap. Not sure how that competes with your solution.

It is called Shard-Query, and it meets/beats Redshift at the 200GB SSB (the largest tested).
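To make the aggregation pushdown concrete, here is a minimal sketch (toy data and function names assumed; not actual Shard-Query code): each shard runs a partial GROUP BY locally, and only those already-aggregated rows cross the network to a coordinator that merges them.

```python
# Minimal sketch of distributed GROUP-BY pushdown (toy data, illustrative only).
# Phase 1 runs on each shard; only the small partial results cross the network.
from collections import Counter

def partial_aggregate(shard_rows):
    """Per-shard GROUP BY key, SUM(value) -- runs locally on each node."""
    acc = Counter()
    for key, value in shard_rows:
        acc[key] += value
    return acc                  # a handful of rows, regardless of shard size

def merge(partials):
    """Coordinator merges the partial aggregates -- the only network transfer."""
    total = Counter()
    for p in partials:
        total.update(p)         # Counter.update adds counts per key
    return total

shards = [
    [("east", 10), ("west", 5), ("east", 7)],   # node 1
    [("west", 2), ("east", 1)],                 # node 2
]
print(merge(partial_aggregate(s) for s in shards))
# Counter({'east': 18, 'west': 7})
```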

Actually, those are baked in. Nothing in life is free.

There is also Alenka on GitHub, which is interesting.

Hi Justin, you have a good point. For low-cardinality group-bys the amount of data sent between nodes should be small, and sharding between nodes should work quite well as long as the network communication is efficient. (My experience is that existing distributed DBs often see an appreciable slowdown from network I/O even when the data sizes exchanged should be trivial.) As the cardinality goes up, though, or once you start looking at joins, it is more advantageous to be on a single super-node like ours. Of course we're planning distributed support as well, and for that we'll be bound by the same laws of network physics as everyone else. :)
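A rough way to see where that crossover lies (node count and aggregated-row width are assumed, purely illustrative): the bytes shipped to the coordinator grow with group-by cardinality, not with table size.

```python
# Network cost of shipping partial aggregates (illustrative model only).
nodes, agg_row_bytes = 20, 16      # assume 16 B per aggregated (key, sum) row
for groups in (600, 1_000_000, 100_000_000):
    gb = nodes * groups * agg_row_bytes / 1e9
    print(f"{groups:>11,} groups -> ~{gb:,.4f} GB over the network")
# 600 groups is trivial; 100M groups means ~32 GB crossing the wire.
```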

Shard-Query looks nice; how does it compare to something like MemSQL for analytics workloads?

Hi,

A few points:
Even if a query produces 1 million aggregate rows, that is still much more efficient than moving billions of rows over the network. The network physics of 10Gbit Ethernet hardly matter at those scales. I doubt there are GROUP BYs with cardinality in the billions.

Shard-Query duplicates non-sharded tables on all nodes, so you can use a star or snowflake schema. It does not support joining together tables sharded on different keys. Use Hadoop or Redshift for that; it is not my niche.

I do analytics with ICE (Infobright Community Edition), which is a compressing column store with an in-memory metadata system that replaces indexes. It is quite fast and, as I said, competes with Redshift; and Shard-Query is just as petabyte-scale as Redshift is.

I even got congratulations on my benchmark against Redshift from the manager of the Redshift team.

Shard-Query also implements window functions for MySQL, which does not support them natively.

SoftLayer has cloud systems available with NVIDIA K80 GPUs, FWIW. You can get a machine with two K80s = 24 GB of VRAM for under $2K per month.
Not the 192GB you guys are talking about, but enough to get started perhaps.

Actually 2 K80s = 48GB of VRAM.

Right on, and I should know, eh?!
One K80 = 24 GB, folks...

I'm just curious whether you have any thoughts about trying FPGAs for acceleration in the MapD project.

Can I run it using an NVIDIA GeForce 840M, which is based on the Maxwell architecture?