CUDA for SQL Group By Can CUDA perform SQL Group by efficiently?

Tekka · April 25, 2012, 5:52am

Is it at all possible for CUDA to perform SQL Group by like feature? if so, then how would it be done and would it be fast enough to be more beneficial than using SQL?

for example 150,000 rows of decimal values:

Combination 1: Column1 & Column2
SELECT Column1, Column2, COUNT(*), COUNT(Column3), Avg(Column4)
FROM Table
GROUP BY Column1, Column2

Combination 2: Column1 & Column3
SELECT Column1, Column3, COUNT(*), COUNT(Column4), Avg(Column5)
FROM Table
GROUP BY Column1, Column3

Currently SQL would average about 800 ms for each group but since there are 72 columns, there are about 800,000 combination possibilities of all columns for 2 to 4 columns. As you can see with some math, it would take about 7 days to complete.

CUDA has a similar example called Histogram which only counts the number of occurrence, which is almost like a basic group by with a COUNT(*).

Any help on how to accomplish this with CUDA would be grateful.

Gregory_Diamos · April 25, 2012, 9:50pm

This could work well on CUDA, but the choice of using a histogram or using a sort and segmented reduction will depend on how many histogram buckets you will end up with. If it is small (<<1000), use the histogram, if it is large, (>>1000) then try sorting the rows by the columns that you are reducing over to create a lot of small buckets and then do a reduction within each bucket.

parallelis · April 26, 2012, 7:40pm

What interests you is to count(*) the number of entries corresponding to each (columnA, columnB) couple. 150 000 rows with 2 doubles (2x8 Bytes) on each, so 2.2MB of data.
I would bet on single-threaded processing by the CPU, due to the CPU cache enabling the whole dataset to be cached. And instead sorting, I would go for a hash → 2 values + counter, that may fit into 4MB of cache (any basic CPU).

1 direct loop, no multithreading, NO SORTING, you just build your hashtable, adding a new element (count=1) or incrementing count when older value is found. I bet it will be faster than sending the data back and forth the GPU, not talking about executing a single line of Kernel code!

If you had 1GB+ of data, I would have chosen the GPU, where GPU memory subsystem would make it far faster than it’s CPU counter part, and maybe having as many counter per entry as the number of CUDA devices, to avoid atomic on global memory :)

Gregory_Diamos · April 26, 2012, 10:13pm

I might have interpreted the question incorrectly, but I thought he was accessing permutations of columns from a table with 72 columns, so the array size would be 72150,00016 = ~165MB.

“Currently SQL would average about 800 ms for each group but since there are 72 columns”

parallelis · May 1, 2012, 3:09pm

That’s right, the GPU would have been my choice for a much larger dataset!

Lars67 · May 1, 2012, 3:45pm

Perhaps take a look at Alenka GPU db. I belive this such functionality and is opensource.

parallelis · May 3, 2012, 6:32pm

I just took a look and it’s extremely promising for little databases (few GB) with huge concurrent access, would like to try it on a staging server to see how it compares with MongoDB!

Gregory_Diamos · May 3, 2012, 9:14pm

Alenka is currently built on thrust primitives. We’re hoping to make the individual set and join operations faster by a factor of 10 compared to what is currently in thrust.