GPUs and Databases

eyalhir74 · November 29, 2009, 12:40pm

Hi,
Is anyone aware of recent or current research, or a company, that tries to offload database calculations/operations
to the GPU?

I’ve looked on Google, but couldn’t find much relevant information besides a few articles.

thanks
eyal

Gregory_Diamos · November 29, 2009, 3:22pm

Are you looking for a finished product that you could use or research papers and tech reports on the subject?

eyalhir74 · November 29, 2009, 4:21pm

Both :) I’d like to understand whether such a thing is doable and, if people have tried it, what the expected boost is.

thanks

eyal

Gregory_Diamos · November 29, 2009, 7:26pm

Database operations are typically extremely memory bound. The ratio of GPU to CPU memory bandwidth is typically around 10:1, so there is an advantage assuming that you can fit a data set entirely into the GPU memory. This might seem like a very big restriction, but we have found some situations where it works well. We have also found that it is easier to achieve close to the hardware limits on memory bandwidth for GPUs than for multi-core CPUs, where we typically have to do very low level programming (explicit prefetch instructions, tuning the memory traversal pattern to the cache organization, being careful that multiple threads do not thrash in the cache) to get reasonable performance.

Many database operations can be trivially partitioned and processed independently, which makes multi-GPU implementations possible. Oracle and others have had success with clustered implementations. The classical examples use hash functions or dimension order reductions to create a distributed data structure where elements with the same keys are always mapped to the same node. This allows even operations like JOIN to be processed in parallel. If you have a cluster with a relatively large number of GPUs, you can process queries on data sets of 10s or 100s of GBs entirely in memory and entirely in parallel. I have worked on a few production-level applications that reach between 40% and 80% of the peak memory bandwidth on a GTX 285 for data sets that actually fit in memory.
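
A minimal Python sketch of the hash-partitioning idea, for illustration only; it is not the implementation discussed in this thread. The function names, the node count, and the sample rows are all hypothetical, and each "node" is just a Python list standing in for one GPU's memory.

```python
from collections import defaultdict

def partition_by_key(rows, num_nodes):
    """Send each (key, value) row to the node chosen by hashing its key."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in rows:
        nodes[hash(key) % num_nodes].append((key, value))
    return nodes

def local_join(r_part, s_part):
    """Ordinary hash join of two partitions that live on the same node."""
    index = defaultdict(list)
    for key, r_val in r_part:
        index[key].append(r_val)
    return [(key, r_val, s_val) for key, s_val in s_part for r_val in index[key]]

def partitioned_join(r_rows, s_rows, num_nodes=4):
    """Join R and S on their key; each node's work is independent of the others."""
    r_nodes = partition_by_key(r_rows, num_nodes)
    s_nodes = partition_by_key(s_rows, num_nodes)
    result = []
    for r_part, s_part in zip(r_nodes, s_nodes):  # this loop could run in parallel
        result.extend(local_join(r_part, s_part))
    return result

# Hypothetical sample data: (customer_id, item) and (customer_id, name).
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
customers = [(1, "alice"), (2, "bob"), (4, "carol")]
joined = partitioned_join(orders, customers)
```

Because both tables use the same hash function on the same key, matching rows always land on the same node, so no node needs data from any other node during the join.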

For data sets that do not fit in GPU memory, the data has to be streamed in from disk and is limited to the disk bandwidth; it is not slower than the CPU implementation, but not significantly faster either.
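
A rough back-of-the-envelope sketch of why the disk becomes the limit. It assumes a simple streaming model in which a query runs at the speed of its slowest stage, and it ignores PCIe transfers and compute time. All bandwidth figures are made-up illustrative values chosen only to match the 10:1 GPU-to-CPU ratio mentioned earlier; they are not measurements.

```python
def scan_seconds(data_bytes, bandwidth_bytes_per_s):
    """Time to stream the data once through a link of the given bandwidth."""
    return data_bytes / bandwidth_bytes_per_s

def query_seconds(data_bytes, stage_bandwidths):
    """A streaming pipeline runs at the speed of its slowest stage."""
    return scan_seconds(data_bytes, min(stage_bandwidths))

GB = 10**9
# All bandwidth figures below are illustrative assumptions, not measurements.
DISK_BW = 0.1 * GB      # hypothetical disk
CPU_MEM_BW = 15 * GB    # hypothetical CPU memory
GPU_MEM_BW = 150 * GB   # hypothetical GPU memory (10:1 versus the CPU)

data = 1 * GB
in_memory_cpu = query_seconds(data, [CPU_MEM_BW])
in_memory_gpu = query_seconds(data, [GPU_MEM_BW])
from_disk_cpu = query_seconds(data, [DISK_BW, CPU_MEM_BW])
from_disk_gpu = query_seconds(data, [DISK_BW, GPU_MEM_BW])
```

Under these assumptions the in-memory GPU scan is ten times faster than the CPU scan, while the two from-disk times are identical because the disk is the slowest stage in both.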

None of this is currently published or released open source. It may be eventually, but not before a product is released. If you are a researcher in this area or represent a company that might be interested in this, send me a PM and I could put you in contact with the company that I am working with. They are very supportive of academic collaboration and would also be interested in potential clients or partners. You would probably need to sign an NDA for detailed info though.

eyalhir74 · November 29, 2009, 8:10pm

Hi Gregory,

Thanks a lot for the information :) I’m not in academia; I’m mainly just wondering whether this area could be a candidate for a startup.

I find it weird that there is no production-level product yet. You mentioned Oracle: how come a company of that magnitude doesn’t have a working solution with GPUs? Do you know if this is something they are working on or might release soon?

“Database operations are typically extremely memory bound” - can you point me to an article or a reference that gives more details on this? I was always under the impression (from our production environment benchmarks) that they are usually IO/storage bound.

When you say “you can process queries on data sets of 10s or 100s of GBs”, do you mean that with ~25 Teslas you can cover ~100GB and thus partition your data over all the GPUs? Is there some reference or guideline on how to go about implementing such operations?

How many development years would you say are required to create an Oracle plugin, for example, capable of doing the basic stuff with the GPU? And more complex queries/joins/…?

Wow, the more I think about it, the more amazing it is… :)

thanks a lot :)

eyal

Gregory_Diamos · November 30, 2009, 12:28am

By memory bound I mean that the ratio of arithmetic instructions to memory instructions in a database program will be much lower than in a scientific workload. The memory hierarchy (cache->memory->disk) of a system will be more important to this class of application than the core of the processor. People typically consider database applications to be so large that the datasets necessarily must use the entire memory hierarchy including disk, and thus almost all system designers assume that the entire database will reside on-disk. Even in these cases, main memory is used as a cache, and for a certain class of applications, it is possible to manage the mapping of on-disk data into memory such that the majority of queries are actually serviced out of main memory without ever accessing disk.
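
To put rough numbers on that ratio, here is a small sketch comparing operations per byte for a database-style column SUM against a dense matrix multiply. The function names and sizes are hypothetical, values are assumed to be 4 bytes each, and the matrix multiply is assumed to move each matrix through memory exactly once (an idealized best case).

```python
def column_sum_intensity(num_rows, bytes_per_value=4):
    """Arithmetic operations per byte read for SUM over one column."""
    ops = num_rows                           # one add per value
    bytes_read = num_rows * bytes_per_value  # every value is read once
    return ops / bytes_read

def matmul_intensity(n, bytes_per_value=4):
    """Operations per byte for an n x n matrix multiply (idealized)."""
    ops = 2 * n**3                             # about n multiplies and n adds per output element
    bytes_moved = 3 * n * n * bytes_per_value  # read A and B, write C, once each
    return ops / bytes_moved

db_ratio = column_sum_intensity(1_000_000)   # 0.25 ops per byte, independent of size
sci_ratio = matmul_intensity(1024)           # n / 6 ops per byte, grows with n
```

The column SUM stays at a quarter of an operation per byte no matter how large the table is, so its speed is set almost entirely by how fast memory can deliver data.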

Yes, it turns out that there are a significant number of applications that do not require more than a few hundred GB to store a given data set. For example, we have run some of the 100GB TPC-H benchmarks against entirely GPU-based systems. Even if the entire database is several TBs, the set that is actively used is not necessarily this large (for example, queries against the last day or week of records rather than the lifetime of the database).

I haven’t really actively worked with Oracle (I was just referring to their RAC implementation as an example of a clustered database), so I am not sure exactly how long it would take to implement this as a plugin. The implementation that I worked on was an engine for executing database queries at a lower level than SQL. It executed an IR that SQL and other languages were compiled into. The existing runtime was something like a 200-500k line codebase of C++. Supporting GPU versions of all of the primitive operations was done in about 30,000 lines of C++ and CUDA. The basic data structures for storing the distributed database in GPU memory were around 10,000 lines of CUDA/C++. It was only so small because most of the infrastructure already existed. For our implementation, it took about 3 engineers working full time and 2 academic collaborators working part time for about a year to get to the point where we could run a few non-trivial applications.

eyalhir74 · November 30, 2009, 5:11am

Would the cache->memory->disk paradigm be taken care of automatically, as it is today by a “regular” database? Or would you make some internal changes to the DB engine to make it more GPU friendly?

Wow, this is amazing… :) Could you please specify which DB you used?

I thought you meant that Oracle had played around with GPUs; RAC is a “simple” distributed CPU cluster.

I find it very weird that there are still no production-level components from Oracle/MS or even nVidia that would offload this stuff to the GPU - a RAC version with GPUs would be really amazing :)

Thanks a lot Gregory for all the information…

eyal

eyalhir74 · November 30, 2009, 6:58am

Hi Gregory,
Another question, please :)
I guess DB applications can be roughly divided into OLTP and data warehouse applications. Is there any reason
to use GPUs in OLTP applications?
In data warehouse applications, for example, I’d want to sum the amount of products sold today/this week/whatever.
This again, I think, falls under what you wrote in a previous post - bring the data from disk to memory and to cache
and then do the sum/max/min/… - so here again the GPU probably won’t have much effect on performance.
This leaves us with only data warehouse applications where the users repeatedly run queries on roughly the same or
most-used data, and the data already resides in memory? Is that correct?
Do you know the % of such applications that can actually benefit from GPUs?

Also, how does RAC fit in? Again for data warehouse applications using RAC? Is OLTP out of the GPU game?

Turned out to be more than one question :)

thanks again
eyal

Sarnath · November 30, 2009, 7:49am

I remember a GPU implementation of some SQL joins - it was a paper… They reported a 2x to 7x performance increase… But I don’t know the details…

btw, also look at Netezza in the data warehousing space. They have storage-centric compute, which they call the OnStream paradigm…

Gregory_Diamos · November 30, 2009, 6:03pm

I think you mean this paper: http://www.cse.ust.hk/catalac/papers/gpujoin_sigmod08.pdf . It is possible to do better than this.

Gregory_Diamos · November 30, 2009, 7:05pm

As you mention, OLTP applications would be very difficult to accelerate due to the distributed nature of the system and the relatively small volume of data; most of the time is spent in small queries and updates to the database.

Yes I would agree with that statement. My argument would be that the volume of data that can fit in memory is large enough to handle many real world applications.

In my opinion, the biggest advantages of GPUs are for business intelligence and data forecasting applications that combine analysis with queries.

The basic idea behind clustered databases is that the database is distributed across many nodes, all of which can operate in parallel. For systems with disks, this means that all nodes can access their individual disks in parallel. Basic operations like joins on the same key space, select, project, etc can be trivially parallelized. More complex operations like joins on different key spaces e.g. JOIN( R(k,v), S(v,s) ) = N(k,s) potentially require data to be exchanged before the nodes can proceed in parallel.
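
A minimal sketch of the data-exchange step for the JOIN( R(k,v), S(v,s) ) = N(k,s) case, assuming R starts out partitioned by k and S by v. The function names, node count, and sample rows are hypothetical, and Python lists stand in for per-node memory; it only illustrates the idea and is not how any particular clustered database does it.

```python
def node_for(value, num_nodes):
    """Pick the node that owns a given column value."""
    return hash(value) % num_nodes

def repartition(nodes, column, num_nodes):
    """Exchange step: move every row to the node that owns its `column` value."""
    new_nodes = [[] for _ in range(num_nodes)]
    for part in nodes:
        for row in part:
            new_nodes[node_for(row[column], num_nodes)].append(row)
    return new_nodes

def join_r_s(r_nodes, s_nodes, num_nodes):
    """N(k, s) from R(k, v) partitioned by k and S(v, s) partitioned by v."""
    r_by_v = repartition(r_nodes, column=1, num_nodes=num_nodes)  # data exchange
    result = []
    for r_part, s_part in zip(r_by_v, s_nodes):  # independent per node from here on
        s_index = {}
        for v, s in s_part:
            s_index.setdefault(v, []).append(s)
        for k, v in r_part:
            for s in s_index.get(v, []):
                result.append((k, s))
    return result

NUM_NODES = 3
R = [(1, "a"), (2, "b"), (3, "a")]        # hypothetical R(k, v)
S = [("a", "x"), ("b", "y"), ("c", "z")]  # hypothetical S(v, s)
r_nodes = repartition([R], column=0, num_nodes=NUM_NODES)  # R is stored by k
s_nodes = repartition([S], column=0, num_nodes=NUM_NODES)  # S is stored by v
N = join_r_s(r_nodes, s_nodes, NUM_NODES)
```

Only R has to move here, because S is already laid out by the join column; once R is regrouped by v, every node can finish its part of the join without talking to the others.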

I think that clustered implementations can be applied with equal effectiveness to any database application. It really only makes sense if you need either 1) more data than can be stored in a single node, or 2) more bandwidth than can be supplied by a single node.

heshsham_India · December 3, 2009, 7:59am

The following paper might be helpful here: [link no longer available; it pointed to a page at the Institute of Computer Science-FORTH]
They have developed a network intrusion detection system using CUDA on GPUs, which frequently links with a database.
