b+tree CUDA implementation

Hi All,

I am looking for a B+tree implementation in CUDA. Perhaps somebody heard something. Internet searches generally return nothing.


I tried implementing one of these about two years ago back in the gt100 (8800GTX) days (before atomics). The conclusion that I came to was that B+/-/etc trees were optimized for a series of sequential accesses and not suitable for a large number of concurrent accesses from threads executing on a GPU. Most approaches using Btrees could be transformed into either binary searches over dense sorted arrays or bulk operations that consumed a partitioned array, transformed it, and returned a new array. I don’t think that this changes much today as atomics are still slow compared to uncontested memory accesses.

Thanks for reply.
I already implemented parts of B+Tree and realized that atomics and threadfence is unfortunately necessary.
However, I need to search millions of records and in the same time I need to insert new records as quickly as possible.
Dense sorted array is very good in searching but I believe that much slower in inserting, plus I have no memory to create a new array by copying the old one.
B+Tree is the only solution which suits my needs. Or maybe there is something else?

You might take a look at a paper from SIGMOD last year:

FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs. [ pdf ]

They explore designing basic binary trees for this type of operation on GPUs and multicore CPUs.

At a high level, I would say that getting high performance out of a B+ tree (or related indexing structure) on a GPU is definitely challenging, as it involves conditional computation.

Thanks, I know this paper, however I was looking for some ready to use library.

Now, I have almost finished implementing a B+tree in CUDA already. I just must find some time for final testing and tunings. For example finding the optimal page size. I also see many places for optimization like removal of all atomics. It is possible that I publish some results within several weeks.

I tried to move most of the conditional code to the host assuming that it is better to run several short kernels than a bigger one with many long branches inside.

It is then challenging to minimize PCI transfer with this kind of architecture, etc. I think a lot of research must be still done in this field.

@Ramboo: Thanks for input on B+ tree implementation in CUDA. I am too looking for a suitable B+ tree implementation in CUDA which will be connected to Hash Map to query a very large database of RDF records. Have you been able to get a suitable B+ tree implemented using CUDA. Would like to hear from you on the experience, is it worth going for CUDA implementation??