Build (or) Send a Tree on/to the GPU


I am currently looking at tree and have the following structure

typedef struct nd{

    short int final;
    char number; 
    struct nd *child[2]; 


So my plan was to build the tree on the GPU, I get how to allocate memory, since I know exactly how many nodes I will have in there and can therefor avoid synchronisation, although I don’t see how to build the tree on the GPU.

Can I simply make some code like :

__global__ void Tree((node * tree, int size)) {

--> That's the part where I'm actually stuck.  How can I build my tree here 


int main(int argc, char **argv){


        cudaMalloc( (void**)&treeCUDA, sizeof(node)*256*256 );

	Tree<<<1,256>>>(treeCUDA, size);



Thanks for the help, I have also read that a tree could be contained in an array, are there any examples anywhere on how to do that ?


The tree-on-top-of-array approach seems worth exploring. In the context of some embedded software, where lists had to maintain excellent spatial locality for performance reasons, as well as being fast to save and restore, I have in the past successfully used a list-on-top-of-array approach.

By the same token, trees can be mapped to arrays. For a binary tree, Wikipedia describes one simple storage scheme that can be used (one can of course devise various alternate schemes):

Note I have not used trees on the GPU in my own work, so I am not claiming the tree-on-top-of-array approach is the way to go, just that it merits investigation.

Hi njuffa, I have looked at that as well, but I might have up to 256 children in the future, though I will implement both approaches to see what gives the best performances.

Thanks for the reply.

I don’t think you can actually build a tree on the device, can you? I thought cudaMalloc() was a host function only so the only thing you could do is allocate your tree on the device through the host and then process your allocated memory with the device.

(1) For information on device-side malloc() and free(), check out the CUDA C Programming Guide at

B.18. Dynamic Global Memory Allocation and Operations
Dynamic global memory allocation and operations are only supported by devices of compute capability 2.x and higher. […]

(2) The building of trees or other data structures does not necessarily require any kind of dynamic memory allocation. For example, programmers were doing in Fortran 77, i.e. back when Fortran had no support for dynamic memory allocation.

Oh wow, I’m a nublet.

Hey, thanks though. I’ma actually try making a binary tree on my GPU now (c.c. 2.1 so I can) and I’m probably just gonna use the array representation as c.c. 2.1 can’t support recursive function calls. Unless it can…?

I was under the impression that the introduction of an ABI with sm_2x, which provides function calling conventions and a stack should allow the use of recursive functions, but I have not tried it and will defer to the documentation. You may need to increase the default stack size depending on the depth of the recursion. I hazard a guess that recursive functions are probably not the fastest way to traverse a tree. The compiler may convert end-recursion into iteration, but I have not looked into that.

I think I’ve tried it before and it told me I need an architecture of 3.5 or higher. But you might be right, Idk. I’ll look into it.

Best I can determine, recursion has been supported since CUDA 3.1, on sm_2x hardware and up:

•New language features added to CUDA C / C++ include:
•Support for printf() in device code
•Support for function pointers and recursion make it easier to port many existing algorithms to Fermi GPUs

Well, if it is a binary tree, it is easier to use an array representation.

but to answer some questions, the easiest way is to create the tree on the host as an array and send the array, I’m actually working on that right now.