What kind of data structures do you use?


I am just wondering what kinds of data structures people use with CUDA.

I am sure I can think of single- and multi-dimensional arrays of ints/floats/other data types…

I am just curious whether any of you have ever used any of the following data structures on the GPU:

  1. Linked lists (singly / doubly linked)
  2. Circular lists
  3. Hash tables
  4. Adjacency lists for graphs and trees
  5. Complex data structures linked by pointers, requiring deep copies…

Thanks for your time,

Best Regards,

I have used linked lists on the CPU in FORTRAN, and I plan to use some tree-based structures in my next CUDA code… apart from that, none…


Thank you, Nitin…

I believe linked lists and similar data structures don't show up much in the HPC arena…

It is all arrays and multi-dimensional stuff, I guess.

Interesting question!!!
For myself, I need to use FIFO queues, hash tables and other such memory structures in CUDA.

But as they are stored in global memory, they hurt performance badly unless you use many, many tricks.

Why are you asking about them?

And that's for the chess thing, right? (from your signature)

Interesting that you get to use all these…


Was this application developed from scratch on the GPU? (I would bet on this one)


Or did you port a CPU implementation?

Yeah, it must have been tough. These structures are not parallel-friendly… or maybe you need parallel-friendly versions of these data structures…

I am just designing something, so I was looking at which features would matter and which would not. It's more like a survey to find out what a potential user would need.

When I am done, we will make an alpha release on the forum. Let us see how that one goes…

THANKS a lot for answering!

Binary trees more than anything, but also circular buffers (not lists) and hash tables (of sorts).

For what it’s worth, I implement all of the above as structures wrapping data in a texture - except the circular buffer.
(1D texture for binary trees, 2D texture for hash-table-like things)


Thanks for your answer.

I guess you use textures for the read-only data structures. Also, the pointer chasing seems to be eased by textures…

And the circular buffer must be a RW data structure. Is there any other reason why you don't use a texture for accessing the circular buffer?

And I guess you copy the binary tree / hash table from host to GPU memory. That must have been terrible work, wasn't it?
Do you replicate the binary tree as-is in GPU memory, or do you use a different form of it?

By the way, what kind of problem are you solving? I am interested to know what kind of HPC problems require all this stuff…

Thanks a LOT for your time,

Best Regards,

No other reason than the fact that my circular buffers are always RW (larger ones live in gmem/lmem; I also use smaller ones as a cache of sorts, in smem).

And yes, I construct the tree on the CPU and copy it over to the GPU. I specifically designed these tree structures such that the memory layout is identical between the CPU and GPU (to the benefit of the GPU; the CPU has to do extra work to lay out memory this way), so copying from CPU to GPU is just a single copy - not that bad - and these are constructed once-off anyway, so the copying overhead isn't an issue. However, I do have plans to actually build these trees on the GPU as well, when the need arises to rebuild these trees often.

I’m mostly dealing with real-time physics simulations and computer graphics (ray tracing) at home, and my day job is computer vision & machine learning work (also CUDA). CV doesn’t really use too many special data structures; machine learning uses hash tables among other things, depending on what you’re doing.


Do you construct the tree inside a big array and use an index as the next-pointer, so you really don't need to mess around with pointers? Is that right?

Do you construct your hash tables that way too?

This way you can avoid deep-copying data structures… Is that right?

I see. Interesting! Thanks!

I’ve been using the following:

[codebox]struct xs
{
    // these are micros
    float *rad_cap;
    float *fission;
    float *elastic;
    float *inelastic;
    float *Emesh;
    legendres coeffs;
    float a, b, c, d;
    int mesh_length;
};

struct xsinfo
{
    xs *cross_sec;
    int *list;
    int listlength;
};[/codebox]
With deep copies. This came about because I was mostly porting a code; I think I would have done this differently had I started from scratch on the GPU, though. Man, it was a pain to copy that data structure to the GPU, including the data pointed to by the xs*…

This is exactly the kind of stuff I was looking for… Deep copies… They can really fry the programmer :-)

struct xsinfo
{
    xs *cross_sec;
    int *list;
    int listlength;
};
In the structure above, how many entries does “cross_sec” have? Was it an array of size “listlength”, or just a reference to a single structure?

Can you tell us what kind of application it is (gaming, FEA, physics simulation, etc.)?

THANKS a LOT for posting!

And I have the exact same opinion as you. Only porting projects pose such challenges. Thanks!

Best Regards,


Yes, this is for a Chess Engine in CUDA.

It is NOT an attempt to port a classical CPU chess engine to the GPU, but rather to bring classical chess-engine algorithms to the GPU, with algorithms devised for the GPU and multi-threading organized around the different context levels of a CUDA GPU (scalar processor / multiprocessor / GPU / multi-GPU in a PC), their weaknesses and incredible strengths.

The hash table is classic in computer chess, so I need it to accelerate computations in end-game situations - or at least I need to find an implementation that is satisfying in terms of performance on a GPU :-)

And I need FIFO queues of all sorts for thread communication, trying to avoid atomic functions as much as possible (too slow).

More or less: my texture is basically one big array of ‘nodes’ in the tree, and instead of having left/right child pointers, I have indices (which reference back into the texture).

This avoids pointer issues, making the code more portable overall (you can copy the same data structure between almost any architectures seamlessly, with no intermediate fix-ups required, be it CPU<->CUDA<->OpenCL).

Not quite - the hash tables are actually generated on the GPU (using smem/gmem) and then referenced as textures later on in another kernel. But I still use the same methodology as with the trees (avoiding pointers, using indices).

Yup! :) But the downside is that, depending on the structure, it’s not easy to add/remove elements while maintaining any sane ordering that takes advantage of the texture cache (since I know how I’m going to index the data, order-wise).

Hi Sarnath.

Trust me, I’ve been frying - all weekend and today, to be exact. This is a physics simulation code.

cross_sec is malloc’ed (cudaMalloc) to hold fewer than 20 elements (it could be larger, but I don’t think I’ll ever use more), and each of those elements then points to an xs struct. The xs struct member arrays (fission, rad_cap, etc.) are malloc’ed to be anywhere from 100 to 33,000 elements long.

Any more information I can give you?



Thanks for your answers… While using atomics, one needs to make sure that only one representative thread from each block fights for the block-level lock.
Then the threads inside the block fight for a shared-memory lock before one thread finally gets exclusive access.
This way it works faster. If you let all threads fight for the lock directly, it will be very, very slow. I am sure you know this already - just my 2 cents…

Nice to know. It is really a smart way of doing things!

@Adam (laxsu19),
Thanks for the info. I just wanted to know whether there are people suffering from this problem… Good to know there are areas that require all these deep copies. Good luck!

Thanks a lot for all your time!

Best Regards,