What kind of data structures do you use?


I am just wondering what kinds of data structures people use with CUDA.

I am sure I can think of single- and multi-dimensional arrays of ints/floats/other data types…

I am just curious whether any of you have ever used any of the following data structures on the GPU:

  1. Linked lists (singly / doubly linked)
  2. Circular lists
  3. Hash tables
  4. Adjacency lists for graphs and trees
  5. Complex data structures linked by pointers, requiring deep copies…

Thanks for your time,

Best Regards,

I have used linked lists on the CPU in FORTRAN, and I plan to use some tree-based structures in my next CUDA code… apart from that, none…


Thank you, Nitin…

I believe linked lists and similar data structures don't show up much in the HPC arena…

It is all arrays and multi-dimensional stuff, I guess.

Interesting question!!!
For myself, I need to use FIFO queues, hash tables and other such memory structures in CUDA.

But as they are stored in global memory, they hurt performance badly unless you use many, many tricks.

Why are you asking about them?

And that's for the chess thing, right? (from your signature)

Interesting that you get to use all these…


Was this application developed from scratch on the GPU? (I would bet on this one)


Or did you port a CPU implementation?

Yeah, it must have been tough. These structures are not parallel-friendly… or maybe you need parallel-friendly versions of these data structures…

I am just designing something, so I was looking at which features would matter and which would not. It's more like a survey to find out what a potential user would need.

When I am done, we will make an alpha release on the forum. Let us see how that one goes…

THANKS a lot for answering!

Binary trees more than anything, but also circular buffers (not lists) and hash tables (of sorts).

For what it’s worth, I implement all of the above as structures wrapping data in a texture - except the circular buffer.
(1D texture for binary trees, 2D texture for hash-table-like things)


Thanks for your answer.

I guess you use textures for the read-only data structures. Also, the pointer chasing seems to be eased by textures…

And the circular buffer must be a RW data structure. Is there any other reason why you don't use a texture for accessing the circular buffer?

And I guess you copy the binary tree / hash table from host to GPU memory. That must have been terrible work, wasn't it?
Do you replicate the binary tree as-is in GPU memory, or do you use a different form of it?

By the way, what kind of problem are you solving? I am interested to know what kind of HPC problems require all this stuff…

Thanks a LOT for your time,

Best Regards,

No other reason than the fact that my circular buffers are always RW (larger ones live in gmem/lmem; I also use smaller ones as a cache of sorts, in smem).

And yes, I construct the tree on the CPU and copy it over to the GPU. I specifically designed these tree structures such that the memory layout is identical between the CPU and GPU (to the benefit of the GPU; the CPU has to do extra work to lay out memory this way), so copying from CPU to GPU is just a single copy - not that bad - and these are constructed once-off anyway, so the copying overhead isn't an issue. However, I do have plans to actually build these trees on the GPU as well, when the need arises to rebuild these trees often.

I’m mostly dealing with real-time physics simulations and computer graphics (ray tracing) at home, and my day job is computer vision & machine learning work (also CUDA). CV doesn’t really use too many special data structures; machine learning uses hash tables among other things, depending on what you’re doing.


Do you construct the tree inside a big array and use an index as the next-pointer, so you really don't need to mess around with pointers? Is that right?

Do you construct your hash tables that way too?

This way you can avoid deep-copying data structures… Is that right?

I see. Interesting! Thanks!

I’ve been using the following:

[codebox]struct xs
{
    // these are micros
    float *rad_cap;
    float *fission;
    float *elastic;
    float *inelastic;
    float *Emesh;
    legendres coeffs;
    float a, b, c, d;
    int mesh_length;
};

struct xsinfo
{
    xs *cross_sec;
    int *list;
    int listlength;
};[/codebox]
With deep copies. This came about because I was mostly porting a code; I think I would have done this differently had I started from scratch on the GPU, though. Man, it was a pain to copy that data structure to the GPU, including the data pointed to by the xs*…

This is exactly the kind of stuff I was looking for… Deep copies… They can really fry the programmer :-)

struct xsinfo
{
    xs *cross_sec;
    int *list;
    int listlength;
};
In the structure above, how many entries does “cross_sec” have? Was it an array of size “listlength”, or just a reference to a single structure?

Can you tell us what kind of application it is (gaming, FEA, physics simulation, etc.)?

THANKS a LOT for posting!

And I have the exact same opinion as you. Only porting projects pose such challenges. Thanks!

Best Regards,


Yes, this is for a Chess Engine in CUDA.

It is NOT an attempt to port a classical CPU chess engine to the GPU, but rather to bring classical chess-engine algorithms to the GPU, with algorithms devised for the GPU and multi-threading organized around the different context levels of a CUDA GPU (scalar processor / multiprocessor / GPU / multi-GPU in a PC), their weaknesses and incredible strengths.

The hash table is classic in computer chess, so I need it to accelerate computations in end-game situations - or at least I need to find an implementation that is satisfying in terms of performance on a GPU :-)

And I need FIFO queues of all sorts for thread communication, trying to avoid atomic functions as much as possible (too slow).

More or less: my texture is basically one big array of ‘nodes’ in the tree, and instead of having left/right child pointers, I have indices (which reference back into the texture).

This avoids pointer issues, making the code more portable overall (you can copy the same data structure between almost any architectures seamlessly, with no intermediate fix-ups required, be it CPU<->CUDA<->OpenCL).

Not quite - the hash tables are actually generated on the GPU (using smem/gmem) and then referenced as textures later on in another kernel. But I still use the same methodology as with the trees (avoiding pointers, using indices).

Yup! :) But the downside is that, depending on the structure, it’s not easy to add/remove elements while maintaining any sane ordering that takes advantage of the texture cache (since I know how I’m going to index the data, order-wise).

Hi Sarnath.

Trust me, I’ve been frying - all weekend and today, to be exact. This is a physics simulation code.

cross_sec is malloc’ed (cudaMalloc) to hold fewer than 20 elements (it could be larger, but I don’t think I’ll ever use more), and each of those elements then points to an xs struct. The xs struct member arrays (fission, rad_cap, etc.) are malloc’ed to be anywhere from 100 to 33,000 elements long.

Any more information I can give you?



Thanks for your answers… While using atomics, one needs to make sure that only one representative thread from each block fights for the block-level lock.
Then the threads inside the block fight for a shared-memory lock before one thread finally gets exclusive access.
This way it works faster. If you let all threads fight for the lock directly, it will be very, very slow. I am sure you know this already - just my 2 cents…

Nice to know. It is really a smart way of doing things!

@Adam (laxsu19),
Thanks for the info. I just wanted to know whether there are people suffering from this problem… Good to know there are areas that require all these deep copies. Good luck!

Thanks a lot for all your time!

Best Regards,