What kind of data structures do you use?


I am just wondering what kinds of data structures people use with CUDA.

I am sure I can think of single- and multi-dimensional arrays of ints/floats/other data types…

I am just curious whether any of you have ever used any of the following data structures on the GPU:

  1. Linked lists (singly / doubly linked)
  2. Circular lists
  3. Hash tables
  4. Adjacency lists for graphs and trees
  5. Complex data structures linked by pointers, requiring deep copies…

Thanks for your time,

Best Regards,

I have used linked lists on the CPU in FORTRAN, and I plan to use some tree-based structures in my next CUDA code… apart from that, none…


Thank you, Nitin…

I believe linked lists and similar data structures don't show up much in the HPC arena…

It is all arrays and multi-dimensional stuff, I guess.

Interesting question!!!
For myself, I need to use FIFO queues, hash tables and other such memory structures in CUDA.

But as they are stored in global memory, they hurt performance badly unless you use many, many tricks.

Why are you asking about them?

And that's for the chess thing, right? (from your signature)

Interesting that you get to use all these…


Was this application developed from scratch on the GPU? (I would bet on this one)


Or did you port a CPU implementation?

Yeah, it must have been tough. These structures are not parallel-friendly… or maybe you need parallel-friendly versions of these data structures…

I am just designing something, so I was looking at which features would matter and which would not. It's more like a survey to find out what a potential user would need.

When I am done, we will make an alpha release on the forum. Let us see how that one goes…

THANKS a lot for answering!

Binary trees more than anything, but also circular buffers (not lists) and hash tables (of sorts).

For what it’s worth, I implement all of the above as structures wrapping data in a texture - except the circular buffer.
(1D texture for binary trees, 2D texture for hash-table-like things)


Thanks for your answer.

I guess you use textures for the read-only data structures. Also, the pointer chasing seems to be eased by textures…

And the circular buffer must be a RW data structure. Is there any other reason why you don't use a texture for accessing the circular buffer?

And I guess you copy the binary tree / hash table from host to GPU memory. That must have been terrible work, wasn't it?
Do you replicate the binary tree as-is in GPU memory, or do you use a different form of it?

By the way, what kind of problem are you solving? I am interested to know what kind of HPC problems require all this stuff…

Thanks a LOT for your time,

Best Regards,

No other reason than the fact that my circular buffers are always RW (larger ones live in gmem/lmem; I also use smaller ones as a cache of sorts, in smem).

And yes, I construct the tree on the CPU and copy it over to the GPU. I specifically designed these tree structures such that the memory layout is identical between the CPU and GPU (to the benefit of the GPU; the CPU has to do extra work to lay out memory this way), so copying from CPU to GPU is just a single copy - not that bad - and these are constructed once-off anyway, so the copying overhead isn't an issue. However, I do have plans to actually build these trees on the GPU as well, when the need arises to rebuild these trees often.

I’m mostly dealing with real-time physics simulations and computer graphics (ray tracing) at home, and my day job is computer vision & machine learning work (also CUDA). CV doesn’t really use too many special data structures; machine learning uses hash tables among other things, depending on what you’re doing.


Do you construct the tree inside a big array and use an index as the next-pointer, so you really don't need to mess around with pointers? Is that right?

Do you construct your hash tables that way too?

This way you can avoid deep-copying data structures… Is that right?

I see. Interesting! Thanks!

I’ve been using the following:

[codebox]struct xs
{
    // these are micros
    float *rad_cap;
    float *fission;
    float *elastic;
    float *inelastic;
    float *Emesh;
    legendres coeffs;
    float a, b, c, d;
    int mesh_length;
};

struct xsinfo
{
    xs *cross_sec;
    int *list;
    int listlength;
};[/codebox]
With deep copies. This came about because I was mostly porting a code; I think I would have done this differently had I started from scratch on the GPU, though. Man, it was a pain to copy that data structure to the GPU, including the data pointed to by the xs*…

This is exactly the kind of stuff I was looking for… Deep copies… They can really fry the programmer :-)

struct xsinfo
{
    xs *cross_sec;
    int *list;
    int listlength;
};
In the structure above, how many entries does “cross_sec” have? Was it an array of size “listlength”, or just a reference to a single structure?

Can you tell us what kind of application it is (gaming, FEA, physics simulation, etc.)?

THANKS a LOT for posting!

And I have the exact same opinion as you. Only porting projects pose such challenges. Thanks!

Best Regards,


Yes, this is for a Chess Engine in CUDA.

It is NOT an attempt to port a classical CPU chess engine to the GPU, but rather to bring classical chess-engine algorithms to the GPU, with algorithms devised for the GPU and multi-threading organized around the different context levels of a CUDA GPU (scalar processor / multiprocessor / GPU / multi-GPU in a PC), their weaknesses and incredible strengths.

The hash table is classic in computer chess, so I need it to accelerate computations in end-game situations - or at least I need to find an implementation that is satisfying in terms of performance on a GPU :-)

And I need FIFO queues of all sorts for thread communication, trying to avoid atomic functions as much as possible (too slow).

More or less: my texture is basically one big array of ‘nodes’ in the tree, and instead of having left/right child pointers, I have indices (which reference back into the texture).

This avoids pointer issues, making the code more portable overall (you can copy the same data structure between almost any architectures seamlessly, with no intermediate fix-ups required, be it CPU<->CUDA<->OpenCL).

Not quite - the hash tables are actually generated on the GPU (using smem/gmem) and then referenced as textures later on in another kernel. But I still use the same methodology as with the trees (avoiding pointers, using indices).

Yup! :) But the downside is that, depending on the structure, it’s not easy to add/remove elements while maintaining any sane ordering that takes advantage of the texture cache (since I know how I’m going to index the data, order-wise).

Hi Sarnath.

Trust me, I’ve been frying - all weekend and today, to be exact. This is a physics simulation code.

cross_sec is malloc’ed (cudaMalloc) to hold fewer than 20 elements (it could be larger, but I don’t think I’ll ever use more), and each of those elements then points to an xs struct. The xs struct member arrays (fission, rad_cap, etc.) are malloc’ed to be anywhere from 100 to 33,000 elements long.

Any more information I can give you?



Thanks for your answers… While using atomics, one needs to make sure that only one representative thread from each block fights for the block-level lock.
Then the threads inside the block fight for a shared-memory lock before one thread finally gets exclusive access.
This way it works faster. If you let all threads fight for the lock directly, it will be very, very slow. I am sure you know this already - just my 2 cents…

Nice to know. It is really a smart way of doing things!

@Adam (laxsu19),
Thanks for the info. I just wanted to know whether there are people suffering from this problem… Good to know there are areas that require all these deep copies. Good luck!

Thanks a lot for all your time!

Best Regards,