Variable length data structure in CUDA How to make a variable length data structure like list effici

I am using a kernel to scan input stream (size N), 0-n results may generated at each kernel. So a result stream of size M (M != N) will be generated and use as input stream again.

The problem is: are there any efficient method to collect these results?

Currently I used a array with an index, and the index is increased by atomicAdd. But I don’t think it is a good way. Could anyone have experiences on similar problem for sharing?

Thanks in advance,


It’s a very common problem, both for reading and writing. This occurs a lot with “jobs” you’re dynamically assigning or reporting results with.

AtomicAdd DOES work but as you say isn’t the most efficient. One trick is to report results in CHUNKS. Accumulate answers in your block shared memory, and when you get “many” of them, use ONE atomic add to make room for the whole group. This works well when your results are small and common.

Another common strategy is to allocate a fixed length array, partition it so each block “owns” a range (or sometimes each input item corresponds to an output). Then there can not be any multiple writes. The disadvantage in this case is there are holes in the output that the user needs to skip over.
These holes can also be removed by using a compaction pass.

A super-fancy way is to use manual dynamic allocation especially if output order doesn’t matter. At runtime, blocks request an output range. This range is found and owned safely using atomic exchange. The block writes its results, and “returns” (via another atomic exchange) any leftover room to the allocator (which stores available partial chunks as a linked list). Later chunks will be able to fill in the gaps with more results. This works pretty well for me when parallel building variable length KD trees for raytracing… it’s actually a full dynamic memory allocator and not just a simple list.

Thanks. I will try to use share memory. It looks more efficient.

The the sencod one, a stream reduction pass will need to remove empty items.

For the third one, how about its efficiency? I think it is a more generic solution for this problem, but it looks complex for implementation.