Gathering results at the end of computation -- looking for a mailbox


Thanks to the GPU, I should be able to run some expensive computation in parallel with a very large number of threads. I’m currently struggling to come up with a good scheme for gathering those results.

For the moment, let’s assume I’m working with 1024 threads per block (one block per SM), and that I’m using all 80 SMs on a V100. Let’s also assume that as each thread reaches the last phase of its computation, it starts producing “items”, each consisting of a pair of integers plus a list of integers that varies in length from 1 to 100.

I can easily set up each thread with its own area in memory to write these items of varying length. However, I’m concerned that if each thread writes to its own disjoint area, the uncoalesced writes will yield horrible performance.

Another alternative is to group the threads by warp and interleave each warp’s threads’ data within one region of memory (per warp). Unfortunately, each thread’s data varies so much in length that I expect the writes to quickly become uncoalesced.
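Sketched as index computations, the two layouts look roughly like this (the capacity constant and function names are just for illustration):

```cuda
// Hypothetical fixed per-thread capacity, in ints.
#define MAX_INTS_PER_THREAD 4096

// Layout 1: each thread owns a disjoint contiguous region.
__device__ int* threadRegion(int* buf, int tid)
{
    return buf + (size_t)tid * MAX_INTS_PER_THREAD;
}

// Layout 2: the 32 threads of a warp interleave their data, so the i-th
// integer written by each lane lands at consecutive addresses.
__device__ int* warpInterleavedSlot(int* buf, int tid, int i)
{
    int warpId = tid / 32;
    int lane   = tid % 32;
    // warp's region, then row i across 32 lanes, then this lane's column
    return buf + ((size_t)warpId * MAX_INTS_PER_THREAD + (size_t)i) * 32 + lane;
}
```

With equal-length items the second layout coalesces perfectly; once the lengths diverge, the lanes fall out of lockstep and their `i` indices stop matching, which is exactly the degradation I expect.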

I can provide code to elucidate the above two paragraphs, but I think what really needs to be done is for me to state what I’m really looking for:

What I feel I need is a mailbox primitive. The order of delivery doesn’t matter, and each item carries enough information to indicate how it should be sorted/stored on the receiving end. In other words: a high-performance mechanism for sending each item from any thread on the GPU to a CPU thread that’s waiting to receive items (and then store them accordingly in STL objects in main [non-GPU] memory).

Does such a mailbox primitive exist?

I looked in the GPU Computing Gems book and searched for “gather”, “mailbox”, and “channel.” The FEM solver gather looked like it might be relevant, but it didn’t seem applicable at first glance. I also searched the NVIDIA forums for “mailbox” but didn’t find anything obviously relevant. Maybe the primitive doesn’t exist, or maybe I’m just missing it.


If this were my code, I would quickly prototype the two design alternatives already identified, then check the performance. Since we do not know what else is going on in the kernel, the write performance may or may not be a significant issue for application performance. Benchmarking (and if need be, profiling) the two prototype implementations will quickly reveal whether it is an issue or not.

You can use atomicAdd() to quickly determine where your variable-length results should go.
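A minimal sketch of that reservation pattern (buffer and counter names are made up; the output lives in ordinary device global memory and would be copied back after the kernel):

```cuda
__device__ unsigned int g_outCount;   // running total of ints written

// Each item is self-describing: two header ints, a length, then the payload.
__device__ void emitItem(int* out, int a, int b, const int* vals, int len)
{
    // One atomicAdd reserves a private, contiguous span for this item,
    // so threads never race on the same addresses.
    unsigned int off = atomicAdd(&g_outCount, (unsigned int)(3 + len));
    out[off + 0] = a;
    out[off + 1] = b;
    out[off + 2] = len;
    for (int i = 0; i < len; ++i)
        out[off + 3 + i] = vals[i];
}
```

After the kernel finishes, copy `g_outCount` and the first `g_outCount` ints of the buffer back to the host; since each item is self-describing and order doesn’t matter, the CPU thread can walk the buffer and scatter items into STL containers. If you really want items to arrive while the kernel is still running, the same pattern can target a mapped (zero-copy) pinned host buffer, with a `__threadfence_system()` before a per-item “ready” flag is set so the polling CPU thread only sees fully written items — but again, benchmark before committing to that.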

I wouldn’t worry too much about uncoalesced writes before you have actually benchmarked your code - GPUs have become better at dealing with less-than-ideal access patterns, with the help of their caches.
You may be able to use int4 to write out the bulk of the data via wide transactions, particularly if you can afford padding to multiples of four integers. If your integers fit into 16 or fewer bits, you can stretch the memory bandwidth further by using shorts.
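The int4 idea might look like this (it assumes payloads are padded to a multiple of four ints and that both pointers are 16-byte aligned, which int4 loads/stores require):

```cuda
// Copy len4 groups of four ints, one 16-byte transaction per group.
// Both dst and src must be 16-byte aligned, or the accesses will fault.
__device__ void copyWide(int* dst, const int* src, int len4)
{
    int4*       d = (int4*)dst;
    const int4* s = (const int4*)src;
    for (int i = 0; i < len4; ++i)
        d[i] = s[i];
}
```

The same trick works with short4 (8 bytes per transaction, eight values if you pack two shorts per element via short4 pairs) when your values fit in 16 bits.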