Thanks to the GPU, I should be able to run some expensive computation in parallel across a very large number of threads. What I'm struggling with is a good scheme for gathering the results.
For the moment, let's assume I'm working with 1024 threads per block/SM, and that I'm using all 80 SMs on a V100. Let's also assume that as each thread reaches the last phase of its computation, it starts producing "items" that consist of a pair of integers plus a list of integers varying in length from 1 to 100.
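For concreteness, I'm picturing each item as something like the struct below (the field names are just placeholders I made up):

```
// Hypothetical layout for one "item": a pair of integers plus a
// variable-length payload of 1 to 100 integers.
struct Item {
    int first;          // first integer of the pair
    int second;         // second integer of the pair
    int payloadLen;     // actual payload length, 1..100
    int payload[100];   // only the first payloadLen entries are valid
};
```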
I can easily set up each thread to have its own area in memory to write these items of varying length. However, I'm concerned that if each thread is writing to its own disjoint area, the uncoalesced writes will yield horrible performance.
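Here's a rough sketch of what I mean by disjoint per-thread areas (the kernel name and MAX_ITEMS_PER_THREAD are placeholders, and the actual item production is elided):

```
// Scheme 1: every thread owns a fixed-size, disjoint slice of one big
// output buffer, so no synchronization is needed, but adjacent threads
// write to addresses that are MAX_ITEMS_PER_THREAD items apart.
constexpr int MAX_ITEMS_PER_THREAD = 64;   // made-up upper bound

__global__ void producePerThread(Item *out, int *counts)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    Item *mySlice = out + tid * MAX_ITEMS_PER_THREAD;
    int n = 0;

    // ... expensive computation; whenever an item is ready:
    //     mySlice[n++] = item;   // lands in this thread's private slice

    counts[tid] = n;   // the host reads counts[] to know how much each thread wrote
}
```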
Another alternative is to group the threads by warp and interleave each warp's threads' data within one region of memory (per warp). Unfortunately, each thread will have data of such varying length that I expect the writes to quickly become uncoalesced.
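And a sketch of that warp-interleaved variant; while all 32 lanes emit an item in the same step the stores land next to each other, but as soon as payload lengths diverge the lanes fall out of step:

```
// Scheme 2: the 32 threads of a warp share one region and interleave
// their items: lane i writes its k-th item at slot k*32 + i.
__global__ void producePerWarp(Item *out)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = tid / 32;
    int lane   = tid % 32;
    Item *warpRegion = out + warpId * 32 * MAX_ITEMS_PER_THREAD;
    int n = 0;

    // ... whenever this lane has an item ready:
    //     warpRegion[n * 32 + lane] = item;
    //     ++n;
}
```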
The sketches above are meant to elucidate those two paragraphs, but I think what really needs to be done is for me to state what I'm actually looking for:
What I feel I need is a mailbox primitive: a high-performance mechanism to send each item from any thread on the GPU to a CPU thread that's waiting to receive items and store them in STL objects in main [non-GPU] memory. The order of delivery doesn't matter, and each item carries enough information to indicate how it should be sorted/stored on the receiving end.
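To make the requirement concrete, this is roughly the device-side shape I imagine. None of this is an existing API, just the interface I'd like: a queue of Item slots in pinned, mapped host memory, a device-resident atomic ticket counter, and a per-slot ready flag that a CPU thread polls before filing the item into STL containers.

```
// The "mailbox" I have in mind (NOT an existing CUDA API, just the shape
// of the thing I'm looking for).
struct Mailbox {
    Item         *slots;        // cudaHostAlloc'd with cudaHostAllocMapped
    unsigned int *ready;        // same, one flag per slot, initially 0
    unsigned int *writeTicket;  // single counter in device global memory
    unsigned int  capacity;
};

__device__ void sendItem(const Mailbox &mb, const Item &item)
{
    // Grab a unique slot; no overflow handling in this sketch, so the CPU
    // must drain slots faster than the GPU fills them.
    unsigned int slot = atomicAdd(mb.writeTicket, 1u) % mb.capacity;
    mb.slots[slot] = item;                          // payload goes over PCIe
    __threadfence_system();                         // payload visible before the flag
    ((volatile unsigned int *)mb.ready)[slot] = 1;  // CPU sees the flag, copies the
                                                    // item into an STL container,
                                                    // then resets the flag to 0
}
```

That sketch obviously punts on flow control, which is exactly why I'd rather use an existing, tuned primitive than roll my own.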
Does such a mailbox primitive exist?
I looked in the GPU Computing Gems book and searched for "gather", "mailbox", and "channel." There's a chance the FEM Solver gather is relevant, but it didn't seem applicable at first glance. I also searched the NVIDIA forums for "mailbox" but didn't find anything obviously relevant. Maybe the primitive doesn't exist, or maybe I'm just missing it.