I’m hoping someone could give me some advice on how to solve a problem I’ve run into…
I’m planning out the implementation of an algorithm that has multiple stages.
Currently, each stage is mapped to a kernel where each thread block does some processing on a single element in an array.
If there is more work to be done on an element after the current stage is completed, thread 0 of the block will write the index of the element into out[blockIdx.x], else it will write a 0.
For the next stage, I only need to launch blocks for the elements that need to be processed further. This number can be determined by using the ‘out’ array from the last stage.
However, I’m having a hard time finding a way for the blocks to determine which element to work on because the indices of elements are mixed in with many non-work elements (0s) in the ‘out’ array.
I’ve come up with a few solutions but none of them are ideal…
1) Launch the same number of blocks each time. Blocks that have a 0 in the output array from the last stage will return immediately
Pros: Easy to implement, no need to copy memory back to host, no extra processing needed
Cons: Overhead from launching extra blocks, wasted resources, unnecessary work & overhead for the block scheduler
2) Copy the output array back to the host and have the host store the non-0 indices that need more processing into a new array. The new array can then be copied to the next kernel which launches only the required number of blocks.
Pros: Only the required number of blocks are launched
Cons: Memory copy to the host and back
3) Launch another kernel to accomplish (2) above instead of doing it on the host. Host reads a single integer value back that states how many blocks need to be launched.
Pros: No need to copy output array back to the host. Could use zero-copy to retrieve the single integer (so it would be relatively fast?)
Cons: Would only launch 1 block (no benefit from launching more?), would probably require atomics to control placement of indices in new array. Still need to copy back an integer to host
4) Store -1s instead of 0s if there is not more work to be done. Between stages, launch a kernel to do a parallel sort, moving all of the -1s to the start of the array. Then count the number of -1s and store that value in global memory so (a) the next stage can use it to determine where to start reading and (b) The host will know how many blocks to launch.
Pros: No need to copy output array back to the host. Could use zero-copy to retrieve the single integer (so it would be relatively fast?). Parallel sort would hopefully be more efficient than the kernel in (3) above.
Cons: Still need to copy back an integer to host, counting number of -1s would probably require atomics (or possibly a reduction which would mean another kernel call?)
This is probably most similar to the following two threads but I don’t think either of them accomplishes quite what I need.
[url=“http://forums.nvidia.com/index.php?showtopic=95451”]http://forums.nvidia.com/index.php?showtopic=95451[/url]
[url=“http://forums.nvidia.com/index.php?showtopic=91737”]http://forums.nvidia.com/index.php?showtopic=91737[/url]
I would really appreciate any advice you guys could offer. Thanks!