I’ve been working on implementing a work-queue-like structure in CUDA for an application I am writing. To give a rough sense of what my app looks like, think of a multi-level binary tree where each node runs as a CUDA block. There are data dependencies between levels, preventing me from launching the whole tree at once. Naturally, this means using multiple kernel calls, which incurs launch overhead (and yes, it is significant in my application: an estimated 15-25% of my actual run time) and also means less and less work for the GPU as you launch kernels for the top levels (worst case, a single block!).
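For context, the baseline I'm describing is something like the following host-side loop. This is only a sketch of the pattern, not my actual code; `levelKernel`, `levelStart`, and `threadsPerBlock` are placeholder names.

```cuda
// Baseline: one kernel launch per tree level, processed bottom-up.
// Launches in the same stream serialize, so dependencies between levels
// are respected, but every launch pays overhead and the grids shrink
// toward the root.
for (int level = numLevels - 1; level >= 0; --level) {
    int nodesInLevel = 1 << level;   // e.g. a complete binary tree
    levelKernel<<<nodesInLevel, threadsPerBlock>>>(d_tree, levelStart(level));
}
cudaDeviceSynchronize();
```

Near the root, `nodesInLevel` drops to 1, so the GPU is almost idle for those launches.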
My solution to remove the multiple kernel calls was as follows: I launch only as many blocks as can run concurrently on my GPU at one time. Rather than using blockIdx.x, each block gets its “block ID” by doing an atomicInc on a global counter to obtain an index into the work queue. This way I can enforce the order in which my blocks execute, and it has won back around 10-15% of my execution time in the best cases.
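To make the idea concrete, here is a hypothetical sketch of a persistent-block work queue along these lines. Everything here is illustrative, not my actual implementation: `processNode`, `parentOf`, and the `childrenDone` dependency counters are assumed names, and the dependency-wait logic is just one possible way to order parents after their children. Note this pattern only avoids deadlock if the launched grid really does fit entirely on the GPU at once, since blocks busy-wait on each other.

```cuda
__device__ unsigned int nextNode = 0;  // global work-queue counter

__global__ void persistentTreeKernel(Node *nodes, int numNodes,
                                     volatile int *childrenDone)
{
    __shared__ unsigned int nodeId;

    while (true) {
        // One thread per block grabs the next queue slot; this decouples
        // the logical node ID from blockIdx.x and fixes execution order.
        if (threadIdx.x == 0)
            nodeId = atomicInc(&nextNode, 0xFFFFFFFFu);
        __syncthreads();

        if (nodeId >= numNodes)
            return;  // queue drained; block retires

        // Busy-wait until both children of this node have finished.
        // Works only because the queue is ordered bottom-up and all
        // blocks are resident simultaneously.
        if (threadIdx.x == 0)
            while (childrenDone[nodeId] < 2) { /* spin */ }
        __syncthreads();

        processNode(&nodes[nodeId]);  // placeholder for the real work

        __threadfence();  // make results visible to other blocks
        if (threadIdx.x == 0 && nodeId > 0)
            atomicAdd((int *)&childrenDone[parentOf(nodeId)], 1);
        __syncthreads();  // keep nodeId stable until all threads loop
    }
}
```

The `__threadfence()` before signaling the parent matters: without it, another block could observe the "done" flag before the node's results are globally visible.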
I’m wondering if anyone else has ideas for similar types of applications: structures with data dependencies like a binary tree, where a uniform kernel processes a hierarchical structure whose dependencies are typically resolved using multiple kernel calls. The reduction sample in the SDK roughly fits this pattern as well.