I am trying to solve an interesting problem related to CUDA programming.
I have a piece of CUDA code which processes a large number of small blocks, each containing a fixed number of bodies. All bodies in a block process an identical data set, which depends on the block id. A simple example of such a problem is an n-body problem, where all bodies in a block interact with the same set of other bodies, except that this set depends on the block id. There are myriad implementations of such algorithms, so the problem is solved as such.
My interest is to have the number of bodies in a block equal to 16 or 32. Ideally, I'd like to use a blockSize of 16 or 32 respectively, but unfortunately CUDA performance drops if blockSize < 64. With blockSize = 64, only 1/4 or 1/2 of the threads are active respectively, so the performance is lost. Using a larger number of bodies is possible, but not desirable. The first thing that may come to mind is to split the data set into a smaller number of parts if the number of bodies is < 64, as shown in GPU Gems for the n-body problem. Unfortunately, that is not possible in my case because the data set is built as it is processed, and so its size is not known at the beginning.
The solution which I thought might work is to logically subdivide a block into warps, i.e. warpId = tid / 32 and warp_tid = tid % 32. Here, I rely on the assumption that the GPU splits a block into nWarp = blockSize/32 warps, with each warp processing threads from warpId * 32 to warpId * 32 + 31. Since a given instruction is executed by all threads in a warp concurrently, and the warps do not touch each other's data (by design of the algorithm), race conditions should not occur, and therefore there is no need for __syncthreads(); as a bonus, each warp gets its own code path.
In this case I can decrease the minimal parallel unit of the problem and still use the hardware efficiently.
Since it is way too hard to debug code without __syncthreads() (such code won't run correctly in device-emulation mode), I first wrote the code with __syncthreads() to assure that it works in both emulation and native mode, and ran it with blockSize = 32. The code runs as expected in both modes.
So I proceeded to remove __syncthreads() from everywhere to test whether the code still runs. Here I ran into a myriad of difficulties. Removing some of the __syncthreads() calls does not alter the behaviour of the code (though it would with blockSize = 64 and 64 bodies per block), but there are some places in the code where removing the barrier causes the GPU to lock up, and this is reproducible. I can only recover from such lock-ups by a cold reboot. :S
Has anyone thought of or implemented such an algorithm, one that relies on warp-parallelism instead of block-parallelism? Since it is extremely hard to debug the code to nail down the problem (every time the code crashes, the PC requires a cold reboot), and since there could be a bug in the code itself, my other question is whether the assumption that the threads in a block are grouped into warps sequentially is correct or not. Because if it is not, such a scheme will not work and I will have to stick with 64 bodies per block.