Warp parallelism: how to decrease the size of a parallel element

Hi all,

I am trying to solve an interesting problem related to CUDA programming.

I have a piece of CUDA code which processes a large number of small blocks, each containing a fixed number of bodies. Each of these bodies processes an identical data set, which also depends on the block id. A simple example of such a problem is an n-body problem, where each of the bodies in a block interacts with the same set of other bodies, except that this set depends on the block id. There are a myriad of implementations of such algorithms, so the problem is solved as such.

My interest is to have the number of bodies in a block equal to 16 or 32. Ideally, I’d like to use a blockSize of 16 or 32 respectively, but unfortunately CUDA performance drops if blockSize < 64. With blockSize=64, only 1/4 or 1/2 of the threads are active respectively, so the performance is lost. Using a larger number of bodies is possible, but not desirable. The first thing that may come to mind is to split the data set into a smaller number of parts if the number of bodies is < 64, as shown in GPU Gems for the n-body problem. Unfortunately, that is not possible in my case because the data set is built as it is processed, and so its size is not known at the beginning.

The solution which I thought might work is to logically subdivide a block into warps, i.e. warpId = tid / 32 and warp_tid = tid % 32. Here I rely on the assumption that the GPU splits a block into nWarp = blockSize/32 warps, with each warp processing threads from warpId * 32 to warpId * 32 + 31. Since a given instruction is executed by all threads in a warp concurrently, and warps do not touch each other’s data (by design of the algorithm), race conditions will not occur, and therefore there is no need to use __syncthreads(); as a bonus, each warp can follow its own code path.
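For concreteness, the decomposition could be sketched roughly like this (kernel and variable names are mine for illustration, not from the actual code):

```cuda
__global__ void bodyKernel(int nBodiesPerGroup)
{
    // Decompose the block into warp-sized groups.
    const int warpId  = threadIdx.x / 32;   // warp index within the block
    const int warpTid = threadIdx.x % 32;   // lane index within the warp

    // With blockSize = 64, each block hosts nWarp = blockDim.x / 32 = 2
    // independent groups; each warp handles one group of bodies.
    const int group = blockIdx.x * (blockDim.x / 32) + warpId;
    const int body  = group * nBodiesPerGroup + warpTid;

    // ... per-body work on `body` goes here; as long as warps never
    // touch each other's data, no __syncthreads() is needed ...
}
```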

In this way I can decrease the minimal parallel unit of the problem and still use the hardware efficiently.

Since it is way too hard to debug code without __syncthreads() (such code won’t run correctly in device-emulation mode), I wrote the code with __syncthreads() to make sure it works in both emulation and native mode, and ran it with blockSize = 32. The code runs as expected in both modes.

So I proceeded to remove __syncthreads() from everywhere to test whether the code still runs. Here I ran into a myriad of difficulties. Removing some of the __syncthreads() calls does not alter the behaviour of the code (though it would with blockSize = 64 and 64 bodies per block), but there are some places in the code where removing the barrier causes a lock-up of the GPU, and this is reproducible. Such lock-ups I can only recover from by a cold reboot. :S

Has anyone thought of / implemented such an algorithm, which relies on warp-parallelism instead of block-parallelism? Since it is extremely hard to debug the code to nail down the problem (every time the code crashes, the PC requires a cold reboot), and since there could be a bug in the code itself, my other question is whether the assumption that the threads in a block are grouped into warps sequentially is correct or not. Because if it is not, such a scheme will not work and I will have to stick with 64 bodies per block.



I think I have done it once… each warp following its own path… I also had 64 threads per block…
It worked… But the code looked nasty…
So I dropped that idea…

But your lockup suggests a “deadlock” problem to me. You are probably using __syncthreads() inside an “if” or “for” loop that is NOT being executed by all threads of a block… (maybe only a few warps execute it)…
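A hypothetical sketch of that pattern (not code from this thread): __syncthreads() is a block-wide barrier, so if only some warps reach it, the block hangs.

```cuda
__global__ void hangs(const float *input)
{
    __shared__ float sdata[32];
    const int warpId  = threadIdx.x / 32;
    const int warpTid = threadIdx.x % 32;

    if (warpId == 0) {
        sdata[warpTid] = input[warpTid];
        // BROKEN: warps with warpId != 0 never reach this barrier,
        // so warp 0 waits forever and the GPU appears to lock up.
        __syncthreads();
    }
    // ... rest of the kernel ...
}
```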

Well, I fixed the problem. It was a bug in the code. Not sure which one, since instead of hunting it down it was easier to rewrite the necessary parts of the code from scratch. The speed benefit, at least in my case, is marginal (10%). The code, however, has pretty much the same complexity as the original, as long as shared memory is properly divided between warps and a warp does not touch the shared memory of other warps.

The good part is that it does indeed appear that threads in a block are combined into warps in sequence: 0 to 31, 32 to 63, etc. And so if the code is designed for warp-parallelism, no __syncthreads() is required anymore. It is indeed possible to design production codes which have a warp, instead of a block, as the parallel unit. But one should exercise caution, since if the size of a warp changes, the code may not work anymore, i.e. such code may have limited portability.
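The shared-memory split described above might look like this (sizes and names are illustrative, assuming blockSize = 64):

```cuda
__global__ void warpParallel(float *out)
{
    // One 32-element slice of shared memory per warp (2 warps per block).
    __shared__ float sdata[2 * 32];
    const int warpId  = threadIdx.x / 32;
    const int warpTid = threadIdx.x % 32;

    // Each warp stays inside its own slice, so warps never race on
    // shared memory and no __syncthreads() is needed between them.
    float *slice = &sdata[warpId * 32];
    slice[warpTid] = (float)warpTid;

    // ... warp-local work on slice[] goes here ...

    out[blockIdx.x * blockDim.x + threadIdx.x] = slice[warpTid];
}
```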

I doubt this, since the warp-parallel code has no __syncthreads(). But because the code was buggy, one of the while-loops with an incorrect loop variable may have caused a deadlock.



There is a glitch with having a warp-sized block.

The warp size is 32.

However, the number of registers allocated for your app = 64 registers (and NOT 32 registers).
Check the CUDA occupancy calculator

This is the biggest hurdle with having 32 threads per block.

If you can live with that, this model works well.