As we know, the GPU is widely regarded as a SIMD engine that packs threads together so that one instruction executes on multiple data items at the same time. It is also strongly recommended not to break up a warp of 32 threads with divergent branching, to avoid a significant performance drop. I’m wondering: can we, one level higher, break a block up into groups of threads (each a multiple of the warp size) and dispatch a different task to each group?
For example, can we assign threads 0-63 to read the data in, threads 64-127 to run function 1, and threads 128-255 to run function 2, all at the same time? How would this compare, performance-wise, to not splitting the block and letting all threads run these functions sequentially?
Even if I only use a fixed number of warps per block, say 8 warps (256 threads) per block, so there is no problem launching the same number of blocks per MP to hide latency, what will happen if I let the first two warps process function 1, the next two warps process function 2, and so on?
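To make the idea concrete, here is a minimal sketch of what I mean (function1, function2, the buffer layout and the output arrays are all placeholders I made up for illustration):

[code]
// Hypothetical stand-ins for the two tasks from the question.
__device__ void function1(const float *buf, float *out1, int i)
{
    out1[i] = buf[i % 64] * 2.0f;   // arbitrary stand-in work
}
__device__ void function2(const float *buf, float *out2, int i)
{
    out2[i] = buf[i % 64] + 1.0f;   // some other stand-in work
}

__global__ void splitBlockKernel(const float *in, float *out1, float *out2)
{
    __shared__ float buf[64];
    const int warpId = threadIdx.x / warpSize;   // 0..7 with 256 threads/block

    // The branch condition depends only on warpId, so no warp is ever
    // split: whole warps follow different paths, which costs nothing
    // beyond ordinary scheduling.
    if (warpId < 2) {
        // warps 0-1 (threads 0-63): stage a 64-element tile
        buf[threadIdx.x] = in[blockIdx.x * 64 + threadIdx.x];
    }
    __syncthreads();   // the processing warps must wait for the loaders

    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (warpId >= 2 && warpId < 4) {
        function1(buf, out1, i);   // warps 2-3 (threads 64-127)
    } else if (warpId >= 4) {
        function2(buf, out2, i);   // warps 4-7 (threads 128-255)
    }
}
[/code]

Of course, with the barrier in there the load and the processing are not truly simultaneous; to actually overlap them I suppose one would need double buffering across loop iterations, but the dispatch structure would stay the same.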
This works just fine, apart from the issue that you might run out of resources (max warps, registers, shared memory) earlier than with a uniform workload, as kbam mentioned.
[font=“Courier New”]__syncthreads()[/font] of course synchronizes all threads of a block. If, however, you are taking advantage of the fact that the threads of an individual warp run in lockstep (declaring all variables used as [font=“Courier New”]volatile[/font]), then yes, that implicit synchronization only covers a single warp.
By “lockstep” I mean that a warp’s instructions are always issued together (as long as there are no divergent branches).
The SDK reduction example explains how you may avoid [font=“Courier New”]__syncthreads()[/font] for synchronization within a warp if you prevent the compiler from optimizing away accesses to shared variables.
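From memory, the warp-synchronous tail of that reduction looks roughly like this (a sketch in the spirit of the SDK sample, not the verbatim code):

[code]
__device__ void warpReduce(volatile float *sdata, int tid)
{
    // No __syncthreads() needed: all 32 threads of the warp execute
    // each line together. volatile forces the compiler to re-read
    // shared memory instead of caching values in registers.
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}

__global__ void reduce(const float *in, float *out, int n)
{
    __shared__ volatile float sdata[256];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Barriers are required while more than one warp participates...
    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // ...but the final 64 -> 1 steps run within a single warp,
    // in lockstep, so no barrier is needed there.
    if (tid < 32) warpReduce(sdata, tid);
    if (tid == 0) out[blockIdx.x] = sdata[0];
}
[/code]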