Can different warps in a block do different things?


As we know, the GPU is widely regarded as a SIMD engine that packs threads together so that one instruction is executed on multiple data at the same time. It is also strongly recommended that a warp of 32 threads not be broken up by branching, to avoid a significant performance drop. I’m wondering: can we, one level higher, break up a block into groups of threads (each a multiple of the warp size) and dispatch a different task to each group individually?

For example, can we assign threads 0–63 to read the data in, threads 64–127 to process function 1, and threads 128–255 to process function 2, all at the same time? How does this compare, performance-wise, to not splitting the block and letting all threads process these functions sequentially?
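A minimal sketch of what such a split might look like, assuming a 256-thread block; `load_stage`, `func1`, and `func2` are hypothetical placeholder device functions, not anything from the SDK:

```cuda
// Hypothetical sketch: specializing warp groups within a 256-thread block.
__global__ void splitKernel(const float *in, float *out)
{
    int tid = threadIdx.x;

    if (tid < 64) {
        // Warps 0-1: read data in (e.g. into shared memory)
        load_stage(in, tid);
    } else if (tid < 128) {
        // Warps 2-3: function 1
        func1(out, tid);
    } else {
        // Warps 4-7: function 2
        func2(out, tid);
    }
    // Each branch covers whole warps (the boundaries are multiples of 32),
    // so no warp diverges internally and the usual divergence penalty
    // within a warp is avoided.
}
```

Note that if `func1`/`func2` consume the data that the first group loads, you need a `__syncthreads()` barrier between the stages, which prevents the stages from actually running concurrently.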


It means more warps per block, which could lead to fewer blocks per MP, which means less opportunity to hide latency, so slower.

Not sure if there may also be a drawback if the kernel needs to use local memory (more threads (warps) means more local memory used, with an impact on the cache).

Looks like the time an individual block spends on the MP will be lower (provided latency isn’t an issue), but the elapsed time for a large number of blocks will be higher.

Suppose I use a fixed number of warps per block, say 8 warps (256 threads), so there is no problem launching the same number of blocks per MP to hide latency. What will happen if I let the first two warps process function 1, the next two warps process function 2, and so on?

This works just fine, apart from the issue that you might run out of resources (max warps, registers, shared memory) earlier than with a uniform workload, as kbam mentioned.

I see :) I was just worried that so many conditional branches on thread indices would cause overhead in flow control and synchronization.

Is it correct that synchronization only happens for threads within a single warp, rather than synchronizing all the warps in a block?

[font=“Courier New”]__syncthreads()[/font] of course synchronizes all threads of a block. If however you are taking advantage of the fact that threads of individual warps are running in lockstep (declaring all variables used as [font=“Courier New”]volatile[/font]) then yes, this only applies to single warps.

Thanks for the reply tera.

But what do you mean by “lockstep”? And how does lockstep relate to volatile variables as opposed to regular, non-volatile ones?

By “lockstep” I mean that a warp’s instructions are always issued together (as long as there are no divergent branches).

The SDK reduction example explains how you may avoid [font=“Courier New”]__syncthreads()[/font] for synchronization within a warp if you prevent the compiler from optimizing away accesses to shared variables.
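Roughly, the trick in that example is: once only the threads of the first warp are still active (tid < 32), no [font=“Courier New”]__syncthreads()[/font] is needed because the warp runs in lockstep; declaring the shared-memory pointer [font=“Courier New”]volatile[/font] keeps the compiler from caching the intermediate values in registers. A sketch of that warp-synchronous tail (following the later kernels of the SDK reduction sample):

```cuda
// Warp-synchronous final reduction stage, called with tid < 32.
// 'volatile' forces each shared-memory read/write to actually happen,
// so every thread sees what its neighbors just wrote without a barrier.
__device__ void warpReduce(volatile float *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}
```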

I’ll look at the example. Thanks very much!