Warp Synchronisation Problem?

trex · November 20, 2008, 1:25am

Let's assume I have 2 warps in my block, and for the sake of argument this is how I want to do it (or at least this is what it comes down to at the end of the day):

if (first warp) {
    // do something that requires __syncthreads() (synchronise the threads inside the warp)
}
__syncthreads();  // synchronise the warps
if (second warp) {
    // do something that requires __syncthreads() (synchronise the threads inside the warp)
}
__syncthreads();  // synchronise the warps

Will it work?

Essentially, is it possible to have that sync point force the second warp to wait for the first warp, while the first warp does its own syncing inside the branch?

Ailleur · November 20, 2008, 1:52am

It's always dangerous to put a __syncthreads() inside an if clause (if I'm reading your code right).

From the programming guide:

trex · November 20, 2008, 2:39am

So it would seem… it looks like it doesn't work and crashes. Oh well, I'll add warp synchronisation to my wishlist. ;)
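For what it's worth, the pattern from the question can usually be restructured so that only the work sits inside the branches, while every __syncthreads() is reached by all threads of the block. Below is a minimal sketch of that idea, not a definitive implementation. The kernel name twoPhase, the host helper runTwoPhase, the shared buffer stage and the arithmetic are all made up for illustration, and it assumes a block of exactly 64 threads with warpSize equal to 32.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: warp 0 produces values, warp 1 consumes them.
// Assumes blockDim.x == 64 and warpSize == 32.
__global__ void twoPhase(int *out)
{
    __shared__ int stage[32];
    const int warp = threadIdx.x / warpSize;
    const int lane = threadIdx.x % warpSize;

    if (warp == 0) {
        stage[lane] = lane * 2;            // phase 1: only warp 0 does work
    }
    __syncthreads();                       // outside the branch: every thread arrives

    if (warp == 1) {
        out[lane] = stage[31 - lane] + 1;  // phase 2: warp 1 reads warp 0's results
    }
    __syncthreads();                       // only needed if later code reuses stage
}

// Hypothetical host helper: launches one block of 64 threads and returns the
// number of mismatching elements, or -1 if a CUDA call fails.
int runTwoPhase()
{
    int *dOut = nullptr;
    int hOut[32];
    if (cudaMalloc(&dOut, sizeof(hOut)) != cudaSuccess) return -1;

    twoPhase<<<1, 64>>>(dOut);
    cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < 32; ++i) {
        if (hOut[i] != (31 - i) * 2 + 1) ++bad;
    }
    return bad;
}
```

The second warp still waits for the first, because the barrier between the two phases is executed by both warps; the idle warp simply skips the work and goes straight to the barrier.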

alex_dubinsky · November 20, 2008, 6:35am

You don’t need to sync inside one warp since you’re guaranteed that the 32 threads in a warp work in lockstep. (At least, you’re guaranteed this on an NVIDIA GPU. On a CPU or a Cell, if that ever comes around, this will probably break.)

Gregory_Diamos · November 30, 2008, 7:34pm

The definition of warps implies that they run in lockstep. However, it is possible (on any of the architectures that you mentioned) to dynamically split and recombine warps as in [1]. In PTX, it is possible to force warps to remain convergent by supplying the .uni suffix to all branch instructions, though you should guarantee that the threads within a warp always evaluate the branch instructions identically. I do not know of an equivalent operation using the higher level CUDA API.

You might be able to assume that groups of warpSize threads will run in lockstep unless they encounter divergent control flow, but that could change in future releases of CUDA. And as alex mentioned, software implementations of warps on other architectures will almost certainly not provide this guarantee: neither SSE4 nor the SPU provides enough functionality to run all PTX instructions packed into vectors.
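As a note from well after this thread: CUDA 9.0 and later provide __syncwarp(), a barrier for the threads of a single warp, and on GPUs with independent thread scheduling (Volta onward) implicit lockstep is not something to rely on. The sketch below assumes CUDA 9 or newer, a block of 64 threads and warpSize equal to 32. The names warpZeroSum, runWarpZeroSum and buf are hypothetical. It shows one warp synchronising internally inside a warp-uniform branch, with the block-wide barrier kept outside the branch.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: warp 0 sums the values 1..32 in shared memory using
// warp-level barriers; warp 1 then publishes the result.
// Assumes blockDim.x == 64, warpSize == 32, CUDA 9.0 or newer.
__global__ void warpZeroSum(int *out)
{
    __shared__ int buf[32];
    const int warp = threadIdx.x / warpSize;
    const int lane = threadIdx.x % warpSize;

    if (warp == 0) {                  // warp-uniform: all 32 lanes take the branch
        buf[lane] = lane + 1;
        __syncwarp();                 // writes visible to the whole warp
        for (int stride = 16; stride > 0; stride /= 2) {
            if (lane < stride) {
                buf[lane] += buf[lane + stride];
            }
            __syncwarp();             // called by every lane, outside the inner if
        }
    }
    __syncthreads();                  // block-wide barrier, reached by both warps

    if (threadIdx.x == warpSize) {    // first lane of warp 1
        *out = buf[0];
    }
}

// Hypothetical host helper: returns the computed sum, or -1 if a CUDA call fails.
int runWarpZeroSum()
{
    int *dOut = nullptr;
    int hOut = 0;
    if (cudaMalloc(&dOut, sizeof(int)) != cudaSuccess) return -1;

    warpZeroSum<<<1, 64>>>(dOut);
    cudaError_t err = cudaMemcpy(&hOut, dOut, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return (err == cudaSuccess) ? hOut : -1;
}
```

With these inputs the helper should return 528, the sum of 1 through 32. The default mask of __syncwarp() names all 32 lanes, which is why the call sits outside the inner if: every lane in the mask has to execute it.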

[1] Fung, W.W.L.; Sham, I.; Yuan, G.; Aamodt, T.M., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), pp. 407-420, 1-5 Dec. 2007.