__shfl_sync requires __syncwarp for every call?

Hey all,

I have been trying to keep up with the changes in CUDA 9, and have a question about the pitfall of assuming "implicit warp synchronization." I apologize if I am totally misunderstanding this concept, but I will give it a shot.

I think an example of this would be the following:

if (threadID < 16) {
    // do something
} else {
    // do something else
}
assert(__activemask() == 0xffffffff); // can fail, even though one may think that the warp should synchronize here

My question is: if we use functions like __shfl_sync, which accept a mask in their interface, should we always call __syncwarp() prior to every shfl call if we want the full warp to execute it? That's my impression from the reading, even though I never see an example done that way (I do now see people use __syncwarp() to guard against data races on shared memory, but not in, say, shfl reductions).

In other words, is it true that __shfl_sync(0xFFFFFFFF, …) may not actually apply to all threads at the same time, since the programming guide tells me that __shfl_sync functions work on ACTIVE threads within the mask? And so if I have an if statement anywhere above the shfl functions, even if that statement has already been resolved, is it possible that not all threads are considered active by the time execution reaches a shuffle statement?

In our simple example that means:

if (threadID < 16) {
    // do something
} else {
    // do something else
}

__syncwarp(); // Is this necessary?
__shfl_sync(0xFFFFFFFF, …);

Furthermore, do all subsequent shfl calls need a __syncwarp() before them?

Note: someone might reply that the inactive threads will become active and call the same shfl function at some later execution cycle, so there is no problem. My response would be that this is not necessarily fine, because the next instruction may be another shfl call that assumes the previous one has already completed across all threads (like in a shfl reduction algorithm). If the first set of active threads calls that next shfl before all threads have called the previous one, I think you could get a different result than intended.
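To make that concrete, this is roughly the kind of shfl reduction I have in mind (just a sketch I wrote for illustration, not code from any particular source):

// Sketch of a warp-sum reduction. Each step assumes every lane already holds
// the correct partial sum from the previous step, so if one subset of lanes
// could run step k+1 before the rest had finished step k, the result would be wrong.
__device__ int warp_sum(int val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val; // lane 0 ends up with the sum over all 32 lanes
}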

Lastly, if I remove the __syncwarp above and replace
__shfl_sync(0xFFFFFFFF, …);
with
unsigned mask = __activemask(); __shfl_sync(mask, …);

is there any difference? If the function only executes for the active threads named in the mask, I do not see a difference.
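Written out, the two variants I am comparing are something like this (sketch only; srcLane is just a placeholder):

// Variant 1: full mask, all 32 lanes named.
__device__ int variant_full_mask(int val, int srcLane)
{
    return __shfl_sync(0xffffffff, val, srcLane);
}

// Variant 2: mask sampled at the call site.
__device__ int variant_activemask(int val, int srcLane)
{
    unsigned mask = __activemask();         // only the lanes that are active right here
    return __shfl_sync(mask, val, srcLane); // shuffle among just those lanes
}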

I only list these examples because many people have told me that all I need to do to update legacy shfl calls is to either use __activemask() or just a full mask of 0xFFFFFFFF, but I do not think it is that simple.

Thanks ahead of time for any help

The mask in the __shfl_sync call indicates that:

  • these warp lanes are required for this call (i.e. for the execution of this shuffle operation)
  • if those warp lanes are not currently converged, then perform the equivalent of a partial __syncthreads (or __syncwarp) operation to converge those threads/warp lanes.

The net result is that the shuffle operation will always complete correctly, guaranteed. By “correctly” I mean that it will force all threads (warp lanes) specified in the mask to participate. No extra __syncwarp call is required.
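As a rough sketch of what I mean (my own illustration, not from the programming guide):

__global__ void example(int *out)
{
    int lane = threadIdx.x % 32;
    int val;

    if (lane < 16)
        val = lane;   // "do something"
    else
        val = -lane;  // "do something else"

    // No extra __syncwarp() is needed here: the full mask tells __shfl_sync
    // that all 32 lanes are required, so they are reconverged for the exchange.
    int lane0_val = __shfl_sync(0xffffffff, val, 0);
    out[threadIdx.x] = lane0_val;
}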

__activemask is different, of course. It is not a synchronizing operation. It merely returns a bitmask indicating which warp lanes are converged (i.e. actually participating in that instruction) at that moment.

To update legacy code, I would not use __activemask. I would use the corresponding __shfl_sync operation with a “full” mask i.e. 0xFFFFFFFF

Of course if you know that your legacy code would operate correctly in a partially converged case, you might choose something other than a full mask.
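As a made-up example of that kind of update:

// Legacy (pre-CUDA 9) form was:   v = __shfl(v, 0);
// Updated form, naming all 32 lanes in the mask:
__device__ float broadcast_from_lane0(float v)
{
    return __shfl_sync(0xffffffff, v, 0);
    // I would not write __shfl_sync(__activemask(), v, 0) as the update.
}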

Thanks very much for the reply txbob, it is very helpful. I would also assume that any function with a *_sync name performs the same kind of synchronization, correct?

So if you specify a mask of threads, some of which will not be active (e.g. if (threadID < 16) __shfl_sync(0xFFFFFFFF, …)), it will hang then, right?

This BTW makes sense given what I would expect __shfl_sync to do; since it has sync in its name, I assumed it did some sort of synchronization based on the mask. I guess this part of the Programming Guide

“The __shfl_sync() intrinsics permit exchanging of a variable between threads within a warp without use of shared memory. The exchange occurs simultaneously for all active threads within the warp (and named in mask)”.
confused me into thinking that the function could be called and would only apply to the "active" threads listed in the mask, when in reality it means that the function applies to all threads listed in the mask that will converge (and hence become active).

Thanks again for your reply

On Volta, at least, it should be acceptable. Refer to slide 37 here:

http://on-demand.gputechconf.com/gtc/2017/presentation/s7622-Kyrylo-perelygin-robust-and-scalable-cuda.pdf

I probably wouldn't be able to immediately respond to some of the follow-up questions I can anticipate you might have based on that, however.

That's a good reference, and is exactly the kind of situation I am concerned with… In that example, would the get_warp_sum call be performed on the full mask, or solely on the diverged threads? If it is the former, I can imagine potential issues w.r.t. scope.

If it is the latter, it means that on Volta it will sync the possibly-active threads specified within the mask, but does not require that the mask be fully satisfied by them. Sadly, I would have preferred it not be that way and instead break if you were not careful, but I can imagine why NVIDIA would want to do it that way for simplicity.

I think the takeaway for me would be that I can replace all of my shfl functions with the new version using a full mask. And if I had any shuffle functions inside a divergent branch, to be safe I will just specify a mask for the relevant lanes within that logic instead of blindly using a full mask, even though I guess Volta allows the full mask.
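For the divergent-branch case, I am picturing something like this sketch (assuming lanes 0-15 take the branch; the names are just placeholders):

__global__ void k(float *data)
{
    int lane = threadIdx.x % 32;
    float v = data[threadIdx.x];

    if (lane < 16) {
        const unsigned branch_mask = 0x0000ffffu; // only the 16 lanes that take this branch
        for (int offset = 8; offset > 0; offset >>= 1)
            v += __shfl_xor_sync(branch_mask, v, offset); // exchange among those 16 lanes only
    }

    data[threadIdx.x] = v;
}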

Thanks again for your help, and I understand if you can't respond to that last question.