Hey all,
I have been trying to keep up with the latest CUDA 9 programming model, and I have a question about the pitfall of assuming “implicit warp synchronization.” I apologize if I am totally misunderstanding this concept, but I will give it a shot.
I think an example of this would be the following:
if (threadID < 16)
    Do something;
else
    Do something else;
assert(__activemask() == 0xffffffff); // can fail, even though one might expect the warp to reconverge here
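To make that concrete, here is a minimal sketch of how I would observe this. The kernel and helper names (record_active_mask, collect_active_masks) and the placeholder work in each branch are my own inventions for illustration. Note that with branches this trivial the compiler may well use predication, so the mask could come back as 0xffffffff every time; my point is only that nothing guarantees it.

```cuda
#include <cuda_runtime.h>

// Each lane takes one side of a divergent branch, then records whatever
// __activemask() reports right after the if/else.
__global__ void record_active_mask(unsigned *masks)
{
    unsigned lane = threadIdx.x % 32;
    volatile int sink = 0;          // placeholder work (hypothetical)
    if (lane < 16)
        sink = 1;                   // "do something"
    else
        sink = 2;                   // "do something else"
    // The only thing I am sure of: the calling lane's own bit is set.
    // The value is not guaranteed to be 0xffffffff.
    masks[lane] = __activemask();
}

// Hypothetical host helper: launch a single warp and copy the 32 masks back.
// Returns false if a CUDA runtime call fails.
bool collect_active_masks(unsigned out[32])
{
    unsigned *d_masks = nullptr;
    if (cudaMalloc(&d_masks, 32 * sizeof(unsigned)) != cudaSuccess)
        return false;
    record_active_mask<<<1, 32>>>(d_masks);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    bool ok = cudaMemcpy(out, d_masks, 32 * sizeof(unsigned),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_masks);
    return ok;
}
```

Calling collect_active_masks from main and printing the 32 values in hex shows which lanes were grouped together when each one reached the __activemask() call.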
My question is: if we use functions like __shfl_sync, which accept a mask in their interface, should we always call __syncwarp() prior to every shfl call if we want a full warp to execute it? That's my impression from the reading, even though I never see an example written that way (I do see people now using __syncwarp() to avoid data races with shared memory, but not in, say, shfl reductions).
In other words, is it true that __shfl_sync(0xFFFFFFFF, …) may not actually apply to all threads at the same time? The programming guide tells me that the __shfl_sync functions work on the ACTIVE threads within the mask. So if I have an if statement anywhere above the shfl call, even one that has already been resolved, is it possible that not all threads are considered active by the time execution reaches the shuffle?
In our simple example, that means:
if (threadID < 16)
    Do something;
else
    Do something else;
…
…
…
__syncwarp(); //Is this necessary?
__shfl_sync(0xFFFFFFFF, …)
Furthermore, do all subsequent shfl calls need a __syncwarp() before them?
Note: someone may reply that the inactive threads will become active later and call the same shfl function at some later point, so there is no problem. My response would be that this is not necessarily good, because the next instruction may be another shfl call that assumes the previous one has already completed on all threads (as in a shfl reduction algorithm). If the first set of active threads calls that next shfl before all threads have called the previous one, I think you could get a different result than intended.
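For reference, this is the kind of shfl reduction I have in mind, as a minimal sketch. The names (branch_then_reduce, warp_sum_after_branch) and the use_syncwarp flag are hypothetical, added only so the same kernel can be run with and without the __syncwarp() I am asking about. My reading of the guide is that every lane named in the mask has to reach the __shfl_down_sync call before it executes, in which case both variants should give the same sum, but that is exactly the assumption I would like confirmed.

```cuda
#include <cuda_runtime.h>

// Divergent branch followed by a full-warp shuffle reduction.
__global__ void branch_then_reduce(int *result, bool use_syncwarp)
{
    unsigned lane = threadIdx.x % 32;
    int val;
    if (lane < 16)
        val = 1;                    // "do something"
    else
        val = 2;                    // "do something else"

    if (use_syncwarp)
        __syncwarp();               // the call whose necessity I am asking about

    // Each step assumes the previous step has finished on all 32 lanes.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);

    if (lane == 0)
        *result = val;              // lane 0 holds the warp-wide sum
}

// Hypothetical host helper: run one warp and return lane 0's sum,
// or -1 if a CUDA runtime call fails.
int warp_sum_after_branch(bool use_syncwarp)
{
    int *d_result = nullptr;
    int h_result = -1;
    if (cudaMalloc(&d_result, sizeof(int)) != cudaSuccess)
        return -1;
    branch_then_reduce<<<1, 32>>>(d_result, use_syncwarp);
    if (cudaMemcpy(&h_result, d_result, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        h_result = -1;
    cudaFree(d_result);
    return h_result;
}
```

If all 32 lanes take part in every step, lane 0 should end up with 16 * 1 + 16 * 2 = 48, so comparing warp_sum_after_branch(true) against warp_sum_after_branch(false) is one way to probe the question on a given GPU.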
Lastly, if I remove the __syncwarp above and I replace
__shfl_sync(0xFFFFFFFF, …)
with
unsigned mask = __activemask(); __shfl_sync(mask, …);
is there any difference? If the function only operates on the threads in the mask that are active, I do not see one.
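Here is a small sketch I would use to compare the two side by side. All names (broadcast_two_ways, run_broadcast) are hypothetical. The __activemask() variant runs first so that the full-mask call cannot reconverge the warp ahead of it. I only trust the full-mask output here: if lane 0 happens not to be in the mask that __activemask() returns for some lanes, I believe the value those lanes read from lane 0 is undefined.

```cuda
#include <cuda_runtime.h>

// Broadcast lane 0's value after a divergent branch, once with whatever
// lanes are currently active and once with all 32 lanes named explicitly.
__global__ void broadcast_two_ways(int *full_out, int *active_out)
{
    unsigned lane = threadIdx.x % 32;
    int val;
    if (lane < 16)
        val = 100 + lane;           // "do something"
    else
        val = 200 + lane;           // "do something else"

    // Variant B: only the lanes that happen to be active together right now.
    unsigned mask = __activemask();
    active_out[lane] = __shfl_sync(mask, val, 0);

    // Variant A: all 32 lanes are named in the mask.
    full_out[lane] = __shfl_sync(0xFFFFFFFF, val, 0);
}

// Hypothetical host helper: run one warp and copy both result arrays back.
// Returns false if a CUDA runtime call fails.
bool run_broadcast(int full[32], int active[32])
{
    int *d_full = nullptr, *d_active = nullptr;
    if (cudaMalloc(&d_full, 32 * sizeof(int)) != cudaSuccess)
        return false;
    if (cudaMalloc(&d_active, 32 * sizeof(int)) != cudaSuccess) {
        cudaFree(d_full);
        return false;
    }
    broadcast_two_ways<<<1, 32>>>(d_full, d_active);
    bool ok = cudaMemcpy(full, d_full, 32 * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && cudaMemcpy(active, d_active, 32 * sizeof(int),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_full);
    cudaFree(d_active);
    return ok;
}
```

With the full mask I would expect every lane to receive lane 0's value (100). Whether the __activemask() variant matches on a particular run tells me nothing about whether it is guaranteed to, which is really the heart of my question.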
I only list these examples because many people have told me that all I need to do to update legacy shfl calls is to use either __activemask() or a full mask of 0xFFFFFFFF, but I do not think it is that simple.
Thanks in advance for any help.