Does __shfl_down_sync imply that all threads are involved, even those outside the branch?
Code
//test.cu
#include <stdio.h>

__global__ void testkernel()
{
#ifndef FULLMASK
    // mask holds the set of lanes for which threadIdx.x < 16 (0x0000ffff here)
    unsigned int mask = __ballot_sync(0xffffffff, threadIdx.x < 16);
#endif
    if (threadIdx.x < 16)
    {
        int target = threadIdx.x;
        for (int iter = 1; iter <= 8; iter <<= 1)
        {
#ifdef FULLMASK
            // full mask: all 32 lanes of the warp are required to participate
            int datafetch = __shfl_down_sync(0xffffffff, target, iter);
#else
            // ballot mask: only the 16 lanes inside this branch are required
            int datafetch = __shfl_down_sync(mask, target, iter);
#endif
            printf("pass\n");
            target += datafetch;
        }
    }
    __syncthreads();
}

int main()
{
    testkernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
Build with:
nvcc -o test test.cu -DFULLMASK
or
nvcc -o test test.cu
Some Google results show that warp intrinsics on modern architectures work well with branch divergence (e.g., inside an if).
That is true when the __syncthreads() is removed from my code. However, a more complicated scenario needs a barrier, and with the barrier in place the FULLMASK build deadlocks; a sketch of such a scenario follows.
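For instance, here is a minimal sketch of the kind of scenario I mean (hypothetical kernel, names are mine): the half-warp's result is published through shared memory, so the rest of the block genuinely needs the barrier, and the full-mask shuffle then hangs just like my FULLMASK build.

__global__ void needs_barrier(int *out)
{
    __shared__ int partial;
    if (threadIdx.x < 16)
    {
        int v = threadIdx.x;
        for (int iter = 1; iter <= 8; iter <<= 1)
            v += __shfl_down_sync(0xffffffff, v, iter); // full mask, as with -DFULLMASK
        if (threadIdx.x == 0)
            partial = v;                                // publish the half-warp's sum
    }
    __syncthreads();                // the upper half-warp must wait for partial...
    out[threadIdx.x] = partial;     // ...so it cannot exit, and the shuffle hangs
}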
I am confused about how the 32 threads of a warp are organized around the various warp intrinsics when the intrinsic sits inside an if.
Does the divergence happen at the entry of the if and during the execution of __shfl_down_sync (especially from the hardware's point of view)?
The deadlock in my code (at least on my machine) indicates that half of the threads idle at the __shfl_down_sync when the full mask is active, as if the if statement had no effect.
Thank you in advance for any comments on this case at the hardware level.
Yes, and no. In typical usage, the expectation is that all threads named in the mask are participating. In general, if you violate that, you are exploring UB. However, there is an exception, as noted in both of the above links: if the warp schedulers can find matching instructions both inside and outside the current control flow, such that the combination of those two issues satisfies the member mask requirement, that is also acceptable. A minimal sketch of that pattern follows.
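Here is my illustrative kernel for that exception (assuming Volta or later): the same full-mask shuffle is issued on both sides of the branch, so the scheduler can combine the two issues to satisfy the mask.

__global__ void paired_shuffles(int *out)
{
    int v = threadIdx.x;
    int r;
    if (threadIdx.x < 16)
        r = __shfl_sync(0xffffffff, v, 0);  // issued by the lower half-warp
    else
        r = __shfl_sync(0xffffffff, v, 0);  // matching issue by the upper half-warp
    out[threadIdx.x] = r;                   // every lane receives lane 0's value
}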
It’s also important to note that there is a distinction in behavior (as well as in what is allowed) between pre-Volta architectures and Volta and beyond. The “exception” I listed above, for example, only applies to Volta and beyond. On a pre-Volta architecture, when the currently active threads do not satisfy the member mask, you are exploring UB, with no exceptions.
Yes: the divergence begins at the entry of the if, and the threads are still diverged when __shfl_down_sync executes.
The reason your code hangs is the __syncthreads(), as you have already pointed out, and this exact scenario is covered in one of the links I provided. Basically, the rules for satisfying the member mask also grant an exception for “exited threads”. Without the __syncthreads(), the threads not participating due to control flow simply exit (at least on Volta and beyond). The control-flow-selected participating threads in the shuffle op can then satisfy the member mask when combined with the exited threads, so there is no hang: the shuffle op proceeds. However, when you include the __syncthreads(), the threads not participating in the shuffle op (due to control flow) have also not exited; as a result, the shuffle op's sync requirement is never met, and the code hangs. One way to keep both the full mask and the barrier is sketched below.
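A minimal sketch of that idea (my suggestion, not the only possible fix): hoist the shuffle out of the branch so every lane issues it, and guard only the use of the result.

__global__ void testkernel_fixed()
{
    int target = threadIdx.x;
    for (int iter = 1; iter <= 8; iter <<= 1)
    {
        // all 32 lanes issue the shuffle, so 0xffffffff is always satisfied
        int datafetch = __shfl_down_sync(0xffffffff, target, iter);
        if (threadIdx.x < 16)   // only the lower half-warp consumes the result
            target += datafetch;
    }
    __syncthreads();            // safe: no lane is stranded inside the branch
}

The data fetched from lanes 16 and above is now defined (each upper lane still carries its own target), though whether those values are meaningful is up to your algorithm.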
I will also point out that none of this contravenes the fact that if a shuffle op involves non-participating threads (which is the case for exited threads), then the warp-lane transfers from those non-participating threads are also UB. If your shuffle op specifies movement of data from a non-participating thread (whether due to control flow or to exiting), then any reference to data from that thread is undefined. One way to avoid consuming such data in the non-FULLMASK variant of your code is sketched below.
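A minimal sketch against the non-FULLMASK variant (illustrative): every lane in the 0x0000ffff mask still issues the shuffle, but a lane only uses the result when its source lane (threadIdx.x + iter) is itself a participant.

__global__ void guarded_kernel()
{
    unsigned int mask = __ballot_sync(0xffffffff, threadIdx.x < 16); // 0x0000ffff
    if (threadIdx.x < 16)
    {
        int target = threadIdx.x;
        for (int iter = 1; iter <= 8; iter <<= 1)
        {
            int datafetch = __shfl_down_sync(mask, target, iter);
            if (threadIdx.x + iter < 16)    // source lane is inside the mask
                target += datafetch;        // only defined data is consumed
        }
    }
}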