__ballot_sync inside for loop causes kernel to hang in sm_75

I recently encountered a problem when using __ballot_sync(). A simplified version of the code follows:

#define A 333

__device__ inline void shuffle_then_add(unsigned mask, int threadid, float *dst, float var)
{
    for (int offset = 16; offset > 0; offset /= 2) {
        float right = __shfl_down_sync(mask, var, offset);
        var += (((threadid & 31) + offset) < 32) ? right : 0;
    }
    if ((threadid & 31) == 0)
        atomicAdd(dst, var);
}

__global__ void __launch_bounds__(512) kernel_inference(const float* __restrict__ src, float* __restrict__ out)
{
    for (int x = threadIdx.x; x < A + 1; x += blockDim.x) {
        unsigned mask = __ballot_sync(0xffffffff, x < A);
        if (x < A) {
            shuffle_then_add(mask, threadIdx.x, out, src[x]);
            atomicAdd(out, fabsf(src[x]));
        }
    }
}

Note: the shuffle_then_add() function follows the tree-reduction example in https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/

When compiled for sm_61 (with CUDA 10.2) and run on a 1080 Ti, this code runs fine; but when compiled for sm_75 (with either CUDA 10.2 or 11.1) and run on a 2080 Ti, the kernel hangs whenever A + 1 is NOT a multiple of 32 (and is less than 512).
I’m calling this kernel with a single thread block of size 512.

Is my usage of __ballot_sync() inside a for loop correct (for an arbitrary constant A)?
If not, how can I fix it?
Thank you!

Quoting from the documentation:

All active threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.

If you pass 0xFFFFFFFF to the function, every thread in the warp must call it. That is not the case here because of the loop condition: threads whose index never satisfies x < A + 1 skip the loop body entirely and therefore never execute the ballot.

One possible fix is given in the blog post you already linked, in the section on “opportunistic” warp-level programming:

unsigned mask = __ballot_sync(__activemask(), x<A);

Whether or not that is sensible for your code I can’t say. In particular, you haven’t shown how A is computed. Unless it is constant across the warp, your usage of shuffle_then_add doesn’t make sense to me. Even then, I have not studied your code carefully, so this is just a suggestion.

Thanks! So you suggest rounding the loop constant up to a multiple of 32?

Yes, A is constant, but not a multiple of 32. I’m confused by this sentence in the blog, though: “The CUDA execution model does not guarantee that all threads taking the branch together will execute the __activemask() together.”

The Volta execution model permits the possibility of warp divergence at any point in time. Developers generally shouldn’t assume they can predict divergence state by looking at source code.

Therefore, my suggestion would be that if you go with the opportunistic model, you should design your code in such a way that it produces the right answer in spite of “arbitrary” divergence. That probably requires careful coding design, and a casual read of the limited code you have shown suggests to me that it would not be fully compliant with this model, in its current form.

The alternative would be to design your code in a prescriptive fashion, i.e., require a particular warp divergence state through the use of proper conditional coding as well as _sync() constructs. In your original case, you violated a requirement of that prescriptive model.

I cannot state whether “rounding” is the correct solution. I would not ordinarily assume that I can take a constant in a particular program, and “round” it to another value, and have correct behavior. Nevertheless, the prescriptive approach (i.e. get rid of the original error you had, somehow) may be the most straightforward approach.
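For what it’s worth, here is one way the prescriptive approach could look — a sketch only, untested, and assuming (as in the question) that A is a compile-time constant and the kernel runs as a single 512-thread block. The idea is to round the loop’s trip count up to a multiple of the block size, so every thread of every warp executes __ballot_sync the same number of times, while the real bound lives only in the predicate and the guarded body:

```cuda
// Sketch of a prescriptive fix (untested). The loop bound is rounded up to a
// multiple of blockDim.x, so ALL threads of the block execute __ballot_sync
// on every trip; threads past the real bound contribute 0 bits to the mask
// and skip the guarded body.
__global__ void __launch_bounds__(512)
kernel_inference(const float* __restrict__ src, float* __restrict__ out)
{
    const int bound = (A + 1 + blockDim.x - 1) / blockDim.x * blockDim.x;
    for (int x = threadIdx.x; x < bound; x += blockDim.x) {
        // Every thread of the warp reaches this call, so the full mask is legal.
        unsigned mask = __ballot_sync(0xffffffff, x < A);
        if (x < A) {
            shuffle_then_add(mask, threadIdx.x, out, src[x]);
            atomicAdd(out, fabsf(src[x]));
        }
    }
}
```

This removes the ballot error itself; whether the lane-index guard inside shuffle_then_add handles a partially populated mask correctly is the separate design question I raised above.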

OK, but we do need to guarantee that all threads satisfying (x < A) end up in the mask on that line of code.
Otherwise the semantics of the tree reduction are wrong (a gap in the list of participating threads means partial results cannot be passed up to the root).

Then get rid of the original coding error that prevented that. I can’t tell you precisely how to do that. Please reread my previous comment, which I edited.