Hi,

I recently ran into a problem when using __ballot_sync(). A simplified version of the code is shown below:

```
#define A 333

__device__ inline void shuffle_then_add(unsigned mask, int threadid, float *dst, float var)
{
    for (int offset = 16; offset > 0; offset /= 2)
    {
        float right = __shfl_down_sync(mask, var, offset);
        var += (((threadid & 31) + offset) < 32) ? right : 0;
    }
    if ((threadid & 31) == 0)
        atomicAdd(dst, var);
}

__global__ void __launch_bounds__(512) kernel_inference(const float* __restrict__ src, float* __restrict__ out)
{
    for (int x = threadIdx.x; x < A + 1; x += blockDim.x)
    {
        unsigned mask = __ballot_sync(0xffffffff, x < A);
        if (x < A)
        {
            shuffle_then_add(mask, threadIdx.x, out, src[x]);
        }
        else
        {
            atomicAdd(out, fabsf(src[x]));
        }
    }
}
```

Note: the `shuffle_then_add()` function follows the tree-reduction example in https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/

When compiled for sm_61 (with CUDA 10.2) and run on a 1080 Ti, this code works fine; but when compiled for sm_75 (with either CUDA 10.2 or 11.1) and run on a 2080 Ti, the kernel hangs whenever A+1 is **NOT a multiple of 32** (and less than 512).

I’m calling this kernel with a single thread block of size 512.

Is my usage of __ballot_sync() inside a for loop correct (for arbitrary constant A)?

If not, how can I fix it?

Thank you!