Hi,
I am experiencing some unexpected behaviour when using atomic operations on Volta GPUs.
Here is a simplified version of my kernel:
__global__
void count_kernel(size_t const size, int32_t const threshold, int32_t * index, int32_t const * __restrict__ value, int32_t * result)
{
  // block-wide counter, initialised by the first thread
  __shared__ int32_t count;
  if (threadIdx.x == 0)
    count = 0;
  __syncthreads();

  // each thread strides over the input with step blockDim.x
  for (int i = threadIdx.x; i < size; i += blockDim.x) {
    if (index[i] == i) {
      if (value[i] >= threshold) {
        // signal
        auto old = atomicAdd_block(&count, 1);
        index[i] = -(old + 1);
      } else {
        // noise
        index[i] = -9999;  // Barrier error detected. Divergent thread(s) in warp
      }
    }
  }
  __syncthreads();  // Barrier error detected. Divergent thread(s) in warp

  if (threadIdx.x == 0)
    *result = count;
}
The idea is to count how many elements match certain criteria; this is done by keeping a counter in shared memory and incrementing it atomically with atomicAdd (or atomicAdd_block).
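The kernel is meant to run as a single block, so the shared counter covers all the threads; for reference, it is launched along these lines (a minimal sketch with made-up sizes and values, the real setup is in the linked example below):

#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

int main() {
  size_t const size = 1024;
  std::vector<int32_t> h_index(size), h_value(size, 42);
  for (size_t i = 0; i < size; ++i)
    h_index[i] = static_cast<int32_t>(i);  // mark every element as a candidate

  int32_t *d_index, *d_value, *d_result;
  cudaMalloc(&d_index, size * sizeof(int32_t));
  cudaMalloc(&d_value, size * sizeof(int32_t));
  cudaMalloc(&d_result, sizeof(int32_t));
  cudaMemcpy(d_index, h_index.data(), size * sizeof(int32_t), cudaMemcpyHostToDevice);
  cudaMemcpy(d_value, h_value.data(), size * sizeof(int32_t), cudaMemcpyHostToDevice);

  // a single block, so the __shared__ counter and atomicAdd_block see all the threads
  count_kernel<<<1, 256>>>(size, 10, d_index, d_value, d_result);

  int32_t h_result;
  cudaMemcpy(&h_result, d_result, sizeof(int32_t), cudaMemcpyDeviceToHost);
  // h_result should be the number of elements with value[i] >= threshold
  return 0;
}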
The kernel works as expected on Kepler (sm_35) and Pascal (sm_60) GPUs.
On Volta (sm_70) GPUs the behaviour is erratic, with different outcomes depending on the compilation flags and on whether the kernel runs under cuda-memcheck.
In fact, cuda-memcheck --tool synccheck reports “Barrier error detected. Divergent thread(s) in warp” at the two commented lines.
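(The exact invocation is along the lines of “cuda-memcheck --tool synccheck ./count”, with ./count standing in for the actual test binary.)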
I have read about the changes in the warp execution model of Volta with respect to previous architectures, but I could not find any mention of atomic operations.
Is the change in behaviour expected?
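For what it is worth, the only restructuring I could think of is to aggregate the atomic update over each coalesced group using cooperative groups, along these lines (an untested sketch of the idea, not what the code above does):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__
void count_kernel_cg(size_t const size, int32_t const threshold, int32_t * index, int32_t const * __restrict__ value, int32_t * result)
{
  __shared__ int32_t count;
  if (threadIdx.x == 0)
    count = 0;
  __syncthreads();

  for (int i = threadIdx.x; i < size; i += blockDim.x) {
    if (index[i] == i) {
      if (value[i] >= threshold) {
        // one atomic per coalesced group instead of one per thread
        auto active = cg::coalesced_threads();
        int32_t old = 0;
        if (active.thread_rank() == 0)
          old = atomicAdd_block(&count, (int32_t) active.size());
        old = active.shfl(old, 0) + active.thread_rank();  // this thread's slot
        index[i] = -(old + 1);
      } else {
        index[i] = -9999;
      }
    }
  }
  __syncthreads();

  if (threadIdx.x == 0)
    *result = count;
}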
A self-contained example is available at https://cernbox.cern.ch/index.php/s/l6miDCrqyHopEzF.
Thank you,
.Andrea
P.S.
Despite what the Programming Guide says, compiling with “-gencode arch=compute_60,code=sm_70” does not recover the behaviour seen on Pascal.
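That is, something along the lines of (count.cu is a placeholder for the actual source file):

nvcc -gencode arch=compute_60,code=sm_70 -o count count.cu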