unexpected behaviour from atomics on Volta

Hi,
I am experiencing some unexpected behaviour when using atomic operations on Volta GPUs.

Here is a simplified version of my kernel:

__global__
void count_kernel(size_t const size, int32_t const threshold, int32_t * index, int32_t const * __restrict__ value, int32_t * result)
{
  __shared__ int32_t count;
  if (threadIdx.x == 0)
    count = 0;
  __syncthreads();

  for (int i = threadIdx.x; i < size; i += blockDim.x) {
    if (index[i] == i) {
      if (value[i] >= threshold) {
        // signal
        auto old = atomicAdd_block(&count, 1);
        index[i] = -(old + 1);
      } else {
        // noise
        index[i] = -9999;                   // Barrier error detected. Divergent thread(s) in warp
      }
    }
  }

  __syncthreads();                          // Barrier error detected. Divergent thread(s) in warp
  if (threadIdx.x == 0)
    *result = count;
}

The idea is to count how many elements match certain criteria; this is done by keeping a counter in shared memory and using atomicAdd (or atomicAdd_block) to increment it atomically.
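In the simplified version the counter lives in shared memory and the loop strides only by blockDim.x, so a single-block launch is assumed, along these lines (the block size and buffer names here are just placeholders):

count_kernel<<<1, 256>>>(size, threshold, index_d, value_d, result_d);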

The kernel works as expected on Kepler (sm_35) and Pascal (sm_60) GPUs.
On Volta (sm_70) GPUs the behaviour is erratic, with different outcomes depending on the compilation flags and on whether the code runs under cuda-memcheck.

In fact, cuda-memcheck --tool synccheck reports “Barrier error detected. Divergent thread(s) in warp” at the two commented lines.

I have read about the changes in warp execution on Volta with respect to previous architectures, but I could not find any mention of atomic operations.
Is the change in behaviour expected?

A self-contained example is available at https://cernbox.cern.ch/index.php/s/l6miDCrqyHopEzF .

Thank you,
.Andrea

P.S.
Despite what the Programming Guide says, compiling with “-gencode arch=compute_60,code=sm_70” does not recover the behaviour from Pascal.

I am seeing the same weirdness in code compiled for Turing. I’ve also tried

-gencode arch=compute_60,code=sm_75

In my case only a few threads per warp participate in the atomicAdd to a 32-bit integer in global memory. Why should this be a problem now?

In my case I fixed it by using the warp vote function __any(condition) to make sure that all threads of the warp participate in the atomic, and by passing “condition ? 1 : 0” as the argument to atomicAdd so that only the threads matching “condition” actually add 1 to my integer counter.
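In code the workaround looks roughly like this (just a sketch, assuming counter points to the 32-bit integer in global memory and condition is the per-thread predicate; I show the _sync form of the vote, which is the recommended variant on Volta):

if (__any_sync(__activemask(), condition)) {
  // every thread of the warp now reaches the atomic, but only the
  // threads matching the condition actually contribute 1
  atomicAdd(counter, condition ? 1 : 0);
}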

Christian

Interestingly, if I replace

auto old = atomicAdd_block(&count, 1);

with

auto old = atomicInc_block(&count, 0xffffffff);

it seems to work on both Pascal and Volta.

Does cuda-memcheck’s synccheck tool still report “Barrier error detected. Divergent thread(s) in warp” errors when you use atomicInc_block() ?

Christian

Yes, I see the exact same behaviour from atomicAdd_block and from atomicAdd.

The two functions are compiled into different PTX:

--- atomicAdd/testAtomicAdd.compute_70.ptx      2018-10-01 17:58:56.450025723 +0200
+++ atomicAdd_block/testAtomicAdd.compute_70.ptx        2018-10-01 17:59:25.811317038 +0200
@@ -97,7 +97,7 @@ BB0_7:
 
 BB0_6:
        mov.u32         %r9, _ZZ12count_kernelmiPiPKiS_E5count;
-       atom.shared.add.u32     %r10, [%r9], 1;
+       atom.shared.cta.add.u32         %r10, [%r9], 1;
 
 ///home/fwyzard/test/atomicAdd/testAtomicAdd.cu:22         index[i] = -(old + 1);
        .loc 1 22 9

However, after assembling into a cubin and disassembling, the output is the same, so I guess ptxas takes advantage of the fact that shared memory is always limited to the current block.

Switching to atomicInc_block, the PTX changes only slightly, from

atom.shared.cta.add.u32         %r10, [%r9], 1;

to

atom.shared.cta.inc.u32         %r10, [%r9], -1;

Instead, the disassembly from atomicAdd_block:

//## File "testAtomicAdd.cu", line 25
                                   S2R R9, SR_LANEID ;
                                   VOTE.ALL R5, PT, PT ;
                                   FLO.U32 R8, R5 ;
                                   POPC R7, R5 ;
                                   S2R R10, SR_LTMASK ;
                                   ISETP.EQ.U32.AND P0, PT, R8, R9, PT ;
                               @P0 ATOMS.ADD R7, [RZ], R7 ;
                                   LOP3.LUT R10, R10, R5, RZ, 0xc0, !PT ;
                                   POPC R10, R10 ;
                                   SHFL.IDX PT, R4, R7, R8, 0x1f ;
                                   IMAD.IADD R4, R4, 0x1, R10 ;

is very different from the atomicInc_block:

//## File "testAtomicInc.cu", line 25
                                   IMAD.MOV.U32 R3, RZ, RZ, -0x1 ;
                                   ATOMS.INC R2, [RZ], R3 ;

The former seems much more complicated, and I don’t really know what is going on there…

One more thing: building the example with ‘-Xptxas -O0’ results in a different disassembly:

/*06f0*/                   MOV R2, RZ ;
/*0700*/                   MOV R2, R2 ;
/*0710*/                   MOV R3, 0x1 ;
/*0720*/                   ATOMS.ADD R2, [R2], R3 ;

and the resulting executable works as expected, with synccheck reporting no problems.
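For reference, the flag is passed through nvcc along these lines (the exact options may differ):

nvcc -gencode arch=compute_70,code=sm_70 -Xptxas -O0 testAtomicAdd.cu -o testAtomicAdd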

And I just found out that the solution described in the “CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics” post, using cooperative groups, works as well.
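For reference, that technique boils down to a small helper along these lines (a sketch following the blog post; the name atomicAggInc is just illustrative):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__device__ int32_t atomicAggInc(int32_t * counter)
{
  // collect the threads of the warp that are currently active
  cg::coalesced_group active = cg::coalesced_threads();
  int32_t old = 0;
  if (active.thread_rank() == 0)
    // one thread performs a single atomic on behalf of the whole group
    old = atomicAdd(counter, (int32_t) active.size());
  // broadcast the leader's old value; each thread then gets its own slot
  old = active.shfl(old, 0);
  return old + active.thread_rank();
}

In the kernel above, “auto old = atomicAdd_block(&count, 1);” then simply becomes “auto old = atomicAggInc(&count);”.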

As you’ve already discovered, the compiler may choose to implement warp-aggregation of atomics.

That’s evidently what is going on in your SASS snippet. I haven’t studied the rest of this thread, however.
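Roughly speaking, the aggregated sequence in that SASS does the equivalent of the following (an illustrative source-level sketch, not an exact transcription of the generated code):

unsigned mask   = __activemask();                 // which lanes reached the atomic together (VOTE)
unsigned lane   = threadIdx.x & 31u;              // this thread's lane id
unsigned leader = __ffs(mask) - 1;                // elect one of the participating lanes
int old = 0;
if (lane == leader)
  old = atomicAdd_block(&count, __popc(mask));    // one atomic on behalf of the whole warp
old  = __shfl_sync(mask, old, leader);            // broadcast the leader's old value (SHFL)
old += __popc(mask & ((1u << lane) - 1u));        // add this lane's rank among the lower participants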

Hi Robert,
thanks for confirming the behaviour.

I would say the question now is whether the generated object code is safe on the Volta architecture (and synccheck is reporting a false positive on Volta), or not (and ptxas should not generate this object code for sm_70).

I doubt that the synccheck report could be characterized as a “false positive”. I’m reasonably sure that it is an accurate reflection of the underlying convergence state.

I’m interested in the statement about variable or incorrect behavior under Volta. I can reproduce the behavior of the synccheck error. However, my code to do the atomics always gives the correct result for the atomic as well as the old values.

Can you provide a short, complete reproducer that clearly indicates the variable behavior you referred to on Volta? Please don’t redirect me to the link you’ve already provided. Please provide a short reproducer that focuses exactly on this variability and prints it out clearly so I can see it with no ambiguity or interpretation needed. Please edit this code directly into this thread; I don’t want to have to access an external link.

If you can demonstrate the variable behavior, then it seems evident to me that the code is unsafe.

To recap: in my view, the desired behavior for the cuda-memcheck tool is that it should report no errors for valid code. I’m able to reproduce something that looks counter to that idea, and I’ve already filed an internal bug to have that investigated to see if my thinking is correct.

The more serious issue, in my opinion, if it can be demonstrated, would be incorrect functional behavior of the code. So far I don’t have my hands on a reproducer for that, and I would like to see it if one can be crafted.

Hi Robert,
I have not been able to build a test case that shows the incorrect behaviour.

I will be interested to hear if somebody at NVIDIA can figure out whether this is a false positive in cuda-memcheck, or a real issue in ptxas.

Thank you,
.Andrea

Yes, the issue around the cuda-memcheck synccheck tool in the presence of compiler-generated aggregated atomics is a tools issue; it is expected to be rectified in a future CUDA release. As far as that report goes, it is not an accurate indication of an underlying code defect.

Just for the record, this seems to have been fixed with NVIDIA driver version 418.40.