atomicMax broken with -O0 on CUDA 7

Hi all,

I’m having issues with atomicMax on the latest CUDA 7.

This happens when compiling my code with -O0 -G and with a --maxrregcount limit set.
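For reference, the compile line looks roughly like this (the file name, architecture and register cap below are placeholders, not my actual values):

```shell
# Hypothetical nvcc invocation: debug build (-G), no optimization (-O0),
# with a per-thread register cap. The actual values in my project differ.
nvcc -O0 -G --maxrregcount=32 -arch=sm_52 -o trackfollowing trackfollowing.cu
```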

My setup is:

[dcampora@gpu01 trackfollowing]$ uname -a
Linux gpu01 2.6.32-504.16.2.el6.x86_64 #1 SMP Tue Apr 21 21:44:51 CEST 2015 x86_64 x86_64 x86_64 GNU/Linux
[dcampora@gpu01 trackfollowing]$ nvidia-smi
Thu May 28 06:05:24 2015       
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                       
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 980     Off  | 0000:03:00.0     N/A |                  N/A |
|  0%   33C    P0    N/A /  N/A |     15MiB /  4095MiB |     N/A      Default |

Here’s a cuda-gdb run of my code:

(cuda-gdb) n
505               atomicMax(max_numhits_to_process, num_h1_to_process);
(cuda-gdb) n
508             __syncthreads();
(cuda-gdb) n
warning: Warp(s) other than the current warp had to be single-stepped.
511             for (int j=0; j<((int) ceilf(((float) (max_numhits_to_process[0])) / blockDim.y)); ++j) {
(cuda-gdb) p max_numhits_to_process
$6 = (@generic unsigned int * @register) 0x704f00018
(cuda-gdb) p max_numhits_to_process[0]
$7 = 0
(cuda-gdb) p num_h1_to_process   
$8 = 19

Any clues?

Can you provide a simple, complete test case that I could compile and run to demonstrate the issue?

Unfortunately no, I don’t have a simple test case, otherwise I would post it.

On top of that, it’s very hard to reproduce. I have encountered it only rarely, and I made some changes that “fixed” it; my guess is that changing the control flow of the algorithm simply stops triggering this internal data race, or whatever it is.

If I manage to put together a full example, I’ll post it - but it won’t be simple.