Misaligned double atomic, but only when NSight is debugging remotely

I’m doing a simple atomicAdd using doubles (Win7 / VS2015 SP3 / Nsight 5.3), with Nsight debugging a remote machine. The remote card (a 1060) has compute capability 6.1, which should be fine for a double atomicAdd. However, Nsight complains about a misaligned atomic. BTW, during the run I queried the card’s name, and it’s the right one. But if I run the exact same code locally on the (previously) remote machine, all’s fine. What’s happening here?

The misaligned-atomic error comes from the Nsight memory checker. Could you check whether it still happens when the Nsight memory checker is off? Also, could you tell me which driver you use?

Thanks for replying…
If I uncheck the memory checker option under Nsight, then there are no error notifications.
Video driver is 384.76.

Hmm, please check whether your code has a memory issue. I will try to repro it locally to identify whether it’s a bug.

Also, could you please use cuda-memcheck rather than the memory checker in Nsight to check your app, and see whether there’s a difference?

So I’ve simplified everything down to Nsight running locally on one machine, with a single 1060 running the minimal code below. With command-line cuda-memcheck, all’s fine. The only issue I have is when the Nsight memory checker built into Visual Studio (running locally) is enabled. The behaviour is the same if I point Nsight at a remote machine.

No memory failures either, as you can see.

// Minimal repro. Compile for sm_60 or newer, e.g. nvcc -arch=sm_61 why.cu
// (atomicAdd on double requires compute capability 6.0+).
__global__ void why(
    double *dd,
    float *ff)
{
    float thingFloat = 3.0f;
    float qf = atomicAdd(ff, thingFloat);   // float atomicAdd: no complaint
    (void)qf;

    double thingDouble = 3.0;
    atomicAdd(dd, thingDouble);             // double atomicAdd: flagged as misaligned
}

int main()
{
    double *dd;
    cudaMalloc((void **)&dd, 100 * sizeof(double));
    float *ff;
    cudaMalloc((void **)&ff, 100 * sizeof(float));

    why<<<1, 1>>>(dd, ff);
    cudaDeviceSynchronize();

    cudaFree(dd);
    cudaFree(ff);
    return 0;
}

OK, I see. So it only happens when remote debugging on a 1060, right?

I’ve got it happening locally, all on one system now, with a 1060 installed (alongside an old Quadro K2000).
But if I point Nsight at a similar remote system, it still fails. So it’s not the card, at least.

Confirmed, bug has been raised.

No repro in our internal build; it should be fixed in Nsight 5.6.

I think I ran into the same issue in my code. The issue is still present after updating NSight to 5.6.
I’m also able to reproduce the issue, using the code posted in message #5.

CUDA Memory Checker detected 1 threads caused an access violation:
Launch Parameters
    CUcontext    = 1e491e60
    CUstream     = 12901af80
    CUmodule     = 3941f180
    CUfunction   = 1318152f0
    FunctionName = _Z3whyPdPf
    GridId       = 6
    gridDim      = {1,1,1}
    blockDim     = {1,1,1}
    sharedSize   = 256
        dd = 0x0000007009600600  0
        ff = 0x0000007009600a00  3
    Parameters (raw):
         0x09600600 0x00000070 0x09600a00 0x00000070
GPU State:
   Address  Size      Type  Mem       Block  Thread         blockIdx  threadIdx                                                                                                    PC  Source
7009600600     0  mis atom    g           0       0          {0,0,0}    {0,0,0}  _ZN74_INTERNAL_52_tmpxft_00002da4_00000000_12_spcov_compute_60_cpp1_ii_d4ad2c259atomicAddEPdd+0001b0  c:\program files\nvidia gpu computing toolkit\cuda\v8.0\include\device_functions.hpp:1623

Summary of access violations:
c:\program files\nvidia gpu computing toolkit\cuda\v8.0\include\device_functions.hpp(1623): error MemoryChecker: #misaligned=1  #invalidAddress=0

Memory Checker detected 1 access violations.
error = misaligned atomic (global memory)
gridid = 6
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0x7009600600
accessSize = 0

My setup is: Windows 10, Visual Studio 2013 update5, cuda_8.0.61_win10 + cuda_8.0.61.2_windows, GTX1080Ti, 384.94 driver.

I’m using the “legacy debugger”, because the Nsight System Requirements page says that “next gen” is not supported on Pascal with WDDM at all. And I’m using the r384 driver because the “Launch the CUDA Debugger” section says that the legacy debugger requires it.

harryz_, were you only talking about the “next gen” debugger?

I do remember it disappeared in CUDA 9.0. Anyway, I’ve checked it on CUDA 9.2: it still exists and only happens with atomicAdd on double. I will raise a bug to track it.

Thanks, I hope it gets fixed for CUDA 8.0 as well.