How to interpret cudaMemCheck output of access violation?

I’m trying to debug a program in which the CUDA memory checker reports 19 memory access violations. Only some of the threads seem to cause the violation, while the others work fine, and I don’t know how to interpret the cudaMemCheck output. Can anyone give me a hint?

================================================================================
CUDA Memory Checker detected 19 threads caused an access violation:
Launch Parameters
    CUcontext    = 1414b2b6a70
    CUstream     = 1414d6814d0
    CUmodule     = 14158bedf60
    CUfunction   = 14158bcedd0
    FunctionName = _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_
    GridId       = 14
    gridDim      = {32,32,1}
    blockDim     = {16,16,1}
    sharedSize   = 256
    Parameters:
        SPEC_TOTAL_SAMPLES = 5
        QUALITY_ROOM = false
        LeScale = 0.25
        SPEC_SAMPLE_STEP = 15
        CHROMA = 100
        dev_le = 0x0000000500820000  4.92573287131384e-47
        dev_l = 0x0000000500401200  -6.27743856220419e+66
        dev_le_mean = 0x0000000500401400  32999589.2369607
        dev_temperature_grid = 0x0000000500720000  293.15
        dev_image = 0x0000000500500000  0.24825993
        dev_xm = 0x0000000500400c00  15.0002254584219
        dev_ym = 0x0000000500400e00  9.99909083818233
        dev_zm = 0x0000000500401000  14.9976030252868
        dev_eyepos = 0x0000000500400000  {x = 2, y = 2, z = 7.5}
        dev_forward = 0x0000000500400200  {x = 0, y = 0.341743063086704, z = -0.939793423488437}
        dev_right = 0x0000000500400400  {x = 1, y = 0, z = 0}
        dev_up = 0x0000000500400600  {x = 0, y = 0.939793423488437, z = 0.341743063086704}
        dev_minCoord = 0x0000000500400800  {x = 0, y = 0, z = 0}
        dev_maxCoord = 0x0000000500400a00  {x = 4, y = 4, z = 4}
    Parameters (raw):
         0x00000005 0x00000100 0x00000000 0x3fd00000
         0x0000000f 0x00000064 0x00820000 0x00000005
         0x00401200 0x00000005 0x00401400 0x00000005
         0x00720000 0x00000005 0x00500000 0x00000005
         0x00400c00 0x00000005 0x00400e00 0x00000005
         0x00401000 0x00000005 0x00400000 0x00000005
         0x00400200 0x00000005 0x00400400 0x00000005
         0x00400600 0x00000005 0x00400800 0x00000005
         0x00400a00 0x00000005
GPU State:
   Address  Size      Type  Mem       Block  Thread         blockIdx  threadIdx                                                                   PC  Source
------------------------------------------------------------------------------------------------------------------------------------------------------------
  00000000     8    adr st    g         592      77        {16,18,0}   {13,4,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      78        {16,18,0}   {14,4,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      79        {16,18,0}   {15,4,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      80        {16,18,0}    {0,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      81        {16,18,0}    {1,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      82        {16,18,0}    {2,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      83        {16,18,0}    {3,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      84        {16,18,0}    {4,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      85        {16,18,0}    {5,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      86        {16,18,0}    {6,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      87        {16,18,0}    {7,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      88        {16,18,0}    {8,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      89        {16,18,0}    {9,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      90        {16,18,0}   {10,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      91        {16,18,0}   {11,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      92        {16,18,0}   {12,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      93        {16,18,0}   {13,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      94        {16,18,0}   {14,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292
  00000000     8    adr st    g         592      95        {16,18,0}   {15,5,0}  _Z11colorKernelibdiiPdS_S_S_PfS_S_S_P7Vector3S2_S2_S2_S2_S2_+005350  e:\github\pvr\pvr\renderer.cu:292


Summary of access violations:
e:\github\pvr\pvr\renderer.cu(292): error MemoryChecker: #misaligned=0  #invalidAddress=32
================================================================================

Memory Checker detected 19 access violations.
error = access violation on store (global memory)
gridid = 14
blockIdx = {16,18,0}
threadIdx = {13,4,0}
address = 0x00000000
accessSize = 8

It looks like these are all instances of the same store to global memory in the code, just encountered by multiple threads of one block at run time. You would want to take a critical look at

e:\github\pvr\pvr\renderer.cu:292

that is, line 292 of source file renderer.cu, and look at the writes occurring in that line (likely just a single one). The thread index and block index data may help you figure out why the addresses are out of bounds (or you can use them to selectively add logging printfs).
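As a rough sketch of the selective logging idea, the kernel can print only from the thread coordinates named in the report. Everything here except the block and thread indices (taken from the first row of the GPU State table) is a hypothetical stand-in: demoKernel, its parameters, and the buffer size are not from the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel; only the logging pattern matters.
__global__ void demoKernel(double *out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    double *dst = out + (size_t)y * width + x;

    // Log only from the first thread listed in the memory checker report.
    if (blockIdx.x == 16 && blockIdx.y == 18 &&
        threadIdx.x == 13 && threadIdx.y == 4) {
        printf("block (%d,%d) thread (%d,%d): about to store to %p\n",
               blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y, (void *)dst);
    }
    *dst = 1.0;
}

int main()
{
    // Same launch shape as in the report: gridDim {32,32,1}, blockDim {16,16,1}.
    const int width = 32 * 16, height = 32 * 16;
    double *dev_out = nullptr;
    if (cudaMalloc(&dev_out, sizeof(double) * width * height) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    dim3 grid(32, 32), block(16, 16);
    demoKernel<<<grid, block>>>(dev_out, width);

    // Synchronizing reports kernel errors and flushes the device printf buffer.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(dev_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Printing the pointer just before the store makes a NULL or out-of-range address visible without needing the debugger.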

@njuffa
Thanks for the comment. But line 292 is the initialization of a pointer to double:

double *local_L = new double[SPEC_TOTAL_SAMPLES];

So I’m not trying to access a value in an array. And I think every thread should execute the exact same expression. So why do only those threads (the ones in the output) encounter the problem? Is it possible that it reaches the memory limit?

Assuming that line 292 is the correct line (and the posted output isn’t missing a few columns, with the line number actually being reported as 292x) it may be the case that the offending write is internal to new (allocation too large?). Or maybe the line number is misreported because this is a release build and not a debug build?

It is very difficult to give advice without seeing the actual code. If this were my code base, I would instrument the code in the vicinity heavily to find out what is going on. In my experience, cuda-memcheck never reports false positives for out-of-bounds accesses, so if it says there is a global store out of bounds, I believe it.

@njuffa.
Well, I found that the dynamic allocation does not seem to succeed, which causes the out-of-bounds access. I tried a static allocation instead:

double local_L[5];

and then it works fine. So maybe the memory does run out? If so, how should I fix it if I have to use dynamic allocation? Reducing the number of threads per block doesn’t seem to help.

Read about in-kernel dynamic allocation in the programming guide. It is limited by a heap size, which is modifiable.

I don’t recall the details of device-side “new”, but check whether there are applicable limits you can increase with cudaDeviceSetLimit().
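A minimal host-side sketch of raising the device heap limit follows. The runtime call is cudaDeviceSetLimit() with cudaLimitMallocHeapSize; the 64 MB value is an arbitrary assumption, not a recommendation. For scale: the launch in the report has 32*32 blocks of 16*16 threads, i.e. 262,144 threads, and 5 doubles per thread is 40 bytes, so roughly 10 MB of payload if every allocation were live at once. As far as I recall, the default heap is 8 MB, and each allocation also carries some overhead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the current device heap size used by in-kernel malloc()/new.
    size_t heapBytes = 0;
    cudaError_t err = cudaDeviceGetLimit(&heapBytes, cudaLimitMallocHeapSize);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("device heap before: %zu bytes\n", heapBytes);

    // Assumed value for illustration; size it to threads * bytes per thread
    // plus headroom. Set it before launching any kernel that allocates.
    const size_t wanted = 64u * 1024u * 1024u;
    err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, wanted);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }

    err = cudaDeviceGetLimit(&heapBytes, cudaLimitMallocHeapSize);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("device heap after:  %zu bytes\n", heapBytes);

    // ... launch the kernels that use in-kernel new/malloc here ...
    return 0;
}
```

If the array length is a small compile-time constant, as with SPEC_TOTAL_SAMPLES = 5 here, a fixed-size local array avoids the heap altogether.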

Also note that exception handling is not supported on the device side.
The invalid addresses in the report are all NULL, which is the value returned by a failed call to malloc(). On the host side a failed “new” would throw an exception, but as exceptions are not supported on the device, you get NULL returned from the “new” operator instead.
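Since the failure is only visible as a NULL result, the kernel has to check for it itself. The sketch below counts failed allocations and skips the stores in that case. allocKernel, failCount, and the launch shape are hypothetical names chosen for illustration, and it assumes, as described above, that a failed device-side “new” yields NULL.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread allocates n doubles from the device heap.
__global__ void allocKernel(int n, int *failCount)
{
    double *local = new double[n];
    if (local == nullptr) {
        // Allocation failed: record it and bail out instead of storing to NULL.
        atomicAdd(failCount, 1);
        return;
    }
    for (int i = 0; i < n; ++i)
        local[i] = 0.0;
    delete[] local;  // return the memory to the device heap
}

int main()
{
    int *dev_fail = nullptr;
    if (cudaMalloc(&dev_fail, sizeof(int)) != cudaSuccess ||
        cudaMemset(dev_fail, 0, sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "failed to set up the failure counter\n");
        return 1;
    }

    dim3 grid(32, 32), block(16, 16);  // same launch shape as in the report
    allocKernel<<<grid, block>>>(5, dev_fail);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(dev_fail);
        return 1;
    }

    int failures = 0;
    cudaMemcpy(&failures, dev_fail, sizeof(int), cudaMemcpyDeviceToHost);
    printf("threads with failed allocations: %d\n", failures);

    cudaFree(dev_fail);
    return 0;
}
```

A nonzero count points at heap exhaustion rather than an indexing bug, which is the cue to raise the heap limit or reduce per-thread allocation.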