Illegal memory access with V100-SXM2, but not K80 or GRID520

@striker159
I had trouble getting cuda-memcheck working, but succeeded by setting the environment variable CUDA_MEMCHECK_PATCH_MODULE=1, as discussed in this post.

Then, following the debugging guidance in this post and this post, I was able to track the error down to a specific line in one of our global functions.

The specific error was (and there were many other threads with this error):

========= Invalid global write of size 8
========= by thread (15, 0, 0) in block (856, 0, 0)
========= Address … is out of bounds

That global function is launched by:

blockSize = 256;
numBlocks = numSMs * 32
init_shift<<numBlocks, blockSize>>(ngy, ngx, grdCO_snc, grd_sync, grdCO, grd);

Here’s the global function with the line where the error occurs commented:

__global__
void init_shift(float ngy_d, float ngx_d, cuComplex *grdCO_snc_d, cuComplex *grd_snc_d, cuComplex *grdCO_d, cuComplex *grd_d)
{
   unsigned long int n = (unsigned long int)ngy_d * (unsigned long int)ngx_d;
   int index = blockIdx.x * blockDim.x + threadIdx.x;
   int stride = blockDim.x * gridDim.x;

   for (unsigned long int i = index; i < n; i += stride)
  {
     unsigned long int colcnt = (unsigned long int)(floor(i / ngy_d));
     unsigned long int rowcnt = (unsigned long int)(i - colcnt * ngy_d);

  // shift function
  unsigned long int xshift, yshift;
  if(colcnt < floor(ngx_d / 2))
  {
     xshift = colcnt + (unsigned long int)(ceil(ngx_d / 2));
  }
  else
  {
     xshift = colcnt - (unsigned long int)(floor(ngx_d / 2));
  }

  if(rowcnt < floor(ngy_d / 2))
  {
     yshift = rowcnt + (unsigned long int)(ceil(ngy_d / 2));
  }
  else
  {
     yshift = rowcnt - (unsigned long int)(floor(ngy_d / 2));
  }

  unsigned long int ishift = xshift * (unsigned long int)ngy_d + yshift;

  float realtemp = grdCO_snc_d[i].x;
  float imagtemp = grdCO_snc_d[i].y;

  grdCO_d[ishift].x = realtemp;
  grdCO_d[ishift].y = imagtemp; //***** ERROR OCCURS HERE *****

  realtemp = grd_snc_d[i].x;
  imagtemp = grd_snc_d[i].y;
  grd_d[ishift].x = realtemp;
  grd_d[ishift].y = imagtemp;
   }
}

I recognize that it is probably related to the independent thread scheduling used by Volta, but I’m not clear on how that plays out in this code.