Illegal memory access with V100-SXM2, but not K80 or GRID520

I’m trying to transition our CUDA code to a new AWS instance (P3) that uses a V100-SXM2 (Volta) GPU, but am now encountering illegal memory access errors with parts of our code. This code runs fine on a K80 (Tesla), and in the past also ran fine on a GRID520 (Kepler). I’m wondering if there’s something about the Volta architecture that requires a change in our CUDA code?

current configuration that works:
AWS p2.2xlarge
CentOS 7.9.2009
kernel 3.10.0-1160.6.1.el7.x86_64
K80 (Tesla)
CUDA 10.2 with patches 1 and 2
driver 440.33.01
g++ 4.8.5

new configuration that throws illegal memory access errors:
AWS p3.2xlarge
CentOS 7.9.2009
kernel 3.10.0-1160.6.1.el7.x86_64
V100-SXM2 (Volta)
CUDA 10.2 with patches 1 and 2
driver 440.118.02
g++ 4.8.5

I have also tried CUDA 11.1.1/driver 455.32.00 with the V100, but get the same errors.

Given that the code has worked fine on two previous, different GPUs, I’m reluctant to embark on a bug hunt since our CUDA code has not changed. I have also looked through the Volta Tuning Guide, but haven’t seen anything that seems relevant to our code. However, I recognize that I could be missing something important.

Did you try to use cuda-memcheck to locate the error?

Volta introduced Independent Thread Scheduling, which may break older code.

1.4.3. Independent Thread Scheduling Compatibility

The Volta architecture introduces Independent Thread Scheduling among threads in a warp. If the developer made assumptions about warp-synchronicity, this feature can alter the set of threads participating in the executed code compared to previous architectures. Please see Compute Capability 7.0 in the CUDA C++ Programming Guide for details and corrective actions. To aid migration, Volta developers can opt in to the Pascal scheduling model with the following combination of compiler options.

nvcc -arch=compute_60 -code=sm_70 …
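
For illustration (a hypothetical sketch, not taken from the code in question): a classic warp-synchronous reduction that implicitly assumed lockstep execution before Volta, and the explicit synchronization it needs now:

__device__ float warp_reduce_sum(float val)
{
    // Pre-Volta, code like this often used the unsynchronized __shfl_down()
    // and assumed all 32 threads of a warp execute each line together. Under
    // independent thread scheduling that assumption no longer holds, so the
    // _sync variants with an explicit member mask must be used instead.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}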

@striker159 Thanks for your response. After reading about independent thread scheduling in the Volta Tuning Guide, I tried opting in to the Pascal scheduling model during compilation, but this had no effect. I was having trouble getting cuda-memcheck to run on our system yesterday, so have not yet gotten any detailed information from that tool. I will try to get that running today to see if I can get any meaningful information that might help point to a solution and post back here.

@striker159
I had trouble getting cuda-memcheck working, but succeeded by setting the environment variable CUDA_MEMCHECK_PATCH_MODULE=1, as discussed in this post.

Then, following the debugging guidance in this post and this post, I was able to track the error down to a specific line in one of our global functions.

The specific error was (and there were many other threads with this error):

========= Invalid global write of size 8
========= by thread (15, 0, 0) in block (856, 0, 0)
========= Address … is out of bounds

That global function is launched by:

blockSize = 256;
numBlocks = numSMs * 32;
init_shift<<<numBlocks, blockSize>>>(ngy, ngx, grdCO_snc, grd_sync, grdCO, grd);
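
For context, numSMs would typically be queried from the device, e.g. (assuming device 0):

int numSMs;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0); // SM count of device 0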

Here’s the global function with the line where the error occurs commented:

__global__
void init_shift(float ngy_d, float ngx_d, cuComplex *grdCO_snc_d, cuComplex *grd_snc_d, cuComplex *grdCO_d, cuComplex *grd_d)
{
   unsigned long int n = (unsigned long int)ngy_d * (unsigned long int)ngx_d;
   int index = blockIdx.x * blockDim.x + threadIdx.x;
   int stride = blockDim.x * gridDim.x;

   for (unsigned long int i = index; i < n; i += stride)
   {
      unsigned long int colcnt = (unsigned long int)(floor(i / ngy_d));
      unsigned long int rowcnt = (unsigned long int)(i - colcnt * ngy_d);

      // shift function
      unsigned long int xshift, yshift;
      if(colcnt < floor(ngx_d / 2))
      {
         xshift = colcnt + (unsigned long int)(ceil(ngx_d / 2));
      }
      else
      {
         xshift = colcnt - (unsigned long int)(floor(ngx_d / 2));
      }

      if(rowcnt < floor(ngy_d / 2))
      {
         yshift = rowcnt + (unsigned long int)(ceil(ngy_d / 2));
      }
      else
      {
         yshift = rowcnt - (unsigned long int)(floor(ngy_d / 2));
      }

      unsigned long int ishift = xshift * (unsigned long int)ngy_d + yshift;

      float realtemp = grdCO_snc_d[i].x;
      float imagtemp = grdCO_snc_d[i].y;

      grdCO_d[ishift].x = realtemp;
      grdCO_d[ishift].y = imagtemp; //***** ERROR OCCURS HERE *****

      realtemp = grd_snc_d[i].x;
      imagtemp = grd_snc_d[i].y;
      grd_d[ishift].x = realtemp;
      grd_d[ishift].y = imagtemp;
   }
}

I recognize that it is probably related to the independent thread scheduling used by Volta, but I’m not clear on how that plays out in this code.

grdCO_d[ishift].y = imagtemp; //***** ERROR OCCURS HERE *****

Print the index (ishift) and trace the problem back from there. Note that not all out-of-bounds errors may be detected reliably, so there may have been a latent bug that has been exposed now.
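
For example, a minimal in-kernel check along these lines (just a sketch, using n and ishift from your kernel):

// Debugging sketch: report any index that would land outside the n elements
if (ishift >= n) {
    printf("block %u thread %u: i=%lu ishift=%lu\n",
           blockIdx.x, threadIdx.x, i, ishift);
}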

Maybe it is an off-by-one error that occurs only occasionally because the index is derived from floating-point computation involving ngx_d and ngy_d? Could the index computation be changed to all-integer computation?
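
For instance, assuming ngy_d always holds an exact integral value (so a one-time conversion is lossless), the row/column decomposition could be done entirely in integer arithmetic:

// All-integer index math: convert once, then use integer div/mod
unsigned long int ngy = (unsigned long int)ngy_d;  // exact for integral ngy_d < 2^24
unsigned long int colcnt = i / ngy;
unsigned long int rowcnt = i % ngy;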

I’m reluctant to embark on a bug hunt

I see no reason for that reluctance. Unless code has been proven correct, there is always some chance that it just happens to work, rather than work by design.

@njuffa Well, I was reluctant to embark on a bug hunt because this code has worked well for several years across several different GPUs and CUDA toolkit versions. That said, I understand your point that a latent bug could have been exposed via a change in GPU architecture.

Following your suggestions, I used some printf statements inside the kernel and found that ishift goes way out of bounds for some iterations of our for loop. I was able to trace the error back to the second else statement, where yshift gets calculated as an extremely large value for some iterations.

yshift = rowcnt - (unsigned long int)(floor(ngy_d / 2));

In an effort to further track the error, I tried adding some more printf statements inside that else statement, but after the yshift calculation. Oddly, the addition of the printf statement enabled the code to run to completion with no errors. I don’t really understand this, but it feels like a thread synchronization issue. To test this, I replaced the printf with __syncthreads(), and this also ran error free. Running the cuda-memcheck synccheck tool revealed no errors, which I think suggests that the addition of __syncthreads() is appropriate.

Although I’ve worked with this specific code for a while now, I’m definitely a CUDA novice. As such, I don’t fully understand why this remedy worked. Does it make sense, or have I just gotten lucky?

The use of printf() or __syncthreads() will change the generated code, and in particular could affect the order of specific instructions. This could, for example, hide a race condition or a code generation bug. All operations in the particular misbehaving computation map to simple machine instructions which makes a code generation bug unlikely (but doesn’t rule it out completely). I don’t see the need for a __syncthreads() call anywhere in this kernel, so would advise against adding it.

You say that yshift sometimes turns into an extremely large value. What is that value? Is it always the same value? For example, it could be indicative of an overflow in the float to unsigned long int conversion. Does the large value of yshift occur for particular values of ngy_d? This variable is a kernel argument, you could print it from host code.

BTW, have you checked on the value of rowcnt? It seems to me that i - colcnt * ngy_d could occasionally return a negative result due to rounding in the floating-point computation. A negative result would turn into a large positive value when assigned to an unsigned integer type. A float only carries 24 bits of precision, which may not be sufficient when trying to accurately operate on (at least!) 32-bit integers. The mixing of float and integer computation in this code makes me nervous to say the least. In the multiplication colcnt * ngy_d, colcnt is converted and rounded to a float with potential loss of information.

Quick sanity check: You are not compiling this code with -prec-div=false or -use_fast_math, correct?

A word of general advice: CUDA follows the type system of the host platform. long int and unsigned long int are 32-bit types on Windows platforms and 64-bit types on Linux platforms. Due to this ambiguity, I would strongly advise against use of long types in CUDA code. Use long long types where you need 64-bit integers, otherwise use plain {unsigned} int. I would assume an unsigned int would be sufficient here, but this is your use case not mine.
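
As a quick illustration of the mismatch (sizes assumed for the common LP64/LLP64 data models):

// On 64-bit Linux (LP64): sizeof(long) == 8. On 64-bit Windows (LLP64):
// sizeof(long) == 4. long long is 64 bits on both, so prefer it when a
// 64-bit integer is required:
static_assert(sizeof(long long) >= 8, "long long is at least 64 bits");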

A quick (but not conclusive) check on a potential code generation issue would be to lower the ptxas (compiler backend) optimization level. The nvcc default is -Xptxas -O3. You can try lowering that to -Xptxas -O2 etc. down to -O0. If the problem goes away as the optimization level is lowered, that could hint at a code generation issue, but just as with an inserted function call it could also just sufficiently modify the instruction sequence to mask some other issue.

I compiled the code for sm_70 with CUDA 9.2, and the generated machine code is unsuitable for human consumption without investing several hours into back annotation.

Below is a simple C test program that shows how the rowcnt computation can fail due to the limited precision of float computation. I ran this on a Windows machine where sizeof (unsigned long) == 4, and ensured that ngx and ngy are integers < 2^16. The first failing case it reports is:

ngx=63790 ngy=7736 n=493479440 i=253500969 colcnt=32769 rowcnt=4294967280

Note that the problem could be exacerbated when unsigned long is a 64-bit type, as the precision mismatch with float would be even more severe for large values of i.

Typically ngx and ngy denote the “number of grid points in x-/y-direction”, so the most appropriate type for these would be an integer type, and the entire indexing computation should be integer based.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>   /* for floorf(); the original used Intel's <mathimf.h> */

// George Marsaglia's KISS PRNG, period 2**123. Newsgroup sci.math, 21 Jan 1999
// Bug fix: Greg Rose, "KISS: A Bit Too Simple" http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629;
static uint32_t kiss_jsr=123456789, kiss_jcong=380116160;
#define znew (kiss_z=36969*(kiss_z&65535)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&65535)+(kiss_w>>16))
#define MWC  ((znew<<16)+wnew )
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17), \
              kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+1234567)
#define KISS ((MWC^CONG)+SHR3)

int main (void) 
{
    float ngx, ngy;
    unsigned long int i, n, colcnt, rowcnt;

    do {
        do {
            ngx = KISS & 0xffff;
            ngy = KISS & 0xffff;
        } while ((ngy * ngx) >= 0x1.0p32);
        
        n = (unsigned long int)ngy * (unsigned long int)ngx;
        
        for (i = 1; i < n; i++) {
            colcnt = (unsigned long int)(floorf (i / ngy));
            rowcnt = (unsigned long int)(i - colcnt * ngy);
            
            if (rowcnt > 0x10000000) {
                printf ("ngx=%.0f ngy=%.0f n=%lu i=%lu colcnt=%lu (raw_colcount=%15.8e), rowcnt=%lu (raw_rowcount =%15.8e)\n", 
                        ngx, ngy, n, i, colcnt, floorf (i / ngy), rowcnt, i - colcnt * ngy);
                return EXIT_FAILURE;
            }
        }
    } while (1);
    return EXIT_SUCCESS;
}

Thanks to @njuffa's list of possible problems, as well as his thorough explanations, I think I have tracked the issue down to a rounding error in the floating-point computation. As noted, in problematic iterations of the for loop, ishift would become a very large number.

The tipoff was this statement:
It seems to me that i - colcnt * ngy_d could occasionally return a negative result due to rounding in the floating-point computation. A negative result would turn into a large positive value when assigned to an unsigned integer type.

Indeed, ishift in all cases was very close to 18,446,744,073,709,551,615, or the upper limit of an unsigned long int on Linux platforms (64 bits). Tracking back from ishift, the problem began with the calculation of colcnt and was propagated through to rowcnt, then yshift and ishift. Based on possible values for i and ngy_d, colcnt should never fall below zero. Despite that, I think a casting error from float to unsigned long int must be responsible for it going negative, resulting in it wrapping around to a very large integer value.

To remedy this, I changed the colcnt and rowcnt calculations to:

unsigned int colcnt = static_cast<unsigned int>(i / ngy_d);
unsigned int rowcnt = static_cast<unsigned int>(i - colcnt * static_cast<unsigned int>(ngy_d));

These changes alone fixed the error, but I also followed the advice to use unsigned int and unsigned long long int instead of unsigned long int. Here is the reworked code:

__global__
void init_shift(float ngy_d, float ngx_d, cuComplex *grdCO_snc_d, cuComplex *grd_snc_d, cuComplex *grdCO_d, cuComplex *grd_d)
{
   // widen first so the 32-bit product cannot wrap before the assignment
   unsigned long long int n = static_cast<unsigned long long int>(static_cast<unsigned int>(ngy_d)) * static_cast<unsigned int>(ngx_d);
   int index = blockIdx.x * blockDim.x + threadIdx.x;
   int stride = blockDim.x * gridDim.x;

   for (unsigned long long int i = index; i < n; i += stride)
   {
      unsigned int colcnt = static_cast<unsigned int>(i / ngy_d);
      unsigned int rowcnt = static_cast<unsigned int>(i - colcnt * static_cast<unsigned int>(ngy_d));

      // shift function
      unsigned int xshift, yshift;
      if(colcnt < floor(ngx_d / 2))
      {
         xshift = colcnt + static_cast<unsigned int>(ceil(ngx_d / 2));
      }
      else
      {
         xshift = colcnt - static_cast<unsigned int>(floor(ngx_d / 2));
      }

      if(rowcnt < floor(ngy_d / 2))
      {
         yshift = rowcnt + static_cast<unsigned int>(ceil(ngy_d / 2));
      }
      else
      {
         yshift = rowcnt - static_cast<unsigned int>(floor(ngy_d / 2));
      }

      unsigned int ishift = xshift * static_cast<unsigned int>(ngy_d) + yshift;

      float realtemp = grdCO_snc_d[i].x;
      float imagtemp = grdCO_snc_d[i].y;

      grdCO_d[ishift].x = realtemp;
      grdCO_d[ishift].y = imagtemp;

      realtemp = grd_snc_d[i].x;
      imagtemp = grd_snc_d[i].y;
      grd_d[ishift].x = realtemp;
      grd_d[ishift].y = imagtemp;
   }
}

And just to be thorough,

Quick sanity check: You are not compiling this code with -prec-div=false or -use_fast_math, correct?
That’s correct, I do not compile with either of those options.

A quick (but not conclusive) check on a potential code generation issue would be to lower the ptxas (compiler backend) optimization level. The nvcc default is -Xptxas -O3. You can try lowering that to -Xptxas -O2 etc. down to -O0. If the problem goes away as the optimization level is lowered, that could hint at a code generation issue, but just as with an inserted function call it could also just sufficiently modify the instruction sequence to mask some other issue.
I did not try changing the ptxas optimization level, but this is good to know as I continue to debug this code.

Thank you again for all your help. You’ve helped me learn a lot in the process.

Glad you were able to sort it out.

The problem is not in the casting. It is the insufficient precision of float and the rounding in floating-point operations. Going back to my earlier example:

ngx=63790 ngy=7736 n=493479440 i=253500969
colcnt=32769 (raw_colcount=3.27690000e+004)
rowcnt=4294967280 (raw_rowcount =-1.60000000e+001)

The “raw” values are the float values prior to the casts to unsigned long int. Both the raw colcnt and rowcnt are incorrect, and the latter winds up negative. Hilarity ensues.
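
For the curious, here is a small host-side trace of that failing case (a sketch; the commented values assume standard IEEE-754 single-precision arithmetic):

#include <stdio.h>

int main (void)
{
    // i = 253500969 has no exact float representation (24-bit significand)
    float fi   = 253500969;        // rounds to 253500976.0f
    float q    = fi / 7736.0f;     // rounds up to 32769.0f; the true floor is 32768
    float prod = 32769 * 7736.0f;  // exact product 253500984 rounds to 253500992.0f
    float raw  = fi - prod;        // -16.0f, the raw_rowcount shown above
    // in the test program, converting this -16.0f to a 32-bit unsigned long
    // produced the wrapped value 4294967280
    printf ("fi=%.1f q=%.1f prod=%.1f raw=%.1f\n", fi, q, prod, raw);
    return 0;
}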

Integer computation can be replaced by float computation only when the magnitude of source operands and results is tightly constrained (I think |x| < 2^23 may be sufficient, but haven't looked at it with sufficient rigor), which is not the case here.

In terms of best practices, consider decorating your pointer arguments with const and __restrict__ as appropriate. It’s probably not going to help here since I would expect the code’s performance to be bound by memory bandwidth, but it is a good habit to get into.
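
Applied to the kernel above, that might look like this (a sketch, assuming the two source arrays are read-only and none of the four pointers alias):

__global__
void init_shift(float ngy_d, float ngx_d,
                const cuComplex * __restrict__ grdCO_snc_d,  // read-only input
                const cuComplex * __restrict__ grd_snc_d,    // read-only input
                cuComplex * __restrict__ grdCO_d,            // write-only output
                cuComplex * __restrict__ grd_d)              // write-only output
{
   // ... body unchanged ...
}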
