Illegal memory access with V100-SXM2, but not K80 or GRID520

I’m trying to transition our CUDA code to a new AWS instance (P3) that uses a V100-SXM2 (Volta) GPU, but am now encountering illegal memory access errors with parts of our code. This code runs fine on a K80 (Tesla), and in the past also ran fine on a GRID520 (Kepler). I’m wondering if there’s something about the Volta architecture that requires a change in our CUDA code?

current configuration that works:
AWS p2.2xlarge
CentOS 7.9.2009
kernel 3.10.0-1160.6.1.el7.x86_64
K80 (Tesla)
CUDA 10.2 with patches 1 and 2
driver 440.33.01
g++ 4.8.5

new configuration that throws illegal memory access errors:
AWS p3.2xlarge
CentOS 7.9.2009
kernel 3.10.0-1160.6.1.el7.x86_64
V100-SXM2 (Volta)
CUDA 10.2 with patches 1 and 2
driver 440.118.02
g++ 4.8.5

I have also tried CUDA 11.1.1/driver 455.32.00 with the V100, but get the same errors.

Given that the code has worked fine on two previous, different GPUs, I’m reluctant to embark on a bug hunt since our CUDA code has not changed. I have also looked through the Volta Tuning Guide, but haven’t seen anything that seems relevant to our code. However, I recognize that I could be missing something important.

Did you try to use cuda-memcheck to locate the error?

Volta introduced Independent Thread Scheduling, which may break older code.

1.4.3. Independent Thread Scheduling Compatibility

The Volta architecture introduces Independent Thread Scheduling among threads in a warp. If the developer made assumptions about warp-synchronicity, this feature can alter the set of threads participating in the executed code compared to previous architectures. Please see Compute Capability 7.0 in the CUDA C++ Programming Guide for details and corrective actions. To aid migration, Volta developers can opt in to the Pascal scheduling model with the following combination of compiler options.

nvcc -arch=compute_60 -code=sm_70 …
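
For illustration (a hypothetical sketch, not taken from the code in question): a classic warp-synchronous reduction that implicitly assumed lockstep execution before Volta, and the explicit synchronization it needs now:

__device__ float warp_reduce_sum(float val)
{
    // Pre-Volta, code like this often used the unsynchronized __shfl_down()
    // and assumed all 32 threads of a warp execute each line together. Under
    // independent thread scheduling that assumption no longer holds, so the
    // _sync variants with an explicit member mask must be used instead.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}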

@striker159 Thanks for your response. After reading about independent thread scheduling in the Volta Tuning Guide, I tried opting in to the Pascal scheduling model during compilation, but this had no effect. I was having trouble getting cuda-memcheck to run on our system yesterday, so have not yet gotten any detailed information from that tool. I will try to get that running today to see if I can get any meaningful information that might help point to a solution and post back here.

@striker159
I had trouble getting cuda-memcheck working, but succeeded by setting the environment variable CUDA_MEMCHECK_PATCH_MODULE=1, as discussed in this post.

Then, following the debugging guidance in this post and this post, I was able to track the error down to a specific line in one of our global functions.

The specific error was (and there were many other threads with this error):

========= Invalid global write of size 8
========= by thread (15, 0, 0) in block (856, 0, 0)
========= Address … is out of bounds

That global function is launched by:

blockSize = 256;
numBlocks = numSMs * 32;
init_shift<<<numBlocks, blockSize>>>(ngy, ngx, grdCO_snc, grd_sync, grdCO, grd);
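
For context, numSMs would typically be queried from the device, e.g. (assuming device 0):

int numSMs;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0); // SM count of device 0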

Here’s the global function with the line where the error occurs commented:

__global__
void init_shift(float ngy_d, float ngx_d, cuComplex *grdCO_snc_d, cuComplex *grd_snc_d, cuComplex *grdCO_d, cuComplex *grd_d)
{
   unsigned long int n = (unsigned long int)ngy_d * (unsigned long int)ngx_d;
   int index = blockIdx.x * blockDim.x + threadIdx.x;
   int stride = blockDim.x * gridDim.x;

   for (unsigned long int i = index; i < n; i += stride)
   {
      unsigned long int colcnt = (unsigned long int)(floor(i / ngy_d));
      unsigned long int rowcnt = (unsigned long int)(i - colcnt * ngy_d);

      // shift function
      unsigned long int xshift, yshift;
      if(colcnt < floor(ngx_d / 2))
      {
         xshift = colcnt + (unsigned long int)(ceil(ngx_d / 2));
      }
      else
      {
         xshift = colcnt - (unsigned long int)(floor(ngx_d / 2));
      }

      if(rowcnt < floor(ngy_d / 2))
      {
         yshift = rowcnt + (unsigned long int)(ceil(ngy_d / 2));
      }
      else
      {
         yshift = rowcnt - (unsigned long int)(floor(ngy_d / 2));
      }

      unsigned long int ishift = xshift * (unsigned long int)ngy_d + yshift;

      float realtemp = grdCO_snc_d[i].x;
      float imagtemp = grdCO_snc_d[i].y;

      grdCO_d[ishift].x = realtemp;
      grdCO_d[ishift].y = imagtemp; //***** ERROR OCCURS HERE *****

      realtemp = grd_snc_d[i].x;
      imagtemp = grd_snc_d[i].y;
      grd_d[ishift].x = realtemp;
      grd_d[ishift].y = imagtemp;
   }
}

I recognize that it is probably related to the independent thread scheduling used by Volta, but I’m not clear on how that plays out in this code.

grdCO_d[ishift].y = imagtemp; //***** ERROR OCCURS HERE *****

Print the index (ishift) and trace the problem back from there. Note that not all out-of-bounds errors may be detected reliably, so there may have been a latent bug that has been exposed now.
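
For example, a minimal in-kernel check along these lines (just a sketch, using n and ishift from your kernel):

// Debugging sketch: report any index that would land outside the n elements
if (ishift >= n) {
    printf("block %u thread %u: i=%lu ishift=%lu\n",
           blockIdx.x, threadIdx.x, i, ishift);
}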

Maybe it is an off-by-one error that occurs only occasionally because the index is derived from floating-point computation involving ngx_d and ngy_d? Could the index computation be changed to all-integer computation?
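
For instance, assuming ngy_d always holds an exact integral value (so a one-time conversion is lossless), the row/column decomposition could be done entirely in integer arithmetic:

// All-integer index math: convert once, then use integer div/mod
unsigned long int ngy = (unsigned long int)ngy_d;  // exact for integral ngy_d < 2^24
unsigned long int colcnt = i / ngy;
unsigned long int rowcnt = i % ngy;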

I’m reluctant to embark on a bug hunt

I see no reason for that reluctance. Unless code has been proven correct, there is always some chance that it just happens to work, rather than work by design.

@njuffa Well, I was reluctant to embark on a bug hunt because this code has worked well for several years across several different GPUs and CUDA toolkit versions. That said, I understand your point that a latent bug could have been exposed via a change in GPU architecture.

Following your suggestions, I used some printf statements inside the kernel and found that ishift goes way out of bounds for some iterations of our for loop. I was able to trace the error back to the second else statement, where yshift gets calculated as an extremely large value for some iterations.

yshift = rowcnt - (unsigned long int)(floor(ngy_d / 2));

In an effort to further track the error, I tried adding some more printf statements inside that else statement, but after the yshift calculation. Oddly, the addition of the printf statement enabled the code to run to completion with no errors. I don’t really understand this, but it feels like a thread synchronization issue. To test this, I replaced the printf with __syncthreads(), and this also ran error free. Running the cuda-memcheck synccheck tool revealed no errors, which I think suggests that the addition of __syncthreads() is appropriate.

Although I’ve worked with this specific code for a while now, I’m definitely a CUDA novice. As such, I don’t fully understand why this remedy worked. Does it make sense, or have I just gotten lucky?

The use of printf() or __syncthreads() will change the generated code, and in particular could affect the order of specific instructions. This could, for example, hide a race condition or a code generation bug. All operations in the particular misbehaving computation map to simple machine instructions which makes a code generation bug unlikely (but doesn’t rule it out completely). I don’t see the need for a __syncthreads() call anywhere in this kernel, so would advise against adding it.

You say that yshift sometimes turns into an extremely large value. What is that value? Is it always the same value? For example, it could be indicative of an overflow in the float to unsigned long int conversion. Does the large value of yshift occur for particular values of ngy_d? This variable is a kernel argument, you could print it from host code.

BTW, have you checked on the value of rowcnt? It seems to me that i - colcnt * ngy_d could occasionally return a negative result due to rounding in the floating-point computation. A negative result would turn into a large positive value when assigned to an unsigned integer type. A float only carries 24 bits of precision, which may not be sufficient when trying to accurately operate on (at least!) 32-bit integers. The mixing of float and integer computation in this code makes me nervous to say the least. In the multiplication colcnt * ngy_d, colcnt is converted and rounded to a float with potential loss of information.

Quick sanity check: You are not compiling this code with -prec-div=false or -use_fast_math, correct?

A word of general advice: CUDA follows the type system of the host platform. long int and unsigned long int are 32-bit types on Windows platforms and 64-bit types on Linux platforms. Due to this ambiguity, I would strongly advise against use of long types in CUDA code. Use long long types where you need 64-bit integers, otherwise use plain {unsigned} int. I would assume an unsigned int would be sufficient here, but this is your use case not mine.
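
As a quick illustration of the mismatch (sizes assumed for the common LP64/LLP64 data models):

// On 64-bit Linux (LP64): sizeof(long) == 8. On 64-bit Windows (LLP64):
// sizeof(long) == 4. long long is 64 bits on both, so prefer it when a
// 64-bit integer is required:
static_assert(sizeof(long long) >= 8, "long long is at least 64 bits");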

A quick (but not conclusive) check on a potential code generation issue would be to lower the ptxas (compiler backend) optimization level. The nvcc default is -Xptxas -O3. You can try lowering that to -Xptxas -O2 etc. down to -O0. If the problem goes away as the optimization level is lowered, that could hint at a code generation issue, but just as with an inserted function call it could also just sufficiently modify the instruction sequence to mask some other issue.

I compiled the code for sm_70 with CUDA 9.2, and the generated machine code is unsuitable for human consumption without investing several hours into back annotation.

Below is a simple C test program that shows how the rowcnt computation can fail due to the limited precision of float computation. I ran this on a Windows machine where sizeof (unsigned long) == 4, and ensured that ngx and ngy are integers < 2^16. The first failing case it reports is:

ngx=63790 ngy=7736 n=493479440 i=253500969 colcnt=32769 rowcnt=4294967280

Note that the problem could be exacerbated when unsigned long is a 64-bit type, as the precision mismatch with float would be even more severe for large values of i.

Typically ngx and ngy denote the “number of grid points in x-/y-direction”, so the most appropriate type for these would be an integer type, and the entire indexing computation should be integer based.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>   /* for floorf(); the original used Intel's <mathimf.h> */

// George Marsaglia's KISS PRNG, period 2**123. Newsgroup sci.math, 21 Jan 1999
// Bug fix: Greg Rose, "KISS: A Bit Too Simple" http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629;
static uint32_t kiss_jsr=123456789, kiss_jcong=380116160;
#define znew (kiss_z=36969*(kiss_z&65535)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&65535)+(kiss_w>>16))
#define MWC  ((znew<<16)+wnew )
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17), \
              kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+1234567)
#define KISS ((MWC^CONG)+SHR3)

int main (void) 
{
    float ngx, ngy;
    unsigned long int i, n, colcnt, rowcnt;

    do {
        do {
            ngx = KISS & 0xffff;
            ngy = KISS & 0xffff;
        } while ((ngy * ngx) >= 0x1.0p32);
        
        n = (unsigned long int)ngy * (unsigned long int)ngx;
        
        for (i = 1; i < n; i++) {
            colcnt = (unsigned long int)(floorf (i / ngy));
            rowcnt = (unsigned long int)(i - colcnt * ngy);
            
            if (rowcnt > 0x10000000) {
                printf ("ngx=%.0f ngy=%.0f n=%lu i=%lu colcnt=%lu (raw_colcount=%15.8e), rowcnt=%lu (raw_rowcount =%15.8e)\n", 
                        ngx, ngy, n, i, colcnt, floorf (i / ngy), rowcnt, i - colcnt * ngy);
                return EXIT_FAILURE;
            }
        }
    } while (1);
    return EXIT_SUCCESS;
}

Thanks to @njuffa's list of possible problems, as well as his thorough explanations, I think I have tracked the issue down to a rounding error in the floating-point computation. As noted, in problematic iterations of the for loop, ishift would become a very large number.

The tipoff was this statement:
It seems to me that i - colcnt * ngy_d could occasionally return a negative result due to rounding in the floating-point computation. A negative result would turn into a large positive value when assigned to an unsigned integer type.

Indeed, ishift in all cases was very close to 18,446,744,073,709,551,615, or the upper limit of an unsigned long int on Linux platforms (64 bits). Tracking back from ishift, the problem began with the calculation of colcnt and was propagated through to rowcnt, then yshift and ishift. Based on possible values for i and ngy_d, colcnt should never fall below zero. Despite that, I think a casting error from float to unsigned long int must be responsible for it going negative, resulting in it wrapping around to a very large integer value.

To remedy this, I changed the colcnt and rowcnt calculations to:

unsigned int colcnt = static_cast<unsigned int>(i / ngy_d);
unsigned int rowcnt = static_cast<unsigned int>(i - colcnt * static_cast<unsigned int>(ngy_d));

These changes alone fixed the error, but I also followed the advice to use unsigned int and unsigned long long int instead of unsigned long int. Here is the reworked code:

__global__
void init_shift(float ngy_d, float ngx_d, cuComplex *grdCO_snc_d, cuComplex *grd_snc_d, cuComplex *grdCO_d, cuComplex *grd_d)
{
   // widen first so the 32-bit product cannot wrap before the assignment
   unsigned long long int n = static_cast<unsigned long long int>(static_cast<unsigned int>(ngy_d)) * static_cast<unsigned int>(ngx_d);
   int index = blockIdx.x * blockDim.x + threadIdx.x;
   int stride = blockDim.x * gridDim.x;

   for (unsigned long long int i = index; i < n; i += stride)
   {
      unsigned int colcnt = static_cast<unsigned int>(i / ngy_d);
      unsigned int rowcnt = static_cast<unsigned int>(i - colcnt * static_cast<unsigned int>(ngy_d));

      // shift function
      unsigned int xshift, yshift;
      if(colcnt < floor(ngx_d / 2))
      {
         xshift = colcnt + static_cast<unsigned int>(ceil(ngx_d / 2));
      }
      else
      {
         xshift = colcnt - static_cast<unsigned int>(floor(ngx_d / 2));
      }

      if(rowcnt < floor(ngy_d / 2))
      {
         yshift = rowcnt + static_cast<unsigned int>(ceil(ngy_d / 2));
      }
      else
      {
         yshift = rowcnt - static_cast<unsigned int>(floor(ngy_d / 2));
      }

      unsigned int ishift = xshift * static_cast<unsigned int>(ngy_d) + yshift;

      float realtemp = grdCO_snc_d[i].x;
      float imagtemp = grdCO_snc_d[i].y;

      grdCO_d[ishift].x = realtemp;
      grdCO_d[ishift].y = imagtemp;

      realtemp = grd_snc_d[i].x;
      imagtemp = grd_snc_d[i].y;
      grd_d[ishift].x = realtemp;
      grd_d[ishift].y = imagtemp;
   }
}

And just to be thorough,

Quick sanity check: You are not compiling this code with -prec-div=false or -use_fast_math, correct?
That’s correct, I do not compile with either of those options.

A quick (but not conclusive) check on a potential code generation issue would be to lower the ptxas (compiler backend) optimization level. The nvcc default is -Xptxas -O3. You can try lowering that to -Xptxas -O2 etc. down to -O0. If the problem goes away as the optimization level is lowered, that could hint at a code generation issue, but just as with an inserted function call it could also just sufficiently modify the instruction sequence to mask some other issue.
I did not try changing the ptxas optimization level, but this is good to know as I continue to debug this code.

Thank you again for all your help. You’ve helped me learn a lot in the process.

Glad you were able to sort it out.

The problem is not in the casting. It is the insufficient precision of float and the rounding in floating-point operations. Going back to my earlier example:

ngx=63790 ngy=7736 n=493479440 i=253500969
colcnt=32769 (raw_colcount=3.27690000e+004)
rowcnt=4294967280 (raw_rowcount =-1.60000000e+001)

The “raw” values are the float values prior to the casts to unsigned long int. Both the raw colcnt and rowcnt are incorrect, and the latter winds up negative. Hilarity ensues.
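
For the curious, here is a small host-side trace of that failing case (a sketch; the commented values assume standard IEEE-754 single-precision arithmetic):

#include <stdio.h>

int main (void)
{
    // i = 253500969 has no exact float representation (24-bit significand)
    float fi   = 253500969;        // rounds to 253500976.0f
    float q    = fi / 7736.0f;     // rounds up to 32769.0f; the true floor is 32768
    float prod = 32769 * 7736.0f;  // exact product 253500984 rounds to 253500992.0f
    float raw  = fi - prod;        // -16.0f, the raw_rowcount shown above
    // in the test program, converting this -16.0f to a 32-bit unsigned long
    // produced the wrapped value 4294967280
    printf ("fi=%.1f q=%.1f prod=%.1f raw=%.1f\n", fi, q, prod, raw);
    return 0;
}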

Integer computation can be replaced by float computation only when the magnitude of source operands and results is tightly constrained (I think |x| < 2^23 may be sufficient, but haven't looked at it with sufficient rigor), which is not the case here.

In terms of best practices, consider decorating your pointer arguments with const and __restrict__ as appropriate. It’s probably not going to help here since I would expect the code’s performance to be bound by memory bandwidth, but it is a good habit to get into.
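
Applied to the kernel above, that might look like this (a sketch, assuming the two source arrays are read-only and none of the four pointers alias):

__global__
void init_shift(float ngy_d, float ngx_d,
                const cuComplex * __restrict__ grdCO_snc_d,  // read-only input
                const cuComplex * __restrict__ grd_snc_d,    // read-only input
                cuComplex * __restrict__ grdCO_d,            // write-only output
                cuComplex * __restrict__ grd_d)              // write-only output
{
   // ... body unchanged ...
}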
