Out of bound memory read in kernel does not result in error

nastaran.kianersi · May 20, 2024, 2:46pm

I noticed (by chance) that if an out-of-bound memory read happens inside a kernel, the kernel will not return a cudaError. I used this code snippet for tests on Windows 11 with CUDA Toolkit v12.4 and an RTX 2050 GPU (cc 8.6).

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <iostream>
#include <stdio.h>
#include <vector>

#define gpuErrchk() { gpuAssert(__FILE__, __LINE__); }
inline void gpuAssert(const char* file, int line, bool abort = true)
{
    cudaDeviceSynchronize();
    cudaError_t code = cudaGetLastError();
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ void copy(int* output, const int* input, const int len)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    while (i < len)
    {
        output[i] = input[i];
        i += (blockDim.x * gridDim.x);
    }
}

int main()
{
    cudaError_t cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        return 1;
    }

    const int len{ (1 << 20) };
    int* input = new int[len];
    for (int i{}; i < len; ++i) input[i] = 5;
    int* output = new int[len];
    
    int* d_input, * d_output;
    cudaMalloc(&d_input, sizeof(int) * len);
    cudaMalloc(&d_output, sizeof(int) * len);
    gpuErrchk();

    cudaMemcpy(d_input, input, len * sizeof(int), cudaMemcpyHostToDevice);
    copy<<<32, 768>>>(d_output, d_input, len);
    gpuErrchk();

    cudaMemcpy(output, d_output, len * sizeof(int), cudaMemcpyDeviceToHost);
    gpuErrchk();

    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

The above code works fine and no error is expected. Now I change the code so that an out-of-bound memory access (a read) happens, and I expect the gpuErrchk() after the copy kernel to report “an illegal memory access was encountered”. I change

copy<<<32, 768>>>(d_output, d_input, len);

To

copy<<<32, 768>>>(d_output, d_input + 1, len);

and I expect the illegal mem access error because there’s an out-of-bound memory read but this doesn’t happen. However, if I do the exact same thing on the output, i.e. change the kernel call to

copy<<<32, 768>>>(d_output + 1, d_input, len);

it results in an illegal memory access error! What’s the deal here? If it had been pure C++, I would’ve said it’s undefined behavior and we may or may not get a seg fault. But as for CUDA, I thought these errors were detected by the cudaGetLastError() API.

So to wrap it up, can anyone explain this observation? Is it my error checking method, some wrong expectation, my platform, or is it just a bug?

Thank you.

Robert_Crovella · May 20, 2024, 4:19pm

The GPU doesn’t detect an out-of-bounds access at the hardware level unless the out-of-bounds extent is large enough to cross some unpublished boundary, and calling cudaGetLastError() doesn’t change that. So conceptually, it is similar to your statement about “pure C++”: it is undefined behavior, and you may or may not get a runtime-detected error.

But the compute-sanitizer tool does additional checking and should flag any out of bounds error, no matter how small.
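
As a minimal sketch of that workflow (the file name `oob_read.cu` and the kernel name `read_one_past` are made up for illustration), the program below performs a single read one element past a `cudaMalloc` allocation. Run directly, it may well report no error, as described above; run under `compute-sanitizer`, whose default tool is memcheck, the read is expected to be flagged.

```cuda
// oob_read.cu (hypothetical file name)
// Build: nvcc -o oob_read oob_read.cu
// Run:   ./oob_read                    (may report no error at all)
//        compute-sanitizer ./oob_read  (memcheck is the default tool)
#include <cstdio>
#include <cuda_runtime.h>

// Reads in[idx]; the caller decides whether idx is inside the buffer.
__global__ void read_one_past(const int* in, int* out, int idx)
{
    *out = in[idx];
}

int main()
{
    const int n = 1 << 20;
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));

    // Deliberate out-of-bounds read: valid indices are 0 .. n - 1.
    read_one_past<<<1, 1>>>(d_in, d_out, n);

    // Errors detected at launch time (bad configuration and similar).
    cudaError_t launchErr = cudaGetLastError();
    // Errors raised while the kernel was running, if the hardware noticed any.
    cudaError_t syncErr = cudaDeviceSynchronize();

    printf("launch: %s\n", cudaGetErrorString(launchErr));
    printf("sync:   %s\n", cudaGetErrorString(syncErr));

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

What the plain run prints is not guaranteed either way, which is the point of the reply above.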

nastaran.kianersi · May 20, 2024, 4:35pm

Thank you for your reply. What about writing to memory? Does this statement hold for memory writes as well? Because in this example, writing to d_output[len] does result in an error even though it is only 1 index further.

Also, I have a similar issue with cudaMallocAsync(). If I allocate d_output using cudaMalloc() (like I have now), the error will be detected. However, if I allocate it with cudaMallocAsync() and leave everything else untouched, it will not be detected… Would you please give your opinion on this as well? Please let me know if it’s necessary to edit this question or create a separate topic.
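
For reference, the `cudaMallocAsync()` variant described above might look like the following sketch. It is a reconstruction, not the original test code: the explicit stream and the `cudaMemsetAsync` initialization are assumptions. Stream-ordered allocations are served from a memory pool, so one plausible (but unconfirmed) reason for the missing error is that the address just past the buffer still falls inside memory the pool has mapped.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same grid-stride copy kernel as in the original post.
__global__ void copy(int* output, const int* input, const int len)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    while (i < len)
    {
        output[i] = input[i];
        i += blockDim.x * gridDim.x;
    }
}

int main()
{
    const int len = 1 << 20;
    const size_t bytes = sizeof(int) * len;

    cudaStream_t stream;  // assumption: the post does not say which stream was used
    cudaStreamCreate(&stream);

    int* d_input = nullptr;
    int* d_output = nullptr;
    cudaMalloc(&d_input, bytes);
    // Stream-ordered allocation, served from the device's memory pool.
    cudaMallocAsync(reinterpret_cast<void**>(&d_output), bytes, stream);
    cudaMemsetAsync(d_input, 0, bytes, stream);

    // Off-by-one on the output: the last write lands one int past d_output.
    copy<<<32, 768, 0, stream>>>(d_output + 1, d_input, len);

    cudaError_t launchErr = cudaGetLastError();
    cudaError_t syncErr = cudaStreamSynchronize(stream);
    printf("launch: %s\n", cudaGetErrorString(launchErr));
    printf("sync:   %s\n", cudaGetErrorString(syncErr));

    cudaFreeAsync(d_output, stream);
    cudaStreamSynchronize(stream);
    cudaFree(d_input);
    cudaStreamDestroy(stream);
    return 0;
}
```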

Robert_Crovella · May 20, 2024, 5:49pm

The exact behavior in all these cases is not specified by NVIDIA.

I will just restate what I said (what you said) already. The correct mental model for all of these cases is: it is undefined behavior, and you may or may not get a runtime-detected error.

If you want to try to eliminate such things from your code, one possible tool is compute-sanitizer.
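
A complementary approach, not mentioned in this thread, is to make the kernel check its own indices with a device-side `assert`. The sketch below is only an illustration: `copy_checked` and the capacity parameters `in_cap` and `out_cap` are hypothetical additions to the original kernel. When an assertion fails, the later synchronization call should return an error (`cudaErrorAssert`) regardless of whether the hardware would have noticed the access. Compiling with `-DNDEBUG` disables the checks.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride copy that is told how many elements each buffer really holds.
// in_cap and out_cap are hypothetical extra parameters for illustration.
__global__ void copy_checked(int* output, int out_cap,
                             const int* input, int in_cap, int len)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    while (i < len)
    {
        assert(i < in_cap);   // fires before an out-of-bounds read
        assert(i < out_cap);  // fires before an out-of-bounds write
        output[i] = input[i];
        i += blockDim.x * gridDim.x;
    }
}

int main()
{
    const int len = 1 << 20;
    int* d_input = nullptr;
    int* d_output = nullptr;
    cudaMalloc(&d_input, sizeof(int) * len);
    cudaMalloc(&d_output, sizeof(int) * len);
    cudaMemset(d_input, 0, sizeof(int) * len);

    // Shifting the input by one leaves only len - 1 valid elements behind the pointer.
    copy_checked<<<32, 768>>>(d_output, len, d_input + 1, len - 1, len);

    cudaError_t err = cudaDeviceSynchronize();
    printf("sync: %s\n", cudaGetErrorString(err));

    // After a failed device-side assert the context is unusable, so reset it.
    cudaDeviceReset();
    return 0;
}
```

The cost is two extra comparisons per element in debug builds, and the caller has to pass the true capacities, which the kernel cannot discover on its own.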
