Illegal memory access with unified memory

Hi,

I’ve encountered a puzzling issue with unified memory on certain GPUs and was hoping to get some help. There’s a reproducer attached, but essentially the test performs various allocations using cudaMallocManaged and writes to some of these buffers on the CPU and on the GPU.

The test app will eventually encounter the below error when calling cudaDeviceSynchronize() immediately after launching test_kernel:

CUDA Runtime Error: an illegal memory access was encountered at test.cpp:69

I’ve run the test app under cuda-gdb and get the following:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x8fd6a8 (test.cpp:58)

Thread 1 "test" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 527, block (26703,0,0), thread (0,0,0), device 0, sm 0, warp 8, lane 0]

The pattern and sizes of the allocations seem to be important for reproducing the issue.

I’ve only seen this on Quadro M6000 24GB GPUs. I can reproduce the issue on several machines, so I don’t believe it is due to a faulty card.

I’ve also tried upgrading to the latest CUDA driver, 530.30.02, without success.

Are there any known issues with UVM on Maxwell cards that would explain this?

Environment:

  • Driver Versions:
    • 510.47.04
    • 530.30.02
  • CentOS Linux release 7.9.2009
  • GPUs (on different machines):
    • Quadro M6000 24GB
  • CUDA Toolkits:
    • 11.2.2
    • 11.6.2
    • 11.8
  • GCC version: 9.3.1 20200408 (Red Hat 9.3.1-2)

Thanks,

cuda_uvm_issue.zip (2.8 KB)

Please post code inline on these forums. That makes it easier to discuss, and it is searchable that way.

You never free this allocation:

        CHECK_CUDA(cudaMallocManaged(&test, 1024 * 52 * sizeof(uint64_t)));

so eventually you are going to run out of memory.

Your CHECK_CUDA macro is also questionable: if an error occurs, it evaluates the command a second time. I don’t think that is central to your question, but it is not a design I would suggest.
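The usual fix is to evaluate the call exactly once, store the result, and check the stored value. Since your test.h isn’t shown, the following is only a minimal sketch of such a single-evaluation design; the exit-on-error behavior and the message format are assumptions chosen to match the output you posted:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Evaluates the CUDA call exactly once and checks the stored result.
#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA Runtime Error: %s at %s:%d\n",     \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Checks the last (possibly sticky) error, e.g. after a kernel launch.
#define CHECK_LAST_CUDA_ERROR() CHECK_CUDA(cudaGetLastError())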

Thanks for pointing out the CHECK_CUDA issue - I’ve fixed the test app.

The allocation that you mentioned only happens once: test is a static variable, so the if (test == nullptr) check only passes on the first call.

Posting the code inline as requested:

#include "test.h"

#include <iostream>

struct Allocs
{
    static constexpr int N = 45;
    Allocs()
    {
        for (int i = 0; i < N; ++i)
        {
            CHECK_CUDA(cudaMallocManaged(&buffers[i], sizes[i]));
        }
    }

    ~Allocs()
    {
        for (int i = N - 1; i >= 0; --i)
        {
            CHECK_CUDA(cudaFree(buffers[i]));
        }
    }
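    // Allocation sizes in bytes; this exact pattern of sizes appears to matter for reproducing the issue.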
    int sizes[N] = {65536,131072,262144,98304,196608,524288,262144,393216,786432,393216,1048576,524288,786432,16384,32768,65536,65536,131072,2293760,4587520,32768,65536,1146880,2293760,196608,393216,196608,65536,131072,98304,196608,393216,1146880,98304,196608,393216,1146880,65536,131072,163840,327680,65536,196608,131072,98304};
    void* buffers[N] = {};
};

// Zeroes each element of a managed buffer from the GPU.
__global__ void test_kernel(double* data, int count) {
    auto idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < count)
    {
        auto &value = data[idx];
        value = 0;
    }
}

void test_writes()
{
    CHECK_CUDA(cudaDeviceSynchronize());
    CHECK_LAST_CUDA_ERROR();

    double* buffer1;
    int count1 = 1 << 23;
    CHECK_CUDA(cudaMallocManaged(&buffer1, count1 * sizeof(double)));

    double* buffer2;
    int count2 = 1 << 23;
    CHECK_CUDA(cudaMallocManaged(&buffer2, count2 * sizeof(double)));
    // Write to unified memory on CPU
    for (int i = 0; i < count2; ++i)
    {
        buffer2[i] = double(i);
    }

    double* buffer3;
    int count3 = 1710050;
    CHECK_CUDA(cudaMallocManaged(&buffer3, count3 * sizeof(double)));

    // Write to unified memory on GPU
    std::cout << "Run test kernel and sync...." << std::endl;
    test_kernel<<<26720, 64>>>(buffer3, count3);

    CHECK_CUDA(cudaDeviceSynchronize());
    CHECK_LAST_CUDA_ERROR();

    // Free buffers
    CHECK_CUDA(cudaFree(buffer3));
    CHECK_CUDA(cudaFree(buffer2));
    CHECK_CUDA(cudaFree(buffer1));

    {
        Allocs allocs;

        static const uint64_t* test = nullptr;
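        // test is static, so this branch runs (and allocates) only once per process; the buffer is never freed.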
        if (test == nullptr)
        {
            std::cout << "cudaMallocManaged temp" << std::endl;
            CHECK_CUDA(cudaMallocManaged(&test, 1024 * 52 * sizeof(uint64_t)));
        }
    }
}

int main()
{   
    int i = 0;
    while (true)
    {
        ++i;
        std::cout << "Count: " << i << std::endl;
        test_writes();
    }
    return 0;
}

I’m able to see the issue as well on a GTX 970 (Maxwell). It happened for me after 736 iterations (first try) or 888 iterations (second try). Curiously, if I run it under compute-sanitizer, it runs for at least 3000 iterations without error.

I suggest filing a bug.

Thanks for confirming that. Interesting to know the issue happens on another Maxwell GPU.

Bug filed:

https://developer.nvidia.com/nvidia_bug/4157877