cudaMemset: illegal memory access with RTX5090 with 570.86.16

  • Summary : illegal memory access with RTX5090 with 570.86.16
  • Relevant Area: usage detected in llama.cpp
  • NVIDIA GPU or System: GeForce RTX 5090
  • CUDA Version: 12.8
  • NVIDIA Software Version: NVIDIA-SMI 570.86.16
  • OS: Leap 15.6, kernel 6.4, x86_64
  • Other Details: tested OK on a Quadro P1000 4 GB

I have done some testing of ollama and llama.cpp with the RTX 5090, which turned up a bug around cudaMemset(). Comparison testing with a simple program that loads a gguf file into memory shows there is a bug in the driver with the RTX 5090 card. I have done the same testing under Win10 with driver version 572 and didn't have any issue. When can we expect an updated version of the driver?

on P1000:

aginies@linux-5530:~/testcuda> ./a.out -f qwen2.5-coder-3b-instruct-q4_0.gguf 
GPU Device 0: Quadro P1000
Total Global Memory: 4034 MB
Free Global Memory: 3994 MB
Memory allocated successfully on the GPU.
Memory initialized successfully on the GPU.
Data loaded from GGUF file to GPU memory.
Data is kept in GPU memory. Press Enter to exit...
Memory freed successfully on the GPU.

on RTX5090:

aginies@ryzen9:~/testcuda>  cuda-gdb --args ./a.out -f qwen2.5-coder-3b-instruct-q4_0.gguf
NVIDIA (R) cuda-gdb 12.8
....
Reading symbols from ./a.out...
(cuda-gdb) run 
Starting program: /home/aginies/testcuda/a.out -f qwen2.5-coder-3b-instruct-q4_0.gguf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff31ff000 (LWP 4024)]
[New Thread 0x7ffff1dff000 (LWP 4025)]
[Detaching after fork from child process 4026]
GPU Device 0: NVIDIA GeForce RTX 5090
[New Thread 0x7fffebfff000 (LWP 4037)]
[New Thread 0x7fffeb7fe000 (LWP 4038)]
Total Global Memory: 32117 MB
Free Global Memory: 31459 MB
Memory allocated successfully on the GPU.
Memory initialized successfully on the GPU.

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7ffddb74e460  memset32

Thread 1 "a.out" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (97,0,0), thread (0,0,0), device 0, sm 79, warp 0, lane 0]
0x00007ffddb74e490 in memset32<<<(243881,1,1),(512,1,1)>>> ()

test.cu code:

#include <iostream>
#include <fstream>
#include <vector>
#include <cstring>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                         \
    do {                                                                         \
        cudaError_t err = call;                                                  \
        if (err != cudaSuccess) {                                                \
            std::cerr << "CUDA error in " << __FILE__ << " at line " << __LINE__ \
                      << ": " << cudaGetErrorString(err) << std::endl;           \
            std::exit(EXIT_FAILURE);                                             \
        }                                                                        \
    } while (0)

void detectGPU(int &deviceID, size_t &totalMemory, size_t &freeMemory) {
    int deviceCount;
    CUDA_CHECK(cudaGetDeviceCount(&deviceCount));

    if (deviceCount == 0) {
        std::cerr << "No CUDA-capable device detected." << std::endl;
        std::exit(EXIT_FAILURE);
    }

    deviceID = 0;
    cudaDeviceProp deviceProp;
    CUDA_CHECK(cudaGetDeviceProperties(&deviceProp, deviceID));

    std::cout << "GPU Device " << deviceID << ": " << deviceProp.name << std::endl;
    totalMemory = deviceProp.totalGlobalMem;
    CUDA_CHECK(cudaMemGetInfo(&freeMemory, &totalMemory));
    std::cout << "Total Global Memory: " << totalMemory / (1024 * 1024) << " MB" << std::endl;
    std::cout << "Free Global Memory: " << freeMemory / (1024 * 1024) << " MB" << std::endl;
}

std::vector<char> loadGGUFFile(const std::string &filename) {
    std::ifstream file(filename, std::ios::binary);
    if (!file) {
        std::cerr << "Failed to open file: " << filename << std::endl;
        std::exit(EXIT_FAILURE);
    }
    file.seekg(0, std::ios::end);
    std::streampos fileSize = file.tellg();
    file.seekg(0, std::ios::beg);
    std::vector<char> buffer(fileSize);
    file.read(buffer.data(), fileSize);
    return buffer;
}

int main(int argc, char *argv[]) {
    if (argc != 3 || std::strcmp(argv[1], "-f") != 0) {
        std::cerr << "Usage: " << argv[0] << " -f <gguf_file>" << std::endl;
        return 1;
    }
    std::string filename = argv[2];
    int deviceID;
    size_t totalMemory, freeMemory;
    detectGPU(deviceID, totalMemory, freeMemory);
    std::vector<char> ggufData = loadGGUFFile(filename);
    char *d_data;
    CUDA_CHECK(cudaSetDevice(deviceID));
    CUDA_CHECK(cudaMalloc(&d_data, ggufData.size()));

    std::cout << "Memory allocated successfully on the GPU." << std::endl;
    CUDA_CHECK(cudaMemset(d_data, 0, ggufData.size()));
    std::cout << "Memory initialized successfully on the GPU." << std::endl;
    CUDA_CHECK(cudaMemcpy(d_data, ggufData.data(), ggufData.size(), cudaMemcpyHostToDevice));
    std::cout << "Data loaded from GGUF file to GPU memory." << std::endl;
    std::cout << "Data is kept in GPU memory. Press Enter to exit..." << std::endl;
    std::cin.get();
    CUDA_CHECK(cudaFree(d_data));
    std::cout << "Memory freed successfully on the GPU." << std::endl;
    return 0;
}
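For reference, the repro above would typically be built and run along these lines (the thread doesn't show the build command, so the flags are an assumption; -g -G emits device debug info so cuda-gdb can attribute the fault):

```shell
# Build with host (-g) and device (-G) debug info, then run under cuda-gdb.
nvcc -g -G -o a.out test.cu
cuda-gdb --args ./a.out -f qwen2.5-coder-3b-instruct-q4_0.gguf
```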

lspci:

05:00.0 VGA compatible controller: NVIDIA Corporation Device 2b85 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device 5303
        Flags: bus master, fast devsel, latency 0, IRQ 92
        Memory at f8000000 (32-bit, non-prefetchable) [size=64M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at e0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at e000 [size=128]
        [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [40] Power Management version 3
        Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+
        Capabilities: [60] Express Legacy Endpoint, MSI 00
        Capabilities: [9c] Vendor Specific Information: Len=14 <?>
        Capabilities: [b0] MSI-X: Enable+ Count=9 Masked-
        Capabilities: [100] #19
        Capabilities: [12c] Latency Tolerance Reporting
        Capabilities: [134] #15
        Capabilities: [140] #24
        Capabilities: [14c] #25
        Capabilities: [158] #26
        Capabilities: [188] #2a
        Capabilities: [1b8] Advanced Error Reporting
        Capabilities: [200] #27
        Capabilities: [248] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [250] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [290] L1 PM Substates
        Capabilities: [2a4] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
        Capabilities: [2bc] Power Budgeting <?>
        Capabilities: [2f4] Device Serial Number b3-66-c4-d2-db-2d-b0-48
        Kernel driver in use: nvidia
        Kernel modules: nvidia_drm, nvidia

Can you add cudaDeviceSynchronize within CUDA_CHECK after each command, and also test its return value?
Just to identify the failing instruction with certainty.

aginies@ryzen9:~/testcuda> cuda-gdb --args ./a.out -f qwen2.5-coder-3b-instruct-q4_0.gguf 
NVIDIA (R) cuda-gdb 12.8
....
Reading symbols from ./a.out...
(cuda-gdb) run
Starting program: /home/aginies/testcuda/a.out -f qwen2.5-coder-3b-instruct-q4_0.gguf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff31ff000 (LWP 4878)]
[New Thread 0x7ffff1dff000 (LWP 4879)]
[Detaching after fork from child process 4880]
CUDA call: cudaGetDeviceCount(&deviceCount) returned no error
[New Thread 0x7fffebfff000 (LWP 4891)]
[New Thread 0x7fffeb7fe000 (LWP 4892)]
cudaDeviceSynchronize returned no error
CUDA call: cudaGetDeviceProperties(&deviceProp, deviceID) returned no error
cudaDeviceSynchronize returned no error
GPU Device 0: NVIDIA GeForce RTX 5090
CUDA call: cudaMemGetInfo(&freeMemory, &totalMemory) returned no error
cudaDeviceSynchronize returned no error
Total Global Memory: 32117 MB
Free Global Memory: 31495 MB
CUDA call: cudaSetDevice(deviceID) returned no error
cudaDeviceSynchronize returned no error
CUDA call: cudaMalloc(&d_data, ggufData.size()) returned no error
cudaDeviceSynchronize returned no error
Memory allocated successfully on the GPU.
CUDA call: cudaMemset(d_data, 0, ggufData.size()) returned no error

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7ffddb74e460  memset32

Thread 1 "a.out" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (97,0,0), thread (0,0,0), device 0, sm 79, warp 0, lane 0]
0x00007ffddb74e490 in memset32<<<(243881,1,1),(512,1,1)>>> ()

CUDA_CHECK code

#define CUDA_CHECK(call)                                                         \
    do {                                                                         \
        cudaError_t err = call;                                                  \
        std::cout << "CUDA call: " << #call << " returned " << cudaGetErrorString(err) << std::endl; \
        if (err != cudaSuccess) {                                                \
            std::cerr << "CUDA error in " << __FILE__ << " at line " << __LINE__ \
                      << ": " << cudaGetErrorString(err) << std::endl;           \
            std::exit(EXIT_FAILURE);                                             \
        }                                                                        \
        err = cudaDeviceSynchronize();                                           \
        std::cout << "cudaDeviceSynchronize returned " << cudaGetErrorString(err) << std::endl; \
        if (err != cudaSuccess) {                                                \
            std::cerr << "CUDA synchronization error in " << __FILE__ << " at line " << __LINE__ \
                      << ": " << cudaGetErrorString(err) << std::endl;           \
            std::exit(EXIT_FAILURE);                                             \
        }                                                                        \
    } while (0)

Okay, so definitely within the cudaMemset and both times at the same address within the data structure.

What is the size of ggufData?

Is it an issue with memory above 2 GiB?

You could try to do the cudaMemset in parts, with several calls in a for loop and pointers within the array, to find out which location fails.

Perhaps for some reason less memory was allocated?


The size of the gguf is fine, only 2 GB. I have done some testing with different sizes of gguf, always the same error. I really think this is due to the driver, as this is not reproducible on Win10 with version 572. I can try to detect when the issue occurs with cudaMemset, but I am quite sure it starts very early.

aginies@ryzen9:~/testcuda> du -sh qwen2.5-coder-3b-instruct-q4_0.gguf
1,9G	qwen2.5-coder-3b-instruct-q4_0.gguf

Well, no need to check whether the file is too big: it fails within the first MB…

aginies@ryzen9:~/testcuda> cuda-gdb --args ./a.out -f qwen2.5-coder-3b-instruct-q4_0.gguf 
NVIDIA (R) cuda-gdb 12.8
....
Reading symbols from ./a.out...
(cuda-gdb) run
Starting program: /home/aginies/testcuda/a.out -f qwen2.5-coder-3b-instruct-q4_0.gguf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff31ff000 (LWP 7059)]
[New Thread 0x7ffff1dff000 (LWP 7060)]
[Detaching after fork from child process 7061]
CUDA call: cudaGetDeviceCount(&deviceCount) at test6.cu:28 returned no error
[New Thread 0x7fffebfff000 (LWP 7082)]
[New Thread 0x7fffeb7fe000 (LWP 7083)]
cudaDeviceSynchronize at test6.cu:28 returned no error
CUDA call: cudaGetDeviceProperties(&deviceProp, deviceID) at test6.cu:37 returned no error
cudaDeviceSynchronize at test6.cu:37 returned no error
GPU Device 0: NVIDIA GeForce RTX 5090
CUDA call: cudaMemGetInfo(&freeMemory, &totalMemory) at test6.cu:41 returned no error
cudaDeviceSynchronize at test6.cu:41 returned no error
Total Global Memory: 32117 MB
Free Global Memory: 31589 MB
CUDA call: cudaSetDevice(deviceID) at test6.cu:85 returned no error
cudaDeviceSynchronize at test6.cu:85 returned no error
CUDA call: cudaMalloc(&d_data, ggufData.size()) at test6.cu:86 returned no error
cudaDeviceSynchronize at test6.cu:86 returned no error
Memory allocated successfully on the GPU. Pointer: 0x7ffcca000000
Memory initialized from offset 0 to 1048576
CUDA call: cudaMemset(d_data + offset, 0, sizeToSet) at test6.cu:95 returned no error

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7ffddb74e460  memset32

Thread 1 "a.out" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (79,0,0), thread (416,0,0), device 0, sm 79, warp 13, lane 0]
0x00007ffddb74e490 in memset32<<<(128,1,1),(512,1,1)>>> ()

(cuda-gdb) info frame
Stack level 0, frame at 0xfffdc0:
 pc = 0x7ffddb74e490 in memset32; saved pc = <not saved>
 Outermost frame: unwinder did not report frame ID
 Arglist at 0xfffdc0, args: 
 Locals at 0xfffdc0,

(cuda-gdb) x/10xg 0x7ffddb74e460
0x7ffddb74e460 <memset32+608>:	0x0000000000000000	0x0000000000000000
0x7ffddb74e470 <memset32+624>:	0x0000000000000000	0x0000000000000000
0x7ffddb74e480 <memset32+640>:	0x0000000000000000	0x0000000000000000
0x7ffddb74e490 <memset32+656>:	0x0000000000000000	0x0000000000000000
0x7ffddb74e4a0 <memset32+672>:	0x0000000000000000	0x0000000000000000

(cuda-gdb) disassemble 0x7ffddb74e460
Dump of assembler code for function memset32:
   0x00007ffddb74e200 <+0>:	LDC R1,c[0x0][0x37c]
   0x00007ffddb74e210 <+16>:	S2UR UR10,SR_CTAID.Y
   0x00007ffddb74e220 <+32>:	LDCU.64 UR12,c[0x0][0x398]
   0x00007ffddb74e230 <+48>:	S2R R2,SR_TID.X
   0x00007ffddb74e240 <+64>:	HFMA2 R3,-RZ,RZ,0,0
   0x00007ffddb74e250 <+80>:	LDCU UR4,c[0x0][0x380]
   0x00007ffddb74e260 <+96>:	S2UR UR8,SR_CTAID.X
   0x00007ffddb74e270 <+112>:	LDCU.64 UR14,c[0x0][0x390]
   0x00007ffddb74e280 <+128>:	UMOV UR5,URZ
   0x00007ffddb74e290 <+144>:	LDC R9,c[0x0][0x360]
   0x00007ffddb74e2a0 <+160>:	UIMAD.WIDE.U32 UR6,UR10,UR12,URZ
   0x00007ffddb74e2b0 <+176>:	UIADD3 UR4,UPT,UPT,URZ,-UR6,-UR4
   0x00007ffddb74e2c0 <+192>:	UIMAD UR7,UR10,UR13,UR7
   0x00007ffddb74e2d0 <+208>:	LDCU.64 UR10,c[0x0][0x358]
   0x00007ffddb74e2e0 <+224>:	USHF.R.U32.HI UR4,URZ,0x2,UR4
   0x00007ffddb74e2f0 <+240>:	ULOP3.LUT UR4,UR4,0x1f,URZ,0xc0,!UPT
   0x00007ffddb74e300 <+256>:	IMAD.WIDE.U32 R2,R9,UR8,R2
   0x00007ffddb74e310 <+272>:	LDCU.64 UR8,c[0x0][0x380]
   0x00007ffddb74e320 <+288>:	UISETP.GE.U64.AND UP0,UPT,UR4,UR14,UPT
   0x00007ffddb74e330 <+304>:	@UP0 LDCU UR4,c[0x0][0x390]
   0x00007ffddb74e340 <+320>:	UIADD3.64 UR6,UPT,UPT,UR6,UR8,URZ
   0x00007ffddb74e350 <+336>:	ISETP.GE.U64.AND P0,PT,R2.reuse,UR4,PT
   0x00007ffddb74e360 <+352>:	IADD.64 R4,R2,UR4
   0x00007ffddb74e370 <+368>:	ISETP.GE.U64.AND P1,PT,R4,UR14,PT
   0x00007ffddb74e380 <+384>:	@!P0 LDC R11,c[0x0][0x388]
   0x00007ffddb74e390 <+400>:	@!P0 LEA R6,P2,R2,UR6,0x2
   0x00007ffddb74e3a0 <+416>:	@!P0 LEA.HI.X R7,R2,UR7,R3,0x2,P2
   0x00007ffddb74e3b0 <+432>:	@!P0 STG.E desc[UR10][R6.64],R11
   0x00007ffddb74e3c0 <+448>:	@P1 EXIT
   0x00007ffddb74e3d0 <+464>:	LDCU UR4,c[0x0][0x370]
   0x00007ffddb74e3e0 <+480>:	LDC R11,c[0x0][0x388]
   0x00007ffddb74e3f0 <+496>:	IMAD.WIDE.U32 R2,R9,UR4,RZ
   0x00007ffddb74e400 <+512>:	UMOV UR4,URZ
   0x00007ffddb74e410 <+528>:	IADD3 R7,PT,PT,R3,UR4,RZ
   0x00007ffddb74e420 <+544>:	MOV R6,R2
   0x00007ffddb74e430 <+560>:	LEA R2,P0,R4,UR6,0x2
   0x00007ffddb74e440 <+576>:	LEA.HI.X R3,R4.reuse,UR7,R5,0x2,P0
   0x00007ffddb74e450 <+592>:	IADD.64 R4,R4,R6
*> 0x00007ffddb74e460 <+608>:	STG.E desc[UR10][R2.64],R11
   0x00007ffddb74e470 <+624>:	ISETP.GE.U64.AND P0,PT,R4,UR14,PT
   0x00007ffddb74e480 <+640>:	@!P0 BRA 0x7ffddb74e430
=> 0x00007ffddb74e490 <+656>:	EXIT
   0x00007ffddb74e4a0 <+672>:	BRA 0x7ffddb74e4a0
   0x00007ffddb74e4b0 <+688>:	NOP
   0x00007ffddb74e4c0 <+704>:	NOP
   0x00007ffddb74e4d0 <+720>:	NOP
   0x00007ffddb74e4e0 <+736>:	NOP
   0x00007ffddb74e4f0 <+752>:	NOP
   0x00007ffddb74e500 <+768>:	NOP
   0x00007ffddb74e510 <+784>:	NOP
   0x00007ffddb74e520 <+800>:	NOP
   0x00007ffddb74e530 <+816>:	NOP
   0x00007ffddb74e540 <+832>:	NOP
   0x00007ffddb74e550 <+848>:	NOP
   0x00007ffddb74e560 <+864>:	NOP
   0x00007ffddb74e570 <+880>:	NOP
End of assembler dump.

cuda code:

#include <iostream>
#include <fstream>
#include <vector>
#include <cstring>
#include <cuda_runtime.h>
#include <iomanip> // For std::hex and std::setw

#define CUDA_CHECK(call)                                                         \
    do {                                                                         \
        cudaError_t err = call;                                                  \
        std::cout << "CUDA call: " << #call << " at " << __FILE__ << ":" << __LINE__ << " returned " << cudaGetErrorString(err) << std::endl; \
        if (err != cudaSuccess) {                                                \
            std::cerr << "CUDA error in " << __FILE__ << " at line " << __LINE__ \
                      << ": " << cudaGetErrorString(err) << std::endl;           \
            std::exit(EXIT_FAILURE);                                             \
        }                                                                        \
        err = cudaDeviceSynchronize();                                           \
        std::cout << "cudaDeviceSynchronize at " << __FILE__ << ":" << __LINE__ << " returned " << cudaGetErrorString(err) << std::endl; \
        if (err != cudaSuccess) {                                                \
            std::cerr << "CUDA synchronization error in " << __FILE__ << " at line " << __LINE__ \
                      << ": " << cudaGetErrorString(err) << std::endl;           \
            std::exit(EXIT_FAILURE);                                             \
        }                                                                        \
    } while (0)

void detectGPU(int &deviceID, size_t &totalMemory, size_t &freeMemory) {
    int deviceCount;
    CUDA_CHECK(cudaGetDeviceCount(&deviceCount));

    if (deviceCount == 0) {
        std::cerr << "No CUDA-capable device detected." << std::endl;
        std::exit(EXIT_FAILURE);
    }

    deviceID = 0;
    cudaDeviceProp deviceProp;
    CUDA_CHECK(cudaGetDeviceProperties(&deviceProp, deviceID));

    std::cout << "GPU Device " << deviceID << ": " << deviceProp.name << std::endl;
    totalMemory = deviceProp.totalGlobalMem;
    CUDA_CHECK(cudaMemGetInfo(&freeMemory, &totalMemory));
    std::cout << "Total Global Memory: " << totalMemory / (1024 * 1024) << " MB" << std::endl;
    std::cout << "Free Global Memory: " << freeMemory / (1024 * 1024) << " MB" << std::endl;
}

std::vector<char> loadGGUFFile(const std::string &filename) {
    std::ifstream file(filename, std::ios::binary);
    if (!file) {
        std::cerr << "Failed to open file: " << filename << std::endl;
        std::exit(EXIT_FAILURE);
    }
    file.seekg(0, std::ios::end);
    std::streampos fileSize = file.tellg();
    file.seekg(0, std::ios::beg);
    std::vector<char> buffer(fileSize);
    file.read(buffer.data(), fileSize);
    return buffer;
}

void printMemoryContents(const char *data, size_t size, const std::string &label) {
    std::cout << label << std::endl;
    for (size_t i = 0; i < size; ++i) {
        if (i % 16 == 0) {
            std::cout << std::hex << std::setw(8) << std::setfill('0') << i << ": ";
        }
        // Cast via unsigned char first to avoid sign extension of bytes >= 0x80.
        std::cout << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(static_cast<unsigned char>(data[i])) << " ";
        if ((i + 1) % 16 == 0) {
            std::cout << std::endl;
        }
    }
    std::cout << std::dec << std::endl;
}

int main(int argc, char *argv[]) {
    if (argc != 3 || std::strcmp(argv[1], "-f") != 0) {
        std::cerr << "Usage: " << argv[0] << " -f <gguf_file>" << std::endl;
        return 1;
    }
    std::string filename = argv[2];
    int deviceID;
    size_t totalMemory, freeMemory;
    detectGPU(deviceID, totalMemory, freeMemory);
    std::vector<char> ggufData = loadGGUFFile(filename);
    char *d_data;
    CUDA_CHECK(cudaSetDevice(deviceID));
    CUDA_CHECK(cudaMalloc(&d_data, ggufData.size()));

    std::cout << "Memory allocated successfully on the GPU. Pointer: " << static_cast<void*>(d_data) << std::endl;

    // Perform cudaMemset in parts
    size_t chunkSize = 1024 * 1024; // 1 MB chunks
    for (size_t offset = 0; offset < ggufData.size(); offset += chunkSize) {
        size_t sizeToSet = std::min(chunkSize, ggufData.size() - offset);
        CUDA_CHECK(cudaMemset(d_data + offset, 0, sizeToSet));
        std::cout << "Memory initialized successfully on the GPU from offset " << offset << " to " << offset + sizeToSet << ". Pointer: " << static_cast<void*>(d_data + offset) << std::endl;
    }

    // Print memory contents before copying
    std::vector<char> hostDataBefore(ggufData.size());
    CUDA_CHECK(cudaMemcpy(hostDataBefore.data(), d_data, ggufData.size(), cudaMemcpyDeviceToHost));
    printMemoryContents(hostDataBefore.data(), std::min(ggufData.size(), static_cast<size_t>(256)), "Memory contents before copying:");

    CUDA_CHECK(cudaMemcpy(d_data, ggufData.data(), ggufData.size(), cudaMemcpyHostToDevice));
    std::cout << "Data loaded from GGUF file to GPU memory. Pointer: " << static_cast<void*>(d_data) << std::endl;

    // Print memory contents after copying
    std::vector<char> hostDataAfter(ggufData.size());
    CUDA_CHECK(cudaMemcpy(hostDataAfter.data(), d_data, ggufData.size(), cudaMemcpyDeviceToHost));
    printMemoryContents(hostDataAfter.data(), std::min(ggufData.size(), static_cast<size_t>(256)), "Memory contents after copying:");

    std::cout << "Data is kept in GPU memory. Press Enter to exit..." << std::endl;
    std::cin.get();
    CUDA_CHECK(cudaFree(d_data));
    std::cout << "Memory freed successfully on the GPU." << std::endl;
    return 0;
}

suggestions:

  1. Simplify your code. If the problem is related to cudaMemset, there should be no need to load a file; hardcode the sizes.
  2. File a bug.

Could it also be a defect in the video RAM?
Can you clear the memory with a kernel setting to 0 instead of cudaMemset?

Even if it happens in the first MB, does it also happen if you allocate much less than 2 GB?
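A minimal sketch of such a replacement kernel (the kernel name, launch configuration, and 2 GiB test size are illustrative assumptions, not the reporter's code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride kernel that zeroes a buffer, standing in for cudaMemset so the
// driver's internal memset32 kernel is taken out of the picture.
__global__ void zeroFill(char *buf, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        buf[i] = 0;
}

int main() {
    const size_t n = 2ULL * 1024 * 1024 * 1024; // 2 GiB, matching the failing case
    char *d_data = nullptr;
    cudaError_t err = cudaMalloc(&d_data, n);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }
    zeroFill<<<1024, 256>>>(d_data, n);
    err = cudaDeviceSynchronize(); // surfaces any illegal-address fault from the kernel
    std::printf("zeroFill: %s\n", cudaGetErrorString(err));
    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

If this kernel succeeds where cudaMemset faults, the problem is isolated to the driver's built-in memset path; if it faults too, the allocation itself (or the VRAM) becomes suspect.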

2^31 does not fit into signed 32 bit integers and can lead to bugs or limitations (e.g. in the operating system). But cudaMalloc normally should handle those sizes for 64-bit host code. Nevertheless it would be good to show it happening with smaller allocation size to give a better bug report (as Robert recommended a bug report probably is the best way to go forward).

Hmm, I doubt it → why would this be working under Windows, then? I can load very big gguf files without any issue (using ollama).

Just FYI: after 2 months the bug is still present, even with the latest beta version released today and CUDA 12.9. NVIDIA was not able to find the root cause.

For others who are reading these things:

Specifically, NVIDIA has not been able to reproduce or observe the issue in-house with production level 5080 and 5090 boards. 5112735

that’s referring to a GPU driver

That is again referring to a driver. CUDA toolkit 12.9 (or whatever the next version after 12.8.1 is called) has not been released, as of today.

Sorry I should have been more precise:

ryzen9:/mnt/156b/home/aginies/testcuda # nvidia-smi
Wed Apr 16 17:27:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.02              Driver Version: 575.51.02      CUDA Version: 12.9     |

This is the output given by the nvidia-smi binary. Also, the previous CUDA 12.8.1 / 12.8 releases and drivers 570.133.07 and 570.86.10 didn't solve the issue. Could this be a hardware issue? I don't know; I didn't get a lot of feedback.

To avoid misunderstandings: the CUDA version reported by nvidia-smi is the maximum CUDA version supported by the installed NVIDIA driver package. nvidia-smi is itself part of that driver package, so the reported CUDA version has nothing to do with any CUDA version(s) actually installed on the machine. As Robert_Crovella points out, the latest released CUDA version is 12.8.1, which I can confirm from installing it on one of my machines yesterday.
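The two version numbers can be checked separately; as a sketch (nvcc is only present if a CUDA toolkit is actually installed):

```shell
# Driver version, and the header line where nvidia-smi prints the *maximum*
# CUDA version that driver supports:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | head -n 4
# The CUDA toolkit actually installed on the machine (if any):
nvcc --version
```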

Generally speaking, a software problem (e.g. driver) is more likely than a hardware issue. For example, after I installed CUDA 12.8.1 I also updated my drivers to the latest and found that the latest released Windows drivers broke TCC operation on my Pascal-based Quadro card, although that is officially still supported (still works fine with the WDDM driver). Layered software environments are sufficiently complex that NVIDIA’s inability to reproduce your issue in-house so far may have nothing to do with hardware (but that could be the case).

In general industry practice, the process of bug resolution requires in-house reproducibility as the first step.

Thanks for the information. This is a bit confusing.

I don’t think you will find anybody here in disagreement with that assessment (see numerous previous threads in these forums touching on this). The way nvidia-smi presents the device summary is generally quite confusing.

I've reproduced a very similar-looking illegal memory access on an RTX 5090, using the code at taichi-play/run_ir/all_in_one_v3.cu at main · hughperkins/taichi-play · GitHub. @aginies, any thoughts on the extent to which this might be the same bug you are experiencing? (I wrote up a description here: taichi-play/run_ir/nvidia_bug_report_for_8730 at main · hughperkins/taichi-play · GitHub)

Sounds like this is a very similar issue.
Any llama or ollama application will lead to the same error. There is something wrong in the Linux driver, as this works under Windows.

Did you find a solution to that problem?

I reported the bug, along with the reproduction approach, here: Log in | NVIDIA Developer. It was fixed on June 23. The fix will be available in CUDA 13.1, or in the next production driver release (I think this means the v570 driver, not the v575 driver), if you are using JIT (which I am, 98% sure; not sure if you are?).

I have 2 identical 5090s (same brand, same model). One is not working in Linux (NVIDIA-SMI 570.169, Driver Version: 570.169, CUDA Version: 12.8): illegal memory access.
But when I switch from Linux to Windows (WSL: NVIDIA-SMI 570.133.07, Driver Version: 572.83, CUDA Version: 12.8), the GPU works without any problems.