- Summary: illegal memory access on RTX 5090 with driver 570.86.16
- Relevant Area: detected via llama.cpp usage
- NVIDIA GPU or System: GeForce RTX 5090
- CUDA Version: 12.8
- NVIDIA Software Version: NVIDIA-SMI 570.86.16
- OS: openSUSE Leap 15.6, kernel 6.4, x86_64
- Other Details: tested OK on a Quadro P1000 4GB
I have done some testing of ollama and llama.cpp with the RTX 5090, which turned up what looks like a driver bug around cudaMemset(). As a comparison test I wrote a simple program that loads a GGUF file into GPU memory: it runs fine on other cards but hits an illegal memory access on the RTX 5090, which points to a problem in the Linux driver for this card. I ran the same test under Windows 10 with driver version 572 and did not see any issue. When can we expect an updated driver?
On the P1000:
aginies@linux-5530:~/testcuda> ./a.out -f qwen2.5-coder-3b-instruct-q4_0.gguf
GPU Device 0: Quadro P1000
Total Global Memory: 4034 MB
Free Global Memory: 3994 MB
Memory allocated successfully on the GPU.
Memory initialized successfully on the GPU.
Data loaded from GGUF file to GPU memory.
Data is kept in GPU memory. Press Enter to exit...
Memory freed successfully on the GPU.
On the RTX 5090 (under cuda-gdb):
aginies@ryzen9:~/testcuda> cuda-gdb --args ./a.out -f qwen2.5-coder-3b-instruct-q4_0.gguf
NVIDIA (R) cuda-gdb 12.8
....
Reading symbols from ./a.out...
(cuda-gdb) run
Starting program: /home/aginies/testcuda/a.out -f qwen2.5-coder-3b-instruct-q4_0.gguf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff31ff000 (LWP 4024)]
[New Thread 0x7ffff1dff000 (LWP 4025)]
[Detaching after fork from child process 4026]
GPU Device 0: NVIDIA GeForce RTX 5090
[New Thread 0x7fffebfff000 (LWP 4037)]
[New Thread 0x7fffeb7fe000 (LWP 4038)]
Total Global Memory: 32117 MB
Free Global Memory: 31459 MB
Memory allocated successfully on the GPU.
Memory initialized successfully on the GPU.
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7ffddb74e460 memset32
Thread 1 "a.out" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (97,0,0), thread (0,0,0), device 0, sm 79, warp 0, lane 0]
0x00007ffddb74e490 in memset32<<<(243881,1,1),(512,1,1)>>> ()
test.cu code:
#include <iostream>
#include <fstream>
#include <vector>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a readable message on any CUDA API failure.
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            std::cerr << "CUDA error in " << __FILE__ << " at line " << __LINE__ \
                      << ": " << cudaGetErrorString(err) << std::endl; \
            std::exit(EXIT_FAILURE); \
        } \
    } while (0)

// Report the first CUDA device and its total/free global memory.
void detectGPU(int &deviceID, size_t &totalMemory, size_t &freeMemory) {
    int deviceCount;
    CUDA_CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount == 0) {
        std::cerr << "No CUDA-capable device detected." << std::endl;
        std::exit(EXIT_FAILURE);
    }
    deviceID = 0;
    cudaDeviceProp deviceProp;
    CUDA_CHECK(cudaGetDeviceProperties(&deviceProp, deviceID));
    std::cout << "GPU Device " << deviceID << ": " << deviceProp.name << std::endl;
    CUDA_CHECK(cudaMemGetInfo(&freeMemory, &totalMemory));
    std::cout << "Total Global Memory: " << totalMemory / (1024 * 1024) << " MB" << std::endl;
    std::cout << "Free Global Memory: " << freeMemory / (1024 * 1024) << " MB" << std::endl;
}

// Read the whole GGUF file into a host buffer.
std::vector<char> loadGGUFFile(const std::string &filename) {
    std::ifstream file(filename, std::ios::binary);
    if (!file) {
        std::cerr << "Failed to open file: " << filename << std::endl;
        std::exit(EXIT_FAILURE);
    }
    file.seekg(0, std::ios::end);
    std::streampos fileSize = file.tellg();
    file.seekg(0, std::ios::beg);
    std::vector<char> buffer(fileSize);
    if (!file.read(buffer.data(), fileSize)) {
        std::cerr << "Failed to read file: " << filename << std::endl;
        std::exit(EXIT_FAILURE);
    }
    return buffer;
}

int main(int argc, char *argv[]) {
    if (argc != 3 || std::strcmp(argv[1], "-f") != 0) {
        std::cerr << "Usage: " << argv[0] << " -f <gguf_file>" << std::endl;
        return 1;
    }
    std::string filename = argv[2];
    int deviceID;
    size_t totalMemory, freeMemory;
    detectGPU(deviceID, totalMemory, freeMemory);
    std::vector<char> ggufData = loadGGUFFile(filename);
    char *d_data;
    CUDA_CHECK(cudaSetDevice(deviceID));
    CUDA_CHECK(cudaMalloc(&d_data, ggufData.size()));
    std::cout << "Memory allocated successfully on the GPU." << std::endl;
    // This is the call that triggers the Warp Illegal Address on the RTX 5090.
    CUDA_CHECK(cudaMemset(d_data, 0, ggufData.size()));
    std::cout << "Memory initialized successfully on the GPU." << std::endl;
    CUDA_CHECK(cudaMemcpy(d_data, ggufData.data(), ggufData.size(), cudaMemcpyHostToDevice));
    std::cout << "Data loaded from GGUF file to GPU memory." << std::endl;
    std::cout << "Data is kept in GPU memory. Press Enter to exit..." << std::endl;
    std::cin.get();
    CUDA_CHECK(cudaFree(d_data));
    std::cout << "Memory freed successfully on the GPU." << std::endl;
    return 0;
}
lspci -v output for the card:
05:00.0 VGA compatible controller: NVIDIA Corporation Device 2b85 (rev a1) (prog-if 00 [VGA controller])
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 5303
Flags: bus master, fast devsel, latency 0, IRQ 92
Memory at f8000000 (32-bit, non-prefetchable) [size=64M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at e0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: [40] Power Management version 3
Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+
Capabilities: [60] Express Legacy Endpoint, MSI 00
Capabilities: [9c] Vendor Specific Information: Len=14 <?>
Capabilities: [b0] MSI-X: Enable+ Count=9 Masked-
Capabilities: [100] #19
Capabilities: [12c] Latency Tolerance Reporting
Capabilities: [134] #15
Capabilities: [140] #24
Capabilities: [14c] #25
Capabilities: [158] #26
Capabilities: [188] #2a
Capabilities: [1b8] Advanced Error Reporting
Capabilities: [200] #27
Capabilities: [248] Alternative Routing-ID Interpretation (ARI)
Capabilities: [250] Single Root I/O Virtualization (SR-IOV)
Capabilities: [290] L1 PM Substates
Capabilities: [2a4] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
Capabilities: [2bc] Power Budgeting <?>
Capabilities: [2f4] Device Serial Number b3-66-c4-d2-db-2d-b0-48
Kernel driver in use: nvidia
Kernel modules: nvidia_drm, nvidia