Hi @amrits, I was having the exact same issue on two machines (one laptop, one PC). I checked whether CUDA still initializes with the following code:
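#include <stdio.h>
#include <cuda.h>

int main(void) {
    /* cuInit must succeed before any other driver API call can be made. */
    CUresult ret = cuInit(0);
    if (ret != CUDA_SUCCESS) {
        /* Print the raw CUresult value so it can be looked up in cuda.h. */
        fprintf(stderr, "cuInit failed! Error code: %d\n", (int)ret);
        return 1;
    }
    printf("CUDA initialized successfully!\n");
    return 0;
}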
Compiled and ran with this:
cc test_cuda.c -lcuda -I/opt/cuda/include && ./a.out
After suspending once, I get the following (999 is CUDA_ERROR_UNKNOWN):
cuInit failed! Error code: 999
If I run
sudo modprobe -r nvidia_uvm && sudo modprobe nvidia_uvm
it works again. I have just now added
options nvidia NVreg_PreserveVideoMemoryAllocations=1
to /etc/modprobe.d/nvidia.conf and enabled nvidia-suspend.service and nvidia-hibernate.service (as per the Arch wiki). That seems to have fixed my issue: I no longer need to unload and reload nvidia_uvm. I have only tested this briefly, though, so if things change I will let you know.
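In case anyone wants to confirm that the option is actually active after a reboot, here is a minimal sketch. It assumes the driver lists its current parameters in /proc/driver/nvidia/params as "Name: value" lines, and that the entry is called PreserveVideoMemoryAllocations there; I have not checked this across driver versions. The function names read_param and preserve_allocations_enabled are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Scan "Name: value" lines from fp and return the integer value of `key`.
 * Returns -1 if the key is missing or its value is not an integer.
 * The "Name: value" layout is an assumption about the params file. */
long read_param(FILE *fp, const char *key) {
    char line[256];
    size_t klen = strlen(key);

    while (fgets(line, sizeof line, fp) != NULL) {
        /* Match the whole key, followed directly by a colon. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long value;
            if (sscanf(line + klen + 1, "%ld", &value) == 1)
                return value;
        }
    }
    return -1;
}

/* Returns 1 if the option is set, 0 if it is off, and -1 if the file
 * cannot be opened or the entry is not found. */
int preserve_allocations_enabled(const char *path) {
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    long value = read_param(fp, "PreserveVideoMemoryAllocations");
    fclose(fp);

    if (value < 0)
        return -1;
    return value != 0;
}
```

Calling preserve_allocations_enabled("/proc/driver/nvidia/params") from a small main and printing the result should be enough. A result of -1 would suggest that the nvidia module is not loaded or that the file looks different from what I assumed.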
In case it helps, attached is an nvidia-bug-report from before I made those changes.
nvidia-bug-report.log.gz (1.8 MB)