With the latest 575.51.02 driver, CUDA started failing to initialize after about a day of uptime

Everything worked fine at first after I upgraded to this driver, but a bit more than a day later I started getting errors like this when trying to run llama-server:

ggml_cuda_init: failed to initialize CUDA: initialization error
Available buffer types:
  CPU

Or like this when trying to run mpv:

[ffmpeg] AVHWDeviceContext: cu->cuInit(0) failed -> CUDA_ERROR_NOT_INITIALIZED: initialization error
[ffmpeg] AVHWDeviceContext: Device creation failure: VK_ERROR_INITIALIZATION_FAILED

What is even stranger, at first, after several attempts to run llama-server and mpv, CUDA just started working again even though I had not changed anything. Shortly after that it stopped working again. Programs that were already running are not affected, so CUDA itself continues to work for them. But now the errors no longer go away.

This feels like a driver bug, so I am attaching the bug report:
nvidia-bug-report.log.gz (3.5 MB)

Is there any workaround to recover, or is a reboot the only option at this point?

I am in the middle of some file operations that may take a very long time, so I cannot reboot easily. Besides, I need a stable system, and the older driver had different issues. I have four 3090 GPUs and am running Ubuntu 25.04 on an EPYC server motherboard with an online UPS. OpenGL and Vulkan seem to continue working; it is only newly started CUDA applications that fail, and I cannot find a way to bring CUDA back. If anyone has any ideas for a workaround, please share.

UPDATE: After dozens of attempts to run llama-server again, it started working, just from the repeated attempts and nothing else, and mpv and other CUDA applications started working too. Just waiting did not seem to help, but actively trying to initialize CUDA does. I also noticed that the issue is more likely to occur if I forcefully stop llama-server with multiple Ctrl+C presses, but it can happen without that too. It is system-wide and not tied to any particular application, so it looks like a driver bug. The attached bug report was generated while the issue was present, so hopefully it provides some insight into the cause.
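
For anyone who wants to automate the "keep trying" part instead of relaunching llama-server by hand, here is a minimal sketch. It is only an illustration of what I did manually, not a confirmed fix. Assumptions: the driver library can be loaded as libcuda.so.1, each attempt should run in a fresh process (as my manual attempts did), and the helper names probe_cuda and retry_cuda_init, the attempt count, and the delay are all made up for this example.

```python
import subprocess
import sys
import time

# Tiny child program: load the CUDA driver library and call cuInit(0).
# Exit status 0 means cuInit returned CUDA_SUCCESS (0); anything else is a failure.
# "libcuda.so.1" is an assumption about where the driver library is found.
PROBE = (
    "import ctypes, sys\n"
    "lib = ctypes.CDLL('libcuda.so.1')\n"
    "lib.cuInit.argtypes = [ctypes.c_uint]\n"
    "lib.cuInit.restype = ctypes.c_int\n"
    "sys.exit(0 if lib.cuInit(0) == 0 else 1)\n"
)


def probe_cuda():
    """Return True if cuInit(0) succeeds in a fresh child process."""
    result = subprocess.run([sys.executable, "-c", PROBE], capture_output=True)
    return result.returncode == 0


def retry_cuda_init(probe=probe_cuda, max_attempts=50, delay_s=1.0, sleep=time.sleep):
    """Call probe() until it succeeds.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. max_attempts and delay_s are arbitrary illustrative values.
    """
    for attempt in range(1, max_attempts + 1):
        if probe():
            return attempt
        print(f"attempt {attempt}: CUDA initialization failed")
        if attempt < max_attempts:
            sleep(delay_s)
    return None
```

Calling retry_cuda_init() returns the number of the attempt that finally succeeded, or None if CUDA never came back within the limit. The probe runs in a child process on purpose, so that a failed initialization in one attempt cannot linger in the process making the next attempt.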


Reading your story was like a horror roller-coaster show, because I also use llama, CUDA, etc.

So it started scary… Then horror… Then calling on the gods, then calm and peace.

I hope there is no 2nd movie about this. I almost died.