I have a working Ubuntu machine with NVIDIA drivers and libraries for running neural networks with TensorRT and Triton Inference Server. The machine had been running inference for several days, and today I found that some issue occurred during the night. Now nvidia-smi
gives me this:
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.171
This has never happened before; I've been using this setup for months. And obviously I didn't do anything myself, since I was asleep. What is strange is that the issue was fixed after a reboot… Please tell me what is happening: why are the NVIDIA drivers being unreliable for me, and why did this happen out of nowhere?
nvidia-bug-report.log.gz (170.9 KB)
Also, what is wrong with GPUtil.getGPUs()? Why does it fail even when it is wrapped in a try/except block in Python? How can I reliably read GPU info (utilization, temperature, VRAM)? And if the drivers are down for some reason, how do I catch the error from GPUtil.getGPUs()?
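For context, here is a minimal sketch of the kind of monitoring code I mean (the helper name read_gpu_stats and the exact fields are only illustrative, not my production script):

```python
import GPUtil


def read_gpu_stats():
    """Read utilization, temperature and VRAM for each visible GPU via GPUtil."""
    try:
        # GPUtil shells out to nvidia-smi under the hood
        gpus = GPUtil.getGPUs()
    except Exception as exc:
        # This broad except is what I expected to catch driver problems,
        # but it does not seem to fire when the driver is in the
        # "Driver/library version mismatch" state
        print(f"GPUtil failed: {exc}")
        return []

    stats = []
    for gpu in gpus:
        stats.append({
            "id": gpu.id,
            "util": gpu.load,              # fraction 0..1
            "temp_c": gpu.temperature,     # degrees Celsius
            "vram_used_mb": gpu.memoryUsed,
            "vram_total_mb": gpu.memoryTotal,
        })
    return stats


if __name__ == "__main__":
    print(read_gpu_stats())
```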
My machine:
Ubuntu 22.04
NVIDIA-SMI 535.171.04
Driver Version: 535.171.04
CUDA Version: 12.2
cuDNN version: 8902
TensorRT version: 8.6.1
And here is a screenshot after the reboot, with nvidia-smi working: