We’ve been running into an issue, primarily on our RTX A4500 and RTX A5000 servers, where the NVIDIA driver crashes on restart. The only reliable fix we’ve found is to reinstall the driver, and it’s gotten to the point where we’re reinstalling it nearly once a day. We’ve tried removing GPUs to isolate the issue, but it seems random. Has anyone else been experiencing this recently? We need to use either CUDA 11.7 or 11.8, since the applications we use require it.
titan2-nvidia-bug-report.log.gz (75.8 KB)