NVIDIA driver crashing on restart (CUDA 11.7.1/11.8)

Hi everyone,

We’ve been running into an issue, primarily on our RTX A4500 and RTX A5000 servers, where the NVIDIA driver crashes on restart. The only reliable fix we’ve found is reinstalling the driver, and it’s gotten to the point where we’re doing that nearly once a day. We’ve tried removing GPUs to isolate the issue, but it seems random. Has anyone else been experiencing this recently? We need to use either CUDA 11.7 or 11.8, since the applications we run require it.

titan2-nvidia-bug-report.log.gz (75.8 KB)

Second bug report, since I can apparently only put one link per post:

kraken-nvidia-bug-report.log.gz (85.5 KB)