Hi, all. I’ll try to keep this short, but I’m often not very good at that.
I have 32 Dell PE R730 servers with two Tesla K80s each, all working well with NVIDIA driver 361.28. I’m now trying to add a single K40m to a couple of other Dell PE R730 servers that previously had no GPUs, using the same OS image and driver, and I get this output when I run “nvidia-smi”:
Unable to determine the device handle for GPU 0000:03:00.0: The NVIDIA kernel module detected an issue with GPU interrupts.Consult the “Common Problems” Chapter of the NVIDIA Driver README for
details and steps that can be taken to resolve this issue.
In this state, I get a kernel panic either after a period of time or as soon as I try to use the GPU (e.g. through CUDA). Running something like “sosreport” triggers the kernel panic every time.
We install the NVIDIA driver (e.g. “NVIDIA-Linux-x86_64-361.28.run --silent”) during the first boot after re-imaging a server. Just today I discovered that if I reboot the server (without reinstalling), it seems to recover, and I get the expected output from “nvidia-smi” after the clean reboot. This has been consistent across several cycles of reinstall followed by a clean reboot, with both the 358.13 and 361.28 drivers.
I also tried “modprobe -r nvidia; modprobe nvidia” to see whether reloading the module would clear the problem, but it didn’t seem to change anything.
If needed, I can script something that reboots after the install only if the host has a K40 in it. I was just hoping someone could shed more light on what might be going on, and whether there’s a cleaner way to fix this. I’m having trouble understanding what the problem is, let alone why it affects the 1xK40 nodes but not the 2xK80 nodes.
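For reference, the selective reboot I have in mind would look roughly like the sketch below. It is only a workaround, not a fix, and it rests on a few assumptions: that the K40m appears in “lspci” output on an NVIDIA line containing “K40”, that “shutdown -r now” is the right way to reboot on this image, and that a marker file (the path here is made up) is enough to keep the host from rebooting more than once per re-image. The function names are my own.

```python
import re
import subprocess
from pathlib import Path

# Hypothetical marker file: records that the one-time reboot already happened.
MARKER = Path("/var/lib/nvidia-postinstall-rebooted")


def has_k40(lspci_output: str) -> bool:
    """Return True if any NVIDIA line in the lspci output names a K40."""
    for line in lspci_output.splitlines():
        if "NVIDIA" in line and re.search(r"\bK40", line):
            return True
    return False


def should_reboot(lspci_output: str, already_rebooted: bool) -> bool:
    """Reboot only on K40 hosts, and only once per re-image."""
    return has_k40(lspci_output) and not already_rebooted


def main() -> None:
    # Meant to be called at the end of the first-boot driver install step.
    lspci = subprocess.run(
        ["lspci"], capture_output=True, text=True, check=True
    ).stdout
    if should_reboot(lspci, MARKER.exists()):
        MARKER.touch()  # write the marker first so the next boot skips this
        subprocess.run(["shutdown", "-r", "now"], check=True)
```

The first-boot script would call main() right after the .run installer finishes. The K80 nodes would fall through without rebooting, since their lspci lines do not match “K40”.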
Any thoughts here?
nvidia-bug-report.log.gz (61.6 KB)