nvidia-smi hangs indefinitely: what could be the issue?

I try to run nvidia-smi from the shell on a Ubuntu 14.04.4 LTS x64 machine, but it hangs indefinitely: what could be the issue?

Here is some more information:

  1. It used to work but stopped working at some point.

  2. Rebooting doesn’t fix the issue.

  3. I installed the Nvidia drivers as follows:

Install Nvidia drivers, CUDA and CUDA toolkit, following some instructions from http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb # Got the link at https://developer.nvidia.com/cuda-downloads
sudo dpkg -i cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
sudo apt-get update
sudo apt-get install cuda

  4. My computer has 4 Titan X GPUs attached to it, and I can still see them with sudo lshw -C display.

I’ve got the same issue on an Ubuntu 14.04 machine with 4 M60 GPUs. CUDA was installed using cuda_7.5.18_linux.run and cuDNN with cudnn-7.0-linux-x64-v4.0-prod.tgz, which IIUC is compatible with CUDA 7.5. (I need cuDNN 4 and not 5 due to some slightly stale code.)

nvidia-smi, which is my main ‘what’s going on with the GPUs’ tool, hangs indefinitely on this machine (but not on any others, which all have 4 K80 GPUs).

anyone? bueller?

If lspci is not returning an answer, this may do the trick: https://askubuntu.com/questions/909991/lspci-returns-cannot-open-sys-bus-pci-devices-xxxxx-resource-no-such-file-or

backup your vitals, then

apt-get remove linux-image-4.4.0-75-generic  

This driver update may also help: http://stackoverflow.com/questions/41489070/nvidia-smi-process-hangs-and-cant-be-killed-with-sigkill-either
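Going by its title, the linked question is about an nvidia-smi process that cannot be killed even with SIGKILL, which usually means it is stuck in uninterruptible sleep (state D). A rough way to check for that without locking up your own shell is to start the command in the background and look at its state after a deadline. This is only a sketch: probe_cmd is a made-up helper name, and the deadline is whatever you choose.

```shell
# Hypothetical helper (not part of any NVIDIA tool): run a command in the
# background and report whether it finished within DEADLINE seconds.
# It never waits on the child, so a stuck command cannot block the caller.
probe_cmd() {
    local deadline="$1"; shift
    "$@" >/dev/null 2>&1 &
    local pid=$!
    local waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$deadline" ]; then
            # Still alive: print the process state (D = uninterruptible sleep)
            echo "still running after ${deadline}s, state: $(ps -o stat= -p "$pid")"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "finished within ${deadline}s"
    return 0
}
```

Usage would be something like probe_cmd 10 nvidia-smi. If the reported state starts with D, the process is blocked inside the kernel and signals will not help.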

Franck, Jery, I noticed both of you have 4 GPUs on your system. Are these single-CPU machines? I read somewhere that the hang may be due to hardware interrupt issues. Something about the first core on a CPU having to handle all hardware interrupts, and it can cause issues if the interrupt handler runs too slowly.

So I’m wondering if 4 GPUs is too much for the CPU’s interrupt handler. I’m experiencing the same issue on a dual-Xeon machine with 8 GPUs (4 GPUs per Xeon processor).
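One way to see whether a single core really is taking all the GPU interrupts is to total the per-CPU counters in /proc/interrupts for the rows that belong to the driver. A minimal sketch, assuming the driver's rows contain the string "nvidia" (check your own /proc/interrupts first); irq_per_cpu is a made-up helper name:

```shell
# Hypothetical helper: sum per-CPU interrupt counts for every row of a
# /proc/interrupts-style file that matches PATTERN.
irq_per_cpu() {
    local pattern="$1" file="${2:-/proc/interrupts}"
    awk -v pat="$pattern" '
        NR == 1 { ncpu = NF; next }      # header row: one column per CPU
        $0 ~ pat {
            # field 1 is the IRQ label, fields 2..ncpu+1 are the counts
            for (i = 1; i <= ncpu; i++) total[i] += $(i + 1)
        }
        END {
            for (i = 1; i <= ncpu; i++) printf "CPU%d %d\n", i - 1, total[i]
        }
    ' "$file"
}
```

Running irq_per_cpu nvidia prints one line per CPU. If nearly everything lands on CPU0, that would at least be consistent with the interrupt theory, though it does not prove it.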

I’ve tried forcing persistence to “on”, which sped up nvidia-smi drastically (it used to take 5 seconds to gather all GPU data, but now it’s instant). However, nvidia-smi just hung on me right now, after calling it while 6 GPUs were running.
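For anyone wanting to try the same thing: persistence mode can be switched on with sudo nvidia-smi -pm 1, and the current setting can be read back through the persistence_mode query field. The small filter below is only a sketch (the function name is made up) that lists the GPUs where it is not enabled, assuming the usual "index, value" CSV rows:

```shell
# Hypothetical helper: read "index, persistence_mode" CSV rows on stdin
# and print the index of every GPU whose mode is not "Enabled".
gpus_without_persistence() {
    awk -F', *' '$2 != "Enabled" { print $1 }'
}
```

Intended use: nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader | gpus_without_persistence. Empty output means every GPU already has it on.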

Are you guys still having this issue? I’m debating removing 2 GPUs so each CPU only has to query 3 GPUs, but that would be disappointing.

Hi! tinkerthinker.

I am experiencing the same issue. I, too, have 4 GPUs with one CPU. The workstation worked fine when using 2 GPUs. The hangs happened when using 3 or 4 GPUs.

Besides the interrupt issue, I also suspect a temperature problem, since 4 reference GPUs generate a lot of heat when there is no water cooling.
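To check the temperature guess while nvidia-smi still responds, the temperature.gpu query field can be logged and filtered. A minimal sketch; hot_gpus is a made-up name and the 85 C default is an arbitrary assumption, not a documented limit for these cards:

```shell
# Hypothetical helper: read "index, temperature" CSV rows on stdin and
# print the GPUs at or above LIMIT degrees C (default 85, an arbitrary pick).
hot_gpus() {
    local limit="${1:-85}"
    awk -F', *' -v limit="$limit" '$2 + 0 >= limit { print "GPU " $1 ": " $2 " C" }'
}
```

Intended use: nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | hot_gpus 85, run under load shortly before the hangs normally start.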