Debugging adapter disappearing/malfunction

Hi!

I have a fresh install of Ubuntu 13.04 with the driver and the CUDA toolkit that are in the Ubuntu repo. nvidia-smi returned info from all available adapters, two Tesla C2050 cards. I thought everything was nice and working.

Users coming to test the cluster said that they get weird error messages, and indeed, even nvidia-smi says:

NVIDIA: could not open the device file /dev/nvidia0 (No such file or directory).
NVIDIA-SMI has failed because it couldn't communicate with NVIDIA driver. Make sure that latest NVIDIA driver is installed and running.

There is indeed no /dev/nvidia0, but there is a /dev/nvidia1. Practically nothing has happened since I last saw nvidia-smi work. The machine is in a well-cooled server room, so it’s definitely not due to overheating (not to mention that, at most, it was rendering the lightdm login screen).

The driver is loaded (lsmod | grep nvidia), and lspci shows both cards. What is the standard procedure for debugging such situations? I’d like to know the order of things to check, such as lspci, the driver, driver queries, the X server config, and the like.

Thank you for the help.

I have found an interesting phenomenon…

Ubuntu 13.04 (desktop version), driver version 130.44, two Tesla C2050 cards. System installation and configuration all go fine. The machine is in a server room inside a rack, so once installation is finished, I unplug the monitor, keyboard, and mouse and leave.

After that, if I restart the machine without the monitor hooked up, /dev/nvidia0 is not present. If the monitor is hooked up at boot time, everything is fine: even when I check later via ssh, nvidia-smi still shows both adapters.

Am I doing something wrong? There was no such issue with the GUI-less Scientific Linux system.

Look at step 6 here:

CUDA Toolkit Documentation

Alternatively, you could just put a dummy monitor load on your video output:

DVI to VGA Dummy.....56K!
[url]http://www.bononia.it/~renzo/keap/VirXGA.pdf[/url]

or

mrb's blog

Next time you find the NVIDIA /dev files missing, try running the script from the CUDA Linux release notes, and see if that fixes it:

Release Notes :: CUDA Toolkit Documentation

These /dev files are also automatically made by X when it starts, which is why you often don’t need to run a separate script to make them. It is possible that the version of X in Ubuntu 13.04 is failing to do that when a monitor is not plugged into the card. (No idea why, but worth a shot.)
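
For checking which device nodes are missing before creating anything, here is a minimal Python sketch. It is not the script from the release notes, just an illustration of the same idea: count the NVIDIA GPUs that lspci reports, then compare against the /dev/nvidiaN and /dev/nvidiactl nodes. The function names are made up for this example. It assumes the NVIDIA character devices use major number 195 and that /dev/nvidiactl uses minor 255, and that the GPUs show up in lspci as either "3D controller" or "VGA compatible controller".

```python
import os
import stat

# Assumptions: the NVIDIA driver's character devices use major 195,
# and the control node /dev/nvidiactl uses minor 255.
NVIDIA_MAJOR = 195
NVIDIACTL_MINOR = 255


def count_nvidia_gpus(lspci_output):
    """Count NVIDIA GPUs in the text printed by `lspci`."""
    count = 0
    for line in lspci_output.splitlines():
        lowered = line.lower()
        is_gpu = ("3d controller" in lowered
                  or "vga compatible controller" in lowered)
        if "nvidia" in lowered and is_gpu:
            count += 1
    return count


def expected_nodes(gpu_count):
    """Return (path, minor) pairs: one node per GPU plus the control node."""
    nodes = [("/dev/nvidia%d" % i, i) for i in range(gpu_count)]
    nodes.append(("/dev/nvidiactl", NVIDIACTL_MINOR))
    return nodes


def missing_nodes(gpu_count, exists=os.path.exists):
    """Return the expected (path, minor) pairs that are not present."""
    return [(path, minor) for path, minor in expected_nodes(gpu_count)
            if not exists(path)]


def create_node(path, minor):
    """Create one character device node (needs root, module loaded)."""
    os.mknod(path, stat.S_IFCHR | 0o666, os.makedev(NVIDIA_MAJOR, minor))
    os.chmod(path, 0o666)  # mknod's permission bits are filtered by the umask
```

To use it, pass the output of lspci to count_nvidia_gpus, feed the result to missing_nodes, and only call create_node (as root, after the nvidia module is loaded) for the entries it returns. On the machine described above it would be expected to report /dev/nvidia0 as missing after a boot without a monitor.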