I have a fresh install of Ubuntu 13.04 with the NVIDIA driver and the CUDA toolkit from the Ubuntu repositories. nvidia-smi returned info for all available adapters, two Tesla C2050 cards, so I thought everything was working fine.
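In case it matters, this is roughly how I checked what got installed (the grep patterns are just what I use, not exact package names):

    # list the NVIDIA driver and CUDA packages pulled in from the repo
    dpkg -l | grep -i nvidia
    dpkg -l | grep -i cuda
    # confirm which kernel module version is actually loaded
    cat /proc/driver/nvidia/version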
Users coming to test the cluster said that they get weird error messages, and indeed, even nvidia-smi says:
    NVIDIA: could not open the device file /dev/nvidia0 (No such file or directory).
    NVIDIA-SMI has failed because it couldn't communicate with NVIDIA driver. Make sure that latest NVIDIA driver is installed and running.
There is indeed no /dev/nvidia0, but there is a /dev/nvidia1. Practically nothing happened to the machine since I last saw nvidia-smi work. It sits in a well-cooled server room, so it's definitely not overheating. (Not to mention it was pretty much just rendering the lightdm login screen at best.)
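This is what I'm looking at on the device side (just the basic checks I ran, nothing exotic):

    # only one of the two per-GPU device nodes is present
    ls -l /dev/nvidia*
    # the control node should exist as well
    ls -l /dev/nvidiactl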
The driver is loaded (lsmod | grep nvidia shows the module), and lspci shows both cards. What is the standard procedure for debugging situations like this? I'd like to know the order in which to check things: lspci, the kernel module, driver queries, the X server config, and so on… (see the rough checklist below).
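To make the question concrete, this is the kind of checklist I have in mind, in order; the exact commands are just my guess at a sensible sequence:

    # 1. hardware visible on the PCI bus
    lspci | grep -i nvidia
    # 2. kernel module loaded and its version
    lsmod | grep nvidia
    cat /proc/driver/nvidia/version
    # 3. device nodes present with sane permissions
    ls -l /dev/nvidia*
    # 4. driver query from user space
    nvidia-smi -q
    # 5. kernel log for driver errors (NVRM messages)
    dmesg | grep -i nvrm
    # 6. X server configuration and log
    grep -i nvidia /etc/X11/xorg.conf /var/log/Xorg.0.log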
Thank you for the help.