I’m attempting to do some compute work on my GeForce GTX 690 with CUDA 7.0 on Ubuntu Server 14.04.
It seems like basically any program that tries to access the card on my system hangs and can’t be killed. This includes some trivially simple test code I wrote and compiled with nvcc, as well as nvidia-smi itself.
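To give a concrete sense of “trivially simple”, something along these lines is enough to trigger the hang for me (this is just a sketch, not my exact file; the first call that touches the driver simply never returns):

```
// hang_test.cu: minimal sketch; any first driver-initializing call blocks the same way
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    // First call that initializes the driver; on my box it never returns.
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("found %d CUDA device(s)\n", count);
    return 0;
}
```

Compiled with nothing special: nvcc hang_test.cu -o hang_test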
I ran strace on such a program, and it blocks in an open() syscall on the device file /dev/nvidiactl.
Trying to rmmod the driver module fails with a “device is busy” message, but I’ve checked for processes holding an open file descriptor on the device and there are none.
I’d love some help figuring out how to even investigate this further. I don’t know whether it’s related to the hardware, the kernel module, my CUDA version, or some installation inconsistency.
I’ve used the CUDA libraries with this card on this machine before (via Theano), but now I’m trying to play with TensorFlow, and have therefore uninstalled and reinstalled newer versions of CUDA, etc. I don’t know what state things were in when they were working before…they just worked!
I have uninstalled and reinstalled CUDA 7.0 several times. At least one of those times, I was able to run nvidia-smi once and some compute code once, then things started freezing again. After a reboot it shows the same issue. I haven’t tried uninstalling and reinstalling yet again…it’s rather tedious to do repeatedly!
Let me know what further info I can share, and hopefully this can be sorted out. Thanks in advance!
I should mention that the system is headless, and no graphics or desktop environment programs are running (no X11, no Gnome/Unity/etc). No display is attached to the machine.
You might have a conflict with the nouveau driver (it’s discussed in the install guide). Furthermore, installing components (CUDA toolkit, GPU driver) with a mix of runfile installer methods and package manager methods is a recipe for trouble, as the install guide also notes.
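If nouveau turns out to be the culprit, the fix described in the install guide is to blacklist it and rebuild the initramfs, roughly like this (the file name below is just the conventional one):

```
# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
```

then run sudo update-initramfs -u and reboot; afterwards lsmod | grep nouveau should come back empty.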
I have some updates on this. Nothing I’ve tried seems to have permanently fixed the issue, though (maddeningly) it comes and goes.
I have reinstalled the OS multiple times (I’m getting really efficient at it). I’m now running Ubuntu Server 15.04. I installed CUDA 7.5 via the runfile (only) on a totally clean OS. TensorFlow should work fine with 7.5 these days, but in any case I have reproduced the issue with non-TensorFlow code (a directly compiled C++/CUDA program).
I have run both connected to and disconnected from a display (not that this should matter, but thought I’d try). The problem manifests in both states.
Nouveau is blacklisted and doesn’t get loaded into the kernel at any point.
In the most recent failure, I caught the following output in my dmesg log:
That AMD motherboard is pretty old. Have you disabled the on-board graphics?
You might also want to see if there are any BIOS updates for that motherboard. The latest BIOS update utility appears to have a 2016 date on it, whereas your BIOS appears to be from 2010.
I tried to find a BIOS update but couldn’t spot anything more recent – can you link me to the 2016 version you found?
I am planning to buy a new motherboard ASAP anyway, since my new Titan X is invisible to this one (I think we have exchanged some comments on another thread about that :).