I’m attempting to do some computing work on my GeForce 690 using CUDA 7.0 on Ubuntu Server 14.04.
It seems like basically any program that tries to access the card on my system hangs, and can’t be killed. This includes some trivially simple code I wrote and compiled with nvcc, as well as, e.g., nvidia-smi.
I ran an strace on such a program, and it halts on an open syscall attempting to open the device file “/dev/nvidiactl”.
Trying to rmmod the device driver gives a message saying the device is busy, but I’ve scanned for any programs with an open file descriptor pointing to the device and there are none.
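For anyone repeating the descriptor scan described above, here is a minimal sketch in Python. It assumes a Linux /proc layout and generally needs root to read other users' fd directories. The function names are made up for illustration. It also lists processes in uninterruptible sleep (state D). One possible explanation for the empty scan, not confirmed in this thread, is that a process blocked inside open() has no descriptor yet, so it would appear only in the state-D list.

```python
import os


def _pid_dirs(proc_root):
    """Return the numeric (per-process) entries under proc_root, or [] if unreadable."""
    try:
        return [e for e in os.listdir(proc_root) if e.isdigit()]
    except OSError:
        return []


def find_device_holders(proc_root="/proc", prefix="/dev/nvidia"):
    """Map pid -> list of open device paths that start with prefix."""
    holders = {}
    for entry in _pid_dirs(proc_root):
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.startswith(prefix):
                holders.setdefault(int(entry), []).append(target)
    return holders


def find_uninterruptible(proc_root="/proc"):
    """Return (pid, comm) for every process whose state is D."""
    stuck = []
    for entry in _pid_dirs(proc_root):
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue
        # Format: "pid (comm) state ..."; comm may contain spaces or
        # parentheses, so split on the first "(" and the last ")".
        lpar, rpar = stat.find("("), stat.rfind(")")
        if lpar == -1 or rpar == -1:
            continue
        fields = stat[rpar + 1:].split()
        if fields and fields[0] == "D":
            stuck.append((int(entry), stat[lpar + 1:rpar]))
    return stuck


if __name__ == "__main__":
    for pid, paths in sorted(find_device_holders().items()):
        print("pid %d holds %s" % (pid, ", ".join(sorted(set(paths)))))
    for pid, comm in find_uninterruptible():
        print("pid %d (%s) is in uninterruptible sleep" % (pid, comm))
```

If the first list is empty but the second shows the hung programs, the processes are stuck in the kernel rather than holding the device open, and killing them will not help until the driver call returns.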
I’d love some help in figuring out how to even investigate this further. I don’t know if it’s related to the hardware, kernel module, my cuda version, or some installation inconsistency.
I’ve used cuda libs before with this card on this machine (via theano), but I’m trying to play with TensorFlow, and have therefore uninstalled and reinstalled new versions of cuda, etc. I don’t know what state things were in when they were working before…they just worked!
I have uninstalled and reinstalled cuda 7.0 several times. At least one of those times, I was able to run nvidia-smi once and some compute code once, then things started freezing again. After reboot, it seems to have the same issue. I haven’t tried uninstalling and reinstalling again…it’s rather tedious to do repeatedly!
Let me know what further info I can share, and hopefully this can be sorted out. Thanks in advance!
I should mention that the system is headless, and no graphics or desktop environment programs are running (no X11, no Gnome/Unity/etc). No display is attached to the machine.
Start over with a clean load of the OS.
Then follow the instructions in the linux install guide carefully.
For Tensorflow, you will need to use CUDA 7. If you intend to install CUDA via package manager method, you’ll want to read this thread first:
I’m having almost the same problem with Ubuntu 14.04, two GTX Titan Xs and Cuda 7.0.
Programs using Cuda can run once after reboot, but after that even nvidia-smi freezes when it tries to open “/dev/nvidiactl”.
Did you find any solution to this?
Or do I have to try reinstalling things again?
You might have a conflict with the nouveau driver (it’s discussed in the install guide). Furthermore, if you’ve installed things (CUDA toolkit, GPU driver) using a mix of runfile installer methods and package manager methods, that is a recipe for trouble as indicated in the install guide.
I have some updates on this. Nothing I’ve tried seems to have permanently fixed the issue, though (maddeningly) it comes and goes.
I have reinstalled the OS multiple times (getting really efficient at it). I’m now running 15.04 Ubuntu server. I installed cuda 7.5 via the runfile (only) on a totally clean OS. Tensorflow should work fine with 7.5 these days, but anyway I have replicated this issue with non-tensorflow code (directly compiled c++/cuda program).
I have run both connected to and disconnected from a display (not that this should matter, but thought I’d try). The problem manifests in both states.
Nouveau is blacklisted and doesn’t get loaded into the kernel at any point.
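A quick way to double-check the nouveau claim is sketched below. It assumes the usual Ubuntu locations (/proc/modules for loaded modules, /etc/modprobe.d/*.conf for blacklist entries). The helper names are hypothetical. It also prints the nvidia module's use count, which may help explain an rmmod "busy" message.

```python
import glob


def parse_proc_modules(text):
    """Parse /proc/modules content into {module_name: use_count}."""
    mods = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        try:
            mods[parts[0]] = int(parts[2])  # third field is the use count
        except ValueError:
            continue  # use count can be "-" on some kernel configs
    return mods


def is_blacklisted(conf_text, module):
    """True if a modprobe config text contains an active 'blacklist <module>' line."""
    for line in conf_text.splitlines():
        words = line.split("#", 1)[0].split()  # drop comments
        if len(words) >= 2 and words[0] == "blacklist" and words[1] == module:
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/modules") as f:
            mods = parse_proc_modules(f.read())
    except OSError:
        mods = {}
    print("nouveau loaded:", "nouveau" in mods)
    if "nvidia" in mods:
        print("nvidia use count:", mods["nvidia"])

    listed = False
    for path in glob.glob("/etc/modprobe.d/*.conf"):
        try:
            with open(path) as f:
                listed = listed or is_blacklisted(f.read(), "nouveau")
        except OSError:
            pass
    print("nouveau blacklisted in /etc/modprobe.d:", listed)
```

A non-zero nvidia use count with no visible holder of the device files would be consistent with a reference taken inside the kernel, for example by a process stuck in a driver call.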
In the most recent failure, I caught the following output in my dmesg log:
[ 172.277550] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 172.277846] IP:  _nv002814rm+0x51d/0x610 [nvidia]
[ 172.277849] PGD 0
[ 172.277853] Oops: 0000 [#1] SMP
[ 172.277894] Modules linked in: nvidia_uvm(POE) ctr ccm nvidia(POE) arc4 ath9k ath9k_common ath9k_hw snd_hda_codec_realtek snd_hda_codec_generic ath mac80211 snd_hda_codec_hdmi cfg80211 ppdev snd_hda_intel snd_hda_controller kvm_amd snd_hda_codec kvm snd_hwdep edac_core drm snd_pcm edac_mce_amd k10temp snd_timer snd serio_raw soundcore shpchp parport_pc parport tpm_infineon 8250_fintek i2c_nforce2 mac_hid autofs4 r8169 pata_acpi psmouse mii sata_nv pata_amd
[ 172.277902] CPU: 1 PID: 993 Comm: python Tainted: P OE 3.19.0-68-generic #76-Ubuntu
[ 172.277904] Hardware name: MICRO-STAR INTERNATIONAL CO.,LTD MS-7597/GF615M-P33 (MS-7597), BIOS V2.7 12/13/2010
[ 172.277908] task: ffff88009a248000 ti: ffff88009a5a4000 task.ti: ffff88009a5a4000
[ 172.278150] RIP: 0010:  _nv002814rm+0x51d/0x610 [nvidia]
[ 172.278153] RSP: 0018:ffff88009a5a7a28 EFLAGS: 00010246
[ 172.278156] RAX: 0000000000000000 RBX: ffff88009aa14008 RCX: 0000000000000000
[ 172.278158] RDX: 0000000000000000 RSI: 0000000000000011 RDI: 0000000000000000
[ 172.278160] RBP: ffff880244422f58 R08: 0000000000000014 R09: ffff88009a3c0870
[ 172.278162] R10: 0000000000000296 R11: ffffffffc1431db0 R12: 0000000000000000
[ 172.278164] R13: 0000000000000001 R14: 0000000000000001 R15: ffff88009b922008
[ 172.278168] FS: 00007fd6d10e5700(0000) GS:ffff88024fc40000(0000) knlGS:0000000000000000
[ 172.278170] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 172.278173] CR2: 0000000000000020 CR3: 0000000001c13000 CR4: 00000000000007e0
[ 172.278174] Stack:
[ 172.278179] ffff88009aa14008 ffff8802439e9008 ffff880244e05008 0000000000000000
[ 172.278183] ffff88009ae0c008 ffffffffc100f540 ffff8802439e9008 ffff88009aa14008
[ 172.278187] 0000000000001100 0000000000000000 0000000000000024 ffffffffc1006291
[ 172.278188] Call Trace:
[ 172.278429]  ? _nv003016rm+0xf0/0x1c0 [nvidia]
[ 172.278664]  ? _nv003007rm+0x11/0x50 [nvidia]
[ 172.279015]  ? _nv002020rm+0x2680/0x3c80 [nvidia]
[ 172.279267]  ? _nv000654rm+0x2b9/0x340 [nvidia]
[ 172.279518]  ? rm_disable_adapter+0x6a/0x130 [nvidia]
[ 172.279771]  ? nv_uvm_notify_stop_device+0x46/0x60 [nvidia]
[ 172.280022]  ? nvidia_close+0x1a6/0x410 [nvidia]
[ 172.280275]  ? nvidia_frontend_close+0x4d/0xa0 [nvidia]
[ 172.280283]  ? __fput+0xe7/0x250
[ 172.280287]  ? ____fput+0xe/0x10
[ 172.280294]  ? task_work_run+0xa1/0xc0
[ 172.280299]  ? do_exit+0x368/0xa70
[ 172.280305]  ? poll_select_copy_remaining+0x130/0x130
[ 172.280310]  ? recalc_sigpending+0x1f/0x60
[ 172.280314]  ? do_group_exit+0x45/0xb0
[ 172.280319]  ? get_signal+0x2a9/0x760
[ 172.280324]  ? poll_select_copy_remaining+0x130/0x130
[ 172.280331]  ? do_signal+0x28/0xac0
[ 172.280335]  ? poll_select_copy_remaining+0x130/0x130
[ 172.280342]  ? pick_next_task_fair+0x6af/0x8b0
[ 172.280347]  ? read_tsc+0x9/0x10
[ 172.280353]  ? do_notify_resume+0x69/0xb0
[ 172.280359]  ? int_signal+0x12/0x17
[ 172.280397] Code: 17 75 00 31 c9 44 89 f2 be 2c 00 00 00 48 89 c7 ff 50 20 48 85 c0 49 89 c4 0f 84 d6 00 00 00 31 c9 31 d2 be 11 00 00 00 4c 89 e7 <41> ff 54 24 20 be 30 00 00 00 48 8b b8 a8 05 00 00 48 89 c3 ff
[ 172.280636] RIP  _nv002814rm+0x51d/0x610 [nvidia]
[ 172.280638] RSP
[ 172.280639] CR2: 0000000000000020
[ 172.280644] ---[ end trace 9effa87fe1741704 ]---
[ 172.280647] Fixing recursive fault but reboot is needed!
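To make a log like the one above easier to compare across failures, the following rough Python sketch pulls out the fault type, the faulting process, the RIP symbol, and the call-trace symbols. The regular expressions are assumptions based only on the line shapes visible in this log, and `summarize_oops` is a made-up helper name. Other kernel versions format these lines differently.

```python
import re
import sys

# symbol+0xoff/0xlen, optionally followed by " [module]"
SYM = re.compile(r"([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?")
BUG = re.compile(r"BUG: unable to handle kernel (.+) at ([0-9a-f]+)")
TASK = re.compile(r"PID: (\d+) Comm: (\S+)")


def summarize_oops(log_text):
    """Extract key fields from the first oops found in dmesg-style text."""
    out = {"fault": None, "address": None, "pid": None,
           "comm": None, "rip": None, "trace": []}
    in_trace = False
    for line in log_text.splitlines():
        if in_trace:
            m = SYM.search(line)
            if m:
                out["trace"].append((m.group(1), m.group(2)))
                continue
            in_trace = False  # first non-symbol line ends the trace
        if "Call Trace:" in line and not out["trace"]:
            in_trace = True
            continue
        m = BUG.search(line)
        if m and out["fault"] is None:
            out["fault"], out["address"] = m.group(1), m.group(2)
        m = TASK.search(line)
        if m and out["pid"] is None:
            out["pid"], out["comm"] = int(m.group(1)), m.group(2)
        if "RIP:" in line and out["rip"] is None:
            m = SYM.search(line)
            if m:
                out["rip"] = (m.group(1), m.group(2))
    return out


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python oops_summary.py saved_dmesg.txt
    with open(sys.argv[1]) as f:
        info = summarize_oops(f.read())
    print("%s at %s in pid %s (%s)" % (info["fault"], info["address"],
                                       info["pid"], info["comm"]))
    print("RIP:", info["rip"])
    for sym, mod in info["trace"]:
        print("  %s [%s]" % (sym, mod or "kernel"))
```

On the log above this reports a NULL pointer dereference inside the nvidia module for PID 993 (python), with a trace that runs through nvidia_close and do_exit. Frames marked with `?` are ones the kernel does not consider reliable, so this is only a hint. Still, it would be consistent with the driver faulting while the first process exits and closes the device, after which later opens hang.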
Maybe the GPU/motherboard combo is flaky.
That AMD motherboard is pretty old. Have you disabled the on-board graphics?
You might also want to see if there are any BIOS updates for that motherboard. The latest BIOS update utility appears to have a 2016 date on it, whereas your BIOS appears to be from 2010.
I tried to find a BIOS update but couldn’t spot anything more recent – can you link me to the 2016 version you found?
I am planning to buy a new motherboard ASAP anyway, since my new Titan X is invisible to this one (I think we have exchanged some comments on another thread about that :).
I was looking at this page:
At the “live update” utility, which is dated 2016, but it says:
“Online update BIOS/Driver/Firmware/Utility. • Live Monitor auto-detects and suggests the latest BIOS/Driver/Utilities information.”
So it may not suggest any BIOS newer than the one you have. Plus it looks like you need to run windows on it to use that.
However, it looks like you have a 2.7 BIOS, and according to this unofficial site there may be a 2.9 BIOS available (dated 2011):