GPU loss

Hello,

My OS is Ubuntu 14.04 (64-bit) and my GPU is a GTX 1080. When I use the GPU to run a CUDA program, the GPU gets lost. I collected the information below. What is the problem, and how can I fix it?

~$ nvidia-smi
Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU

~$ sudo dmesg |tail -n 100

[ 13.886088] audit: type=1400 audit(1500106410.753:6): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=554 comm="apparmor_parser"
[ 13.886090] audit: type=1400 audit(1500106410.753:7): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=554 comm="apparmor_parser"
[ 13.886309] audit: type=1400 audit(1500106410.753:8): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=555 comm="apparmor_parser"
[ 13.886312] audit: type=1400 audit(1500106410.753:9): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=555 comm="apparmor_parser"
[ 13.886321] audit: type=1400 audit(1500106410.753:10): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=554 comm="apparmor_parser"
[ 13.886324] audit: type=1400 audit(1500106410.753:11): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=554 comm="apparmor_parser"
[ 14.154335] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input10
[ 14.154895] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input11
[ 14.155470] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input12
[ 14.155513] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input13
[ 14.160505] init: Failed to obtain startpar-bridge instance: Unknown parameter: INSTANCE
[ 14.427308] media: Linux media interface: v0.10
[ 14.476952] FS-Cache: Loaded
[ 14.480916] RPC: Registered named UNIX socket transport module.
[ 14.480917] RPC: Registered udp transport module.
[ 14.480918] RPC: Registered tcp transport module.
[ 14.480919] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 14.485782] FS-Cache: Netfs 'nfs' registered for caching
[ 14.491110] Linux video capture interface: v2.00
[ 14.491504] usbcore: registered new interface driver snd-usb-audio
[ 14.491717] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[ 14.497181] init: failsafe main process (706) killed by TERM signal
[ 14.861866] uvcvideo: Found UVC 1.00 device USB 2.0 PC Camera (058f:0362)
[ 14.864888] input: USB 2.0 PC Camera as /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.1/2-1.1:1.0/input/input14
[ 14.864980] usbcore: registered new interface driver uvcvideo
[ 14.864981] USB Video Class driver (1.1.1)
[ 14.873898] Bluetooth: Core ver 2.19
[ 14.873909] NET: Registered protocol family 31
[ 14.873910] Bluetooth: HCI device and connection manager initialized
[ 14.873915] Bluetooth: HCI socket layer initialized
[ 14.873916] Bluetooth: L2CAP socket layer initialized
[ 14.873921] Bluetooth: SCO socket layer initialized
[ 14.875564] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[ 14.875565] Bluetooth: BNEP filters: protocol multicast
[ 14.875569] Bluetooth: BNEP socket layer initialized
[ 14.877592] Bluetooth: RFCOMM TTY layer initialized
[ 14.877596] Bluetooth: RFCOMM socket layer initialized
[ 14.877599] Bluetooth: RFCOMM ver 1.11
[ 15.045405] init: idmapd main process (791) terminated with status 1
[ 15.045412] init: idmapd main process ended, respawning
[ 15.314934] init: cups main process (848) killed by HUP signal
[ 15.314940] init: cups main process ended, respawning
[ 15.784897] nvidia: module license 'NVIDIA' taints kernel.
[ 15.784899] Disabling lock debugging due to kernel taint
[ 15.787643] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 15.790887] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=io+mem,decodes=none:owns=none
[ 15.790947] nvidia-nvlink: Nvlink Core is being initialized, major device number 249
[ 15.790958] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 367.57 Mon Oct 3 20:37:01 PDT 2016
[ 15.795639] [drm] Initialized drm 1.1.0 20060810
[ 15.799973] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 367.57 Mon Oct 3 20:32:57 PDT 2016
[ 15.804224] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 15.816677] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 248
[ 16.509959] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 16.545768] NFSD: starting 90-second grace period (net ffffffff81cd3700)
[ 16.705591] nvidia 0000:01:00.0: irq 44 for MSI/MSI-X
[ 16.706266] systemd-udevd[1183]: failed to execute '/bin/systemctl' '/bin/systemctl start --no-block nvidia-persistenced.service': No such file or directory
[ 17.171930] r8169 0000:03:00.0 eth0: link down
[ 17.171967] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 17.238625] r8169 0000:03:00.0 eth0: link down
[ 17.407393] NVRM: Your system is not currently configured to drive a VGA console
[ 17.407395] NVRM: on the primary VGA device. The NVIDIA Linux graphics driver
[ 17.407396] NVRM: requires the use of a text-mode VGA console. Use of other console
[ 17.407397] NVRM: drivers including, but not limited to, vesafb, may result in
[ 17.407398] NVRM: corruption and stability problems, and is not supported.
[ 17.911045] systemd-udevd[1261]: failed to execute '/bin/systemctl' '/bin/systemctl stop --no-block nvidia-persistenced': No such file or directory
[ 18.065184] init: samba-ad-dc main process (1223) terminated with status 1
[ 18.298906] init: plymouth-splash main process (1304) terminated with status 1
[ 18.324146] init: nvidia-prime main process (1313) terminated with status 127
[ 19.278272] r8169 0000:03:00.0 eth0: link up
[ 19.278279] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 39.636192] nvidia 0000:01:00.0: irq 44 for MSI/MSI-X
[ 40.190429] nvidia-modeset: Allocated GPU:0 (GPU-1ff6e59a-3b28-4d4f-63ee-eaea8ed6a819) @ PCI:0000:01:00.0
[ 43.774744] cgroup: systemd-logind (905) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
[ 43.774746] cgroup: "memory" requires setting use_hierarchy to 1 on the root
[ 45.106907] audit_printk_skb: 171 callbacks suppressed
[ 45.106909] audit: type=1400 audit(1500106441.973:69): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/cups/backend/cups-pdf" pid=1866 comm="apparmor_parser"
[ 45.106914] audit: type=1400 audit(1500106441.973:70): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/sbin/cupsd" pid=1866 comm="apparmor_parser"
[ 45.107141] audit: type=1400 audit(1500106441.973:71): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/sbin/cupsd" pid=1866 comm="apparmor_parser"
[ 45.563825] aufs 3.x-rcN-20140707
[ 45.771065] Bridge firewalling registered
[ 45.772958] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[ 45.774476] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
[ 45.876721] ip_tables: (C) 2000-2006 Netfilter Core Team
[ 48.115249] audit: type=1400 audit(1500106444.979:72): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="docker-default" pid=1990 comm="apparmor_parser"
[ 157.750181] uvcvideo: Failed to query (GET_DEF) UVC control 13 on unit 1: -32 (exp. 8).
[ 172.758761] systemd-hostnamed[3074]: Warning: nss-myhostname is not installed. Changing the local hostname might make it unresolveable. Please install nss-myhostname!
[ 371.842601] nvidia-modeset: Freed GPU:0 (GPU-1ff6e59a-3b28-4d4f-63ee-eaea8ed6a819) @ PCI:0000:01:00.0
[ 371.908003] nvidia 0000:01:00.0: irq 44 for MSI/MSI-X
[ 372.456433] nvidia-modeset: Allocated GPU:0 (GPU-1ff6e59a-3b28-4d4f-63ee-eaea8ed6a819) @ PCI:0000:01:00.0
[ 372.560799] sound hdaudioC1D0: HDMI: invalid ELD data byte 3
[ 388.402764] sogou-qimpanel[3887]: segfault at 0 ip 00007f38018fed3c sp 00007ffd18f95728 error 4 in libc-2.19.so[7f380187c000+1ba000]
[ 633.020031] NVRM: GPU at PCI:0000:01:00: GPU-1ff6e59a-3b28-4d4f-63ee-eaea8ed6a819
[ 633.020035] NVRM: GPU Board Serial Number:
[ 633.020037] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 633.020037]
[ 633.020039] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[ 633.020040] NVRM: GPU is on Board .
[ 633.020046] NVRM: A GPU crash dump has been created. If possible, please run
[ 633.020046] NVRM: nvidia-bug-report.sh as root to collect this data before
[ 633.020046] NVRM: the NVIDIA kernel module is unloaded.

Hello,

I am running into a similar problem and am not sure what causes it. Please advise whether this is a known problem, and if a fix is available, please point me to the relevant page or URL.

nvidia-smi
Unable to determine the device handle for GPU 0000:03:00.0: GPU is lost. Reboot the system to recover this GPU

dmesg | tail -n 10
nvidia-modeset: Allocated GPU:0 (GPU-3c93a8a7-6d53-9c02-1f2a-4484502ae941) @ PCI:0000:03:00.0
NVRM: GPU at PCI:0000:03:00: GPU-3c93a8a7-6d53-9c02-1f2a-4484502ae941
NVRM: GPU Board Serial Number: XXXXXXXXXXXXX
NVRM: Xid (PCI:0000:03:00): 79, GPU has fallen off the bus.

NVRM: GPU at 0000:03:00.0 has fallen off the bus.
NVRM: GPU is on Board 0324610097302.
NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.


CentOS release 6.4 (Final)
2.6.32-358.11.1.el6.x86_64
Quadro 4000, driver version 375.26 (taken from historical records, since nvidia-smi is not working at the moment)

This is often connected to overheating or a failing power supply.
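
To check whether a lost GPU shows the same signature as in the logs posted above, you can search the kernel log for the Xid line. Here is a minimal Python sketch that pulls the PCI address, the Xid number and the message out of dmesg-style lines. The function name `find_xid_events` and the regular expression are my own illustration, not an NVIDIA tool, and the pattern only assumes the line layout seen in this thread.

```python
import re

# Matches kernel log lines shaped like the ones in this thread, e.g.
#   NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
# (Pattern is an assumption based on those lines only.)
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)(?:, (.*))?")


def find_xid_events(lines):
    """Return a list of (pci_address, xid_number, message) tuples."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            message = (match.group(3) or "").strip()
            events.append((match.group(1), int(match.group(2)), message))
    return events


# Example with two lines copied from the first post.
sample = [
    "[ 633.020037] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.",
    "[ 633.020039] NVRM: GPU at 0000:01:00.0 has fallen off the bus.",
]
for pci, xid, message in find_xid_events(sample):
    print(pci, xid, message)
```

You can feed it the lines of a saved dmesg dump. Only lines carrying an explicit Xid number are reported, so the second sample line is ignored.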

Thank you! That is helpful information. Is there a way to set a temperature threshold (assuming the power supply is not failing)? I am looking for a way to prevent this, or at least to handle this behavior.

I think the temperature thresholds for throttling and shutdown are fixed. You should rather check whether overheating is the cause by running nvidia-smi monitoring in the background, and then fix the underlying reason (dust, airflow).
Otherwise you can only pin the GPU to a lower performance level so that it produces less heat.
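
For the background monitoring, something along these lines might do. It is a rough Python sketch that calls nvidia-smi with its `--query-gpu` and `--format=csv,noheader,nounits` options and logs temperature and power draw. The helper names `parse_readings` and `monitor`, the choice of fields, and the default interval are my own assumptions; adjust them to your setup.

```python
import subprocess
import time

# nvidia-smi query command; the selected fields are just an example.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_readings(csv_text):
    """Turn 'index, temperature, power' rows into a list of dicts."""
    readings = []
    for row in csv_text.splitlines():
        fields = [field.strip() for field in row.split(",")]
        if len(fields) != 3:
            continue  # not a data row, e.g. a "GPU is lost" error message
        try:
            gpu, temp_c = int(fields[0]), int(fields[1])
        except ValueError:
            continue  # index or temperature not reported as a number
        try:
            power_w = float(fields[2])
        except ValueError:
            power_w = None  # some boards do not report power draw
        readings.append({"gpu": gpu, "temp_c": temp_c, "power_w": power_w})
    return readings


def monitor(interval_s=5.0, samples=12):
    """Print timestamped readings `samples` times, `interval_s` apart."""
    for _ in range(samples):
        result = subprocess.run(QUERY, capture_output=True, text=True)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for reading in parse_readings(result.stdout):
            print(stamp, reading)
        time.sleep(interval_s)
```

Calling `monitor()` while the CUDA job runs and redirecting the output to a file gives you a record of the last temperatures seen before the GPU drops off the bus.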

Sure, I will give this a go: I will log the temperature readings at fixed intervals to find out how often the temperature spikes and whether anything is blocking heat dissipation. I will also look up the ideal operating temperature range and set up a trigger that sends an alert when the temperature crosses a set threshold. Thanks for sharing your thoughts.
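
The alert trigger I have in mind could look roughly like this. It is only a sketch in Python: the class name `TemperatureAlarm` is made up, and the 85 / 80 degree limits in the example are placeholders until I find the recommended range for the card. It raises one alert per crossing and re-arms only after the temperature drops back below a lower "clear" value, so a GPU hovering around the limit does not flood the inbox.

```python
class TemperatureAlarm:
    """Edge-triggered temperature alert with hysteresis (illustrative only)."""

    def __init__(self, limit_c, clear_c):
        if clear_c > limit_c:
            raise ValueError("clear_c must not exceed limit_c")
        self.limit_c = limit_c  # alert when a reading reaches this value
        self.clear_c = clear_c  # re-arm once a reading falls to this value
        self._active = set()    # GPUs that are currently in the alarm state

    def update(self, gpu, temp_c):
        """Record one reading; return an alert string on a new crossing."""
        if gpu in self._active:
            if temp_c <= self.clear_c:
                self._active.discard(gpu)  # cooled down, re-arm
            return None
        if temp_c >= self.limit_c:
            self._active.add(gpu)
            return f"GPU {gpu} reached {temp_c} C (limit {self.limit_c} C)"
        return None


# Placeholder limits, not vendor values.
alarm = TemperatureAlarm(limit_c=85, clear_c=80)
for temp in (70, 86, 88, 79, 90):
    alert = alarm.update(0, temp)
    if alert:
        print(alert)  # replace with mail or another notification hook
```

Each periodic temperature reading would be passed to `update`, and whatever it returns would be forwarded to the notification channel.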

Hi, I ran into a similar problem: when I run nvidia-smi, the host is missing one GPU.
When I run dmesg | tail -n 10, it shows:

dmesg | tail -n 10
[1056522.930744] NVRM: RmInitAdapter failed! (0x24:0x65:1088)
[1056522.930783] NVRM: rm_init_adapter failed for device bearing minor number 6
[1056527.198836] NVRM: RmInitAdapter failed! (0x24:0x65:1088)
[1056527.198884] NVRM: rm_init_adapter failed for device bearing minor number 6
[1056531.785549] NVRM: RmInitAdapter failed! (0x24:0x65:1088)
[1056531.785596] NVRM: rm_init_adapter failed for device bearing minor number 6
[1056536.112207] NVRM: RmInitAdapter failed! (0x24:0x65:1088)
[1056536.112241] NVRM: rm_init_adapter failed for device bearing minor number 6
[1056540.668006] NVRM: RmInitAdapter failed! (0x24:0x65:1088)
[1056540.668038] NVRM: rm_init_adapter failed for device bearing minor number 6

The system is Ubuntu 18.04 with CUDA 10.0.130.

I don't know what to do.

Give this a go: https://devtalk.nvidia.com/default/topic/1043952/linux-driver-410-73-gtx-980-nvrm-rminitadapter-failed-/