Unable to determine device handle (TITAN RTX - Ubuntu 18.04 - NVIDIA Driver 460.32, CUDA 11.2)

This is an error we have been experiencing repeatedly in recent days. We checked for connection issues and could not find any. The code that triggers the issue seems to run fine on other GPU machines. The machine that gets this error has a TITAN RTX and 280 GB of RAM, running Ubuntu 18.04.


Rebooting the computer makes the GPU available again and TensorFlow can find it:

pciBusID: 0000:3b:00.0 name: TITAN RTX computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s

2021-02-24 11:49:18.004118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-02-24 11:49:18.873302: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-24 11:49:18.873352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2021-02-24 11:49:18.873363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2021-02-24 11:49:18.876718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:
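
For completeness, after a reboot the card shows up again both for the driver and for TensorFlow; a quick way to double-check this (assuming TF 2.x on this machine) is:

# Driver side: the TITAN RTX should be listed again
nvidia-smi -L

# TensorFlow side: should print one PhysicalDevice entry for GPU:0
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"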

But the GPU crashes and is lost when memory is allocated on it; afterwards, trying to access the GPU via nvidia-smi gives the following message:

Unable to determine the device handle for GPU 0000:3B:00.0: GPU is lost. Reboot the system to recover this GPU


Temperature logs during a crash, captured with nvidia-smi -q -d TEMPERATURE -l 2 -f temp.log:
==============NVSMI LOG==============
Timestamp : Wed Feb 24 12:04:12 2021
Driver Version : 460.32.03
CUDA Version : 11.2
Attached GPUs : 1
GPU 00000000:3B:00.0
Temperature
GPU Current Temp : 32 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
==============NVSMI LOG==============
Timestamp : Wed Feb 24 12:04:14 2021
Driver Version : 460.32.03
CUDA Version : 11.2
Attached GPUs : 1
GPU 00000000:3B:00.0
Temperature
GPU Current Temp : GPU is lost
GPU Shutdown Temp : GPU is lost
GPU Slowdown Temp : GPU is lost
GPU Max Operating Temp : GPU is lost
GPU Target Temperature : GPU is lost
Memory Current Temp : N/A
Memory Max Operating Temp : N/A

Similarly, power logs captured with nvidia-smi -q -d Power -l 2 -f power.log:
==============NVSMI LOG==============
Timestamp : Wed Feb 24 12:04:13 2021
Driver Version : 460.32.03
CUDA Version : 11.2
Attached GPUs : 1
GPU 00000000:3B:00.0
Power Readings
Power Management : Supported
Power Draw : 97.75 W
Power Limit : 280.00 W
Default Power Limit : 280.00 W
Enforced Power Limit : 280.00 W
Min Power Limit : 100.00 W
Max Power Limit : 320.00 W
Power Samples
Duration : 18446744073709.51 sec
Number of Samples : 119
Max : 101.67 W
Min : 57.06 W
Avg : 0.00 W
==============NVSMI LOG==============
Timestamp : Wed Feb 24 12:04:15 2021
Driver Version : 460.32.03
CUDA Version : 11.2
Attached GPUs : 1
GPU 00000000:3B:00.0
Power Readings
Power Management : GPU is lost
Power Draw : GPU is lost
Power Limit : GPU is lost
Default Power Limit : GPU is lost
Enforced Power Limit : GPU is lost
Min Power Limit : GPU is lost
Max Power Limit : GPU is lost
Power Samples
Duration : GPU is lost
Number of Samples : GPU is lost
Max : GPU is lost
Min : GPU is lost
Avg : GPU is lost

The crash doesn't seem to be related to power or temperature issues.

Kernel logs (collected with sudo journalctl -b > today.log) look like this:

Feb 24 12:04:14 knight kernel: nvidia-gpu 0000:3b:00.3: Refused to change power state, currently in D3
Feb 24 12:04:14 knight kernel: xhci_hcd 0000:3b:00.2: Refused to change power state, currently in D3
Feb 24 12:04:14 knight kernel: xhci_hcd 0000:3b:00.2: Refused to change power state, currently in D3
Feb 24 12:04:14 knight kernel: xhci_hcd 0000:3b:00.2: Controller not ready at resume -19
Feb 24 12:04:14 knight kernel: xhci_hcd 0000:3b:00.2: PCI post-resume error -19!
Feb 24 12:04:14 knight kernel: xhci_hcd 0000:3b:00.2: HC died; cleaning up
Feb 24 12:04:14 knight upowerd[1638]: unhandled action 'offline' on /sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.2/usb3
Feb 24 12:04:14 knight kernel: NVRM: GPU at PCI:0000:3b:00: GPU-3d360015-8487-3d44-4a59-d8163e72c2f7
Feb 24 12:04:14 knight kernel: NVRM: GPU Board Serial Number: 1320120027025
Feb 24 12:04:14 knight kernel: NVRM: Xid (PCI:0000:3b:00): 79, pid=0, GPU has fallen off the bus.
Feb 24 12:04:14 knight kernel: NVRM: GPU 0000:3b:00.0: GPU has fallen off the bus.
Feb 24 12:04:14 knight kernel: NVRM: GPU 0000:3b:00.0: GPU is on Board 1320120027025.
Feb 24 12:04:14 knight kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
Feb 24 12:04:15 knight kernel: nvidia-gpu 0000:3b:00.3: i2c timeout error ffffffff
Feb 24 12:04:15 knight kernel: ucsi_ccg 0-0008: i2c_transfer failed -110
Feb 24 12:05:03 knight /usr/lib/gdm3/gdm-x-session[2993]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0218, 0x00006694, 0x00006738)
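
(For reference, the same NVRM/Xid messages can also be watched live while a job is running, which makes it easier to catch the exact moment the card drops off the bus; this assumes journalctl on the machine can read the kernel ring buffer:)

# Follow kernel messages during a run and highlight NVRM/Xid lines
sudo journalctl -k -f | grep -iE 'nvrm|xid'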

The nvidia-bug-report file generated by running nvidia-bug-report.sh is attached.

The first time the GPU crashed, the kernel log was:
Feb 23 21:56:46 knight kernel: NVRM: GPU at PCI:0000:3b:00: GPU-3d360015-8487-3d44-4a59-d8163e72c2f7
Feb 23 21:56:46 knight kernel: NVRM: GPU Board Serial Number: 1320120027025
Feb 23 21:56:46 knight kernel: NVRM: Xid (PCI:0000:3b:00): 79, pid=0, GPU has fallen off the bus.
Feb 23 21:56:46 knight kernel: NVRM: GPU 0000:3b:00.0: GPU has fallen off the bus.
Feb 23 21:56:46 knight kernel: NVRM: GPU 0000:3b:00.0: GPU is on Board 1320120027025.
Feb 23 21:56:46 knight kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
Feb 23 21:56:46 knight kernel: nvidia-gpu 0000:3b:00.3: Refused to change power state, currently in D3
Feb 23 21:56:46 knight kernel: xhci_hcd 0000:3b:00.2: Refused to change power state, currently in D3
Feb 23 21:56:46 knight kernel: xhci_hcd 0000:3b:00.2: Refused to change power state, currently in D3
Feb 23 21:56:46 knight kernel: xhci_hcd 0000:3b:00.2: Controller not ready at resume -19
Feb 23 21:56:46 knight kernel: xhci_hcd 0000:3b:00.2: PCI post-resume error -19!
Feb 23 21:56:46 knight kernel: xhci_hcd 0000:3b:00.2: HC died; cleaning up
Feb 23 21:56:46 knight upowerd[1611]: unhandled action 'offline' on /sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.2/usb3
Feb 23 21:56:47 knight kernel: nvidia-gpu 0000:3b:00.3: i2c timeout error ffffffff
Feb 23 21:56:47 knight kernel: ucsi_ccg 0-0008: i2c_transfer failed -110
Feb 23 21:57:11 knight gnome-shell[1604]: JS ERROR: TypeError: this._trackedWindows.get(...) is undefined
_onWindowActorRemoved@resource:///org/gnome/shell/ui/panel.js:836:9
wrapper@resource:///org/gnome/gjs/modules/_legacy.js:82:22
_initializeUI/<@resource:///org/gnome/shell/ui/main.js:206:9
Feb 23 21:57:11 knight polkitd(authority=local)[1355]: Unregistered Authentication Agent for unix-session:c1 (system bus name :1.31, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)
Feb 23 21:57:11 knight kernel: nvidia-modeset: WARNING: GPU:0: Failure processing EDID for display device DELL U2417H (DP-0).
Feb 23 21:57:11 knight kernel: nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device DELL U2417H (DP-0)
Feb 23 21:57:11 knight kernel: nvidia-modeset: ERROR: GPU:0: Failure reading maximum pixel clock value for display device DP-0.
Feb 23 21:57:11 knight /usr/lib/gdm3/gdm-x-session[1518]: (EE) NVIDIA(GPU-0): Unable to add conservative default mode "nvidia-auto-select".
Feb 23 21:57:11 knight /usr/lib/gdm3/gdm-x-session[1518]: (EE) NVIDIA(GPU-0): Unable to add "nvidia-auto-select" mode to ModePool.
Feb 23 21:57:11 knight /usr/lib/gdm3/gdm-x-session[1518]: (--) NVIDIA(GPU-0): DFP-0: connected
Feb 23 21:57:11 knight /usr/lib/gdm3/gdm-x-session[1518]: (--) NVIDIA(GPU-0): DFP-0: Internal DisplayPort
Feb 23 21:57:11 knight /usr/lib/gdm3/gdm-x-session[1518]: (--) NVIDIA(GPU-0): DFP-0: 100.0 MHz maximum pixel clock
Feb 23 21:57:11 knight /usr/lib/gdm3/gdm-x-session[1518]: (--) NVIDIA(GPU-0):
Feb 23 21:57:11 knight /usr/lib/gdm3/gdm-x-session[1518]: (--) NVIDIA(GPU-0): DFP-1: disconnected
Feb 23 21:57:11 knight /usr/lib/gdm3/gdm-x-session[1518]: (--) NVIDIA(GPU-0): DFP-1: Internal TMDS
Feb 23 21:57:11 knight /usr/lib/gdm3/gdm-x-session[1518]: (--) NVIDIA(GPU-0): DFP-1: 165.0 MHz maximum pixel clock
Feb 23 21:57:11 knight /usr/lib/gdm3/gdm-x-session[1518]: (--) NVIDIA(GPU-0):

Feb 23 21:57:11 knight kernel: nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices


We would love to receive some suggestions or tips on how to fix this, or to know whether it's time to request a replacement from NVIDIA (we've barely had this GPU for a year).

Best,
Bharath

nvidia-bug-report.log (5.1 MB)

Hi!
We just recently had a Xid 79 from another user.
See this thread:

Maybe try the memcheck, test lowering the clocks, and if possible test with another PSU.
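For the clock test, something along these lines should work on a Turing card with the 460 driver (the clock values below are only placeholders, pick a pair from the supported list):

# See which clock pairs the board supports
nvidia-smi -q -d SUPPORTED_CLOCKS

# Keep the setting across driver reloads, then lock the GPU clocks below stock
sudo nvidia-smi -pm 1
sudo nvidia-smi -lgc 300,1350    # placeholder range, pick from the list above

# Optionally also lower the power limit (board default is 280 W, minimum is 100 W)
sudo nvidia-smi -pl 200

# Undo later with: sudo nvidia-smi -rgc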
Also cross your fingers that someone from nvidia looks at the crashdump (though that is like hoping for a lottery win ;-) )