Unable to determine the device handle for GPU 0000:68:00.0: Unknown Error

OS: Ubuntu 16.04LTS
Driver Version: 410.93
GPUs: 4 x GTX 1080 Ti

nvidia-smi

Unable to determine the device handle for GPU 0000:68:00.0: Unknown Error

nvidia-debugdump --list

Found 4 NVIDIA devices
	Device ID:              0
	Device name:            GeForce GTX 1080 Ti
	GPU internal ID:        GPU-cd0a4246-432d-0b6d-4954-36b59c2d435d

	Device ID:              1
	Device name:            GeForce GTX 1080 Ti
	GPU internal ID:        GPU-c8312877-87c2-8e4d-92c7-d34e0da7c997

	Device ID:              2
	Device name:            GeForce GTX 1080 Ti
	GPU internal ID:        GPU-04c2af78-ae03-7905-d4a4-9e576d030cc5

Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x3): Unknown Error

nvidia-bug-report.log.gz (1.98 MB)

The GPU in slot 68 doesn’t respond at all. Please check that the power connectors are properly seated, reseat the card, and test it in another system to rule out a general hardware failure.
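
For anyone else landing here: a minimal sketch of how the failing card can be picked out of the `nvidia-smi` error text, so you know which physical card to reseat. It only parses the error line quoted in this thread and builds an `lspci -vv -s <address>` command for it. The function names (`failing_pci_addresses`, `lspci_command`, `check_gpus`) are my own, not part of any NVIDIA tool, and I'm assuming the error line keeps the wording shown above.

```python
import re
import subprocess

# Matches the nvidia-smi failure line quoted in this thread, e.g.
# "Unable to determine the device handle for GPU 0000:68:00.0: Unknown Error"
ERROR_RE = re.compile(
    r"Unable to determine the device handle for GPU "
    r"([0-9a-fA-F]{4,8}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])"
)


def failing_pci_addresses(text):
    """Return the PCI addresses (domain:bus:device.function) named in error lines."""
    # Assumption: if the domain is printed with 8 hex digits, keep the last 4,
    # which is the form lspci usually prints.
    return [
        f"{domain[-4:]}:{bus}:{dev}.{fn}"
        for domain, bus, dev, fn in ERROR_RE.findall(text)
    ]


def lspci_command(address):
    """Build an lspci call that shows verbose details for one PCI address."""
    return ["lspci", "-vv", "-s", address]


def check_gpus():
    """Run nvidia-smi and report every GPU it could not get a handle for."""
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    except FileNotFoundError:
        print("nvidia-smi not found on PATH")
        return []
    # The error may land on stdout or stderr, so scan both.
    addresses = failing_pci_addresses(result.stdout + result.stderr)
    for address in addresses:
        print("GPU not responding:", address)
        print("Inspect it with:", " ".join(lspci_command(address)))
    return addresses
```

Calling `check_gpus()` on the affected machine prints the address of each card that does not respond. If `lspci` no longer lists that address at all, that would point towards the power or seating problem described above, though it is not proof on its own.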

I’m having the same issue with my system. Can anyone have a look at my bug report?

OS: CentOS 7.6
Driver Version: 460.32
GPUs: 2 x RTX2070

The problem occurs randomly while I am training a network with PyTorch:

RuntimeError: CUDA error: unspecified launch failure

After that, running nvidia-smi gives me the error below. After a restart, everything goes back to normal.

Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

nvidia-debugdump --list

Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error

Here is my bug report
nvidia-bug-report.log.gz (1.0 MB)

Thanks so much!

Your CPU is constantly complaining about overheating, and at some point the GPU that drives the X server crashes with Xid 32, possibly related to system memory (faulty? overheated?). Furthermore, it’s not advisable to train neural networks with CUDA on the same GPU that is driving the X server.
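
If you want to look for Xid events yourself before posting a bug report, here is a rough sketch that scans kernel log text for the driver's `NVRM: Xid` lines and returns the PCI address and Xid number of each one. The helper name `find_xid_events` is hypothetical, the sample line in the comment is made up for illustration (it is not taken from the bug report above), and I'm assuming the usual `NVRM: Xid (PCI:...): <number>, ...` layout of those messages.

```python
import re

# Assumed layout of the driver's kernel log messages (illustrative sample,
# not copied from any bug report in this thread):
#   NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000010 intr 00008000
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_number) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events
```

Pass it the output of `dmesg` or the text of the unpacked nvidia-bug-report.log. Repeated Xid numbers on the same address are worth mentioning when you ask for help.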


I have the same problem as the poster taod_dqc
nvidia-smi

Unable to determine the device handle for GPU 0000:68:00.0: Unknown Error

and
nvidia-debugdump --list

Found 4 NVIDIA devices
        Device ID:              0
        Device name:            TITAN Xp
        GPU internal ID:        0322218016170

        Device ID:              1
        Device name:            TITAN Xp
        GPU internal ID:        0322218014970

        Device ID:              2
        Device name:            TITAN Xp
        GPU internal ID:        0322118181128

Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x3): Unknown Error

Here is my bug report

nvidia-bug-report.log.gz (2.2 MB)

While waiting for support from NVIDIA, I’d suggest a quick check of the card’s power connectors. In my case the cause was a loose connection in one of the two 8-pin connectors. Later, after a long period with a few runs at full load, the card failed completely and I had to RMA it.
