RTX 6000 ADA: Unable to determine the device handle for GPU0000:42:00.0: Unknown Error

Hello,

I have a PC with 6000 ADA GPUs for deep learning experiments. A week ago, while running experiments, one of the GPUs stopped working and its fan went to max speed. From that point on, nvidia-smi printed only:

“Unable to determine the device handle for GPU0000:42:00.0: Unknown Error”

I stopped my code and ran sudo poweroff, but even after that the GPU fan was still running at max, so I had to hold the power button. At the time of that first failure I did not know I should collect a bug report.

This happened again today. This time I ran a bug report, which I’ve uploaded. I tried to fix it by rebooting, but that didn’t work: the fan kept going and the PC never rebooted. Again I had to power it off manually. After powering it off and back on, it works.

I don’t think this is related to overheating. The PC is in a mining rig, so there is good airflow, and the GPUs never get above 82 °C.

Both times it was the same GPU that failed.

These are new GPUs; I’ve only had them a few weeks.

What is happening and how can I prevent this?

Also, is there any way to fix this remotely without needing to manually power it off?

Thanks!

Misc information:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

I have 6x RTX 6000 ADA GPUs. To plug them all into the motherboard I needed to use risers. I have two Be Quiet power supplies, each plugged into a separate circuit. Each power supply powers 3 GPUs, and one of them also powers all other components.

From nvidia-smi: Driver Version: 550.67, CUDA Version: 12.4

OS: Ubuntu 24.04

uname -r
6.8.0-31-generic

nvidia-bug-report.log.gz (2.8 MB)

The logs are flooded with PCIe errors, so the GPUs are falling off the bus. This is simply a case of bad or incapable risers. Please check your BIOS for an option to limit PCIe to Gen 3.
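
If you want to check for this yourself after changing the BIOS setting, here is a rough Python sketch that counts PCIe-related lines in kernel log text. The patterns are assumptions based on commonly seen kernel messages (NVRM Xid lines, the "fallen off the bus" text that accompanies Xid 79, and the "AER:" prefix used by PCIe Advanced Error Reporting); they are not copied from the attached bug report. The function name summarize_pcie_errors and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed patterns for common kernel messages; adjust to what your logs show.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")
OFF_BUS_RE = re.compile(r"fallen off the bus", re.IGNORECASE)
AER_RE = re.compile(r"\bAER:")


def summarize_pcie_errors(lines):
    """Count PCIe-related kernel log lines.

    Returns (counts, xid_by_device), where counts is keyed by error kind and
    xid_by_device is keyed by (pci_address, xid_code).
    """
    counts = Counter()
    xid_by_device = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts["xid"] += 1
            xid_by_device[(match.group(1), int(match.group(2)))] += 1
        if OFF_BUS_RE.search(line):
            counts["fallen_off_bus"] += 1
        if AER_RE.search(line):
            counts["aer"] += 1
    return counts, xid_by_device


# Hypothetical sample lines, for illustration only (not from the real log).
sample = [
    "kernel: NVRM: Xid (PCI:0000:42:00): 79, pid='<unknown>', "
    "name=<unknown>, GPU has fallen off the bus.",
    "kernel: pcieport 0000:40:01.1: AER: Corrected error received: 0000:42:00.0",
    "kernel: usb 1-1: new high-speed USB device number 2 using xhci_hcd",
]

counts, xid_by_device = summarize_pcie_errors(sample)
print(dict(counts))
print(dict(xid_by_device))
```

To use it on a real machine, pass it the lines of saved kernel log output (for example from dmesg or journalctl -k) instead of the sample list. If the counts stay at zero under load after limiting the link speed, that would be consistent with the risers being the cause, though it does not prove it.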

Thank you so much! Will do
