Unable to determine the device handle for GPU0000:C1:00.0: Unknown Error

At first, nvidia-smi worked fine. Here is the earlier output:

root@isysresearch:~/notebooks# nvidia-smi
Thu Feb 22 06:35:17 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:A1:00.0  On |                  N/A |
|  0%   46C    P8               7W / 300W |    336MiB / 11264MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10                     Off | 00000000:C1:00.0 Off |                    0 |
|  0%   41C    P8               9W / 150W |     16MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1514      G   /usr/lib/xorg/Xorg                           61MiB |
|    0   N/A  N/A      1755    C+G   ...libexec/gnome-remote-desktop-daemon      156MiB |
|    0   N/A  N/A      1795      G   /usr/bin/gnome-shell                        109MiB |
|    0   N/A  N/A      2491      G   /opt/teamviewer/tv_bin/TeamViewer             2MiB |
|    1   N/A  N/A      1514      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

Then I ran:

import cupy as cp

x_gpu = cp.array([1,2,3])
print(x_gpu)

It returned the expected output:

array([1,2,3])

But after that, I got a CUDA runtime error, and nvidia-smi now prints the error from the title:

Unable to determine the device handle for GPU0000:C1:00.0: Unknown Error

Here is the log file:
nvidia-bug-report.log.gz (252.4 KB)

Rebooting the PC fixed this, but when I ran that code again (I’m not sure the code is the cause), the error came back. I’m using Ubuntu 22.04 LTS, by the way.
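When a GPU disappears like this, the kernel log usually records the reason as an NVRM Xid line, so scanning it is a quick first check. The following is a minimal sketch, not an official tool: the helper name find_xid_events and the regular expression are assumptions based on how these lines commonly look in dmesg output, and the exact wording can vary between driver versions.

```python
import re

# Assumed shape of an NVRM Xid line in the kernel log, for example:
#   NVRM: Xid (PCI:0000:c1:00): 79, pid=..., GPU has fallen off the bus.
# The pattern is a best-effort guess and may need adjusting for other drivers.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+),\s*(?P<detail>.*)"
)


def find_xid_events(log_text):
    """Return (pci_bus, xid_code, detail) tuples for every Xid line found."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("bus"), int(match.group("code")), match.group("detail").strip())
            )
    return events
```

To use it, pass in kernel log text captured from dmesg or journalctl -k (both typically need root), then look at the Xid codes reported for the PCI address of the failing card.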

You’re getting Xid 79, “GPU has fallen off the bus”. I’d suspect overheating; please monitor temperatures using nvidia-smi.
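One way to follow the temperature while reproducing the problem is to poll nvidia-smi in its CSV query mode. This is a rough sketch under stated assumptions: the helper names (parse_gpu_csv, query_gpus, hot_gpus) are made up for illustration, and the 85 C default is an arbitrary placeholder, not a limit taken from NVIDIA documentation.

```python
import subprocess

# Fields requested from nvidia-smi's --query-gpu interface.
QUERY_FIELDS = "index,name,temperature.gpu,power.draw"


def _to_float(text):
    """Convert a CSV field to float; nvidia-smi prints values like [N/A] when unavailable."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_csv(csv_text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, name, temp, power = [field.strip() for field in line.split(",")]
        rows.append(
            {
                "index": int(index),
                "name": name,
                "temp_c": _to_float(temp),
                "power_w": _to_float(power),
            }
        )
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)


def hot_gpus(rows, limit_c=85):
    """Return readings at or above limit_c (placeholder threshold, adjust per card)."""
    return [r for r in rows if r["temp_c"] is not None and r["temp_c"] >= limit_c]
```

Calling query_gpus() in a loop with a short sleep while the workload runs shows whether the temperature climbs just before the card drops out. Once the GPU has fallen off the bus, nvidia-smi itself is likely to fail, so the readings taken just before that point are the useful ones.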

I currently don’t have a fan for the A10. Is there a temporary solution, such as underclocking?

Fanless enterprise GPUs must never be used without external cooling; even at the lowest clocks they would overheat.


You are right, it’s caused by overheating. It still happened even after I underclocked the card:

nvidia-smi -i 1 -pl=100
nvidia-smi -i 1 -lgc=200
nvidia-smi -i 1 -lmc=200
nvidia-smi -i 1 -ac=405,210