At first, nvidia-smi worked fine. Here is the earlier output:
root@isysresearch:~/notebooks# nvidia-smi
Thu Feb 22 06:35:17 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2080 Ti Off | 00000000:A1:00.0 On | N/A |
| 0% 46C P8 7W / 300W | 336MiB / 11264MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10 Off | 00000000:C1:00.0 Off | 0 |
| 0% 41C P8 9W / 150W | 16MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1514 G /usr/lib/xorg/Xorg 61MiB |
| 0 N/A N/A 1755 C+G ...libexec/gnome-remote-desktop-daemon 156MiB |
| 0 N/A N/A 1795 G /usr/bin/gnome-shell 109MiB |
| 0 N/A N/A 2491 G /opt/teamviewer/tv_bin/TeamViewer 2MiB |
| 1 N/A N/A 1514 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
Then I ran:
import cupy as cp
x_gpu = cp.array([1,2,3])
print(x_gpu)
It returned the expected output:
array([1,2,3])
But after that, the code started failing with a CUDA runtime error, and nvidia-smi now fails with the error from the title:
Unable to determine the device handle for GPU0000:C1:00.0: Unknown Error
Here is the log file:
nvidia-bug-report.log.gz (252.4 KB)
Rebooting the PC fixed this, but when I run that code again (I’m not sure the code is what causes it), the error comes back. I’m using Ubuntu 22.04 LTS, btw.
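One way to narrow this down (a sketch, not a confirmed fix): by default CUDA enumerates devices fastest-first, while nvidia-smi lists them by PCI bus ID, so the `cp.array` call above may not have run on the GPU that nvidia-smi calls 0. The script below repeats the same tiny test on each CUDA device and prints which PCI bus ID each CuPy device index maps to. `probe_devices` is a hypothetical helper name, and the script assumes a recent CuPy release that exposes `Device.pci_bus_id`. Running it may reproduce the failure.

```python
# Hypothetical diagnostic sketch: run the same small CuPy test on each CUDA
# device and report the PCI bus ID that each CuPy device index maps to.
# Assumes a recent CuPy release that exposes cupy.cuda.Device.pci_bus_id.


def probe_devices():
    """Return a list of (device_index, pci_bus_id, outcome) tuples.

    Returns an empty list when CuPy cannot be imported or no CUDA device
    is reachable, so the script also runs on machines without a GPU.
    """
    try:
        import cupy as cp
    except ImportError:
        return []

    try:
        count = cp.cuda.runtime.getDeviceCount()
    except cp.cuda.runtime.CUDARuntimeError:
        return []

    cuda_errors = (cp.cuda.runtime.CUDARuntimeError, cp.cuda.driver.CUDADriverError)
    results = []
    for idx in range(count):
        device = cp.cuda.Device(idx)
        bus_id = "unknown"
        try:
            bus_id = device.pci_bus_id
            with device:  # make this device current for the allocation below
                x_gpu = cp.array([1, 2, 3])
                outcome = cp.asnumpy(x_gpu).tolist()  # copy the result back to host
                device.synchronize()  # surface any asynchronous CUDA error here
        except cuda_errors as err:
            outcome = f"error: {err}"
        results.append((idx, bus_id, outcome))
    return results


if __name__ == "__main__":
    for idx, bus_id, outcome in probe_devices():
        print(f"CuPy device {idx}: bus {bus_id} -> {outcome}")
```

Comparing the printed bus numbers with the Bus-Id column from nvidia-smi (A1 for the RTX 2080 Ti, C1 for the A10) shows whether the original snippet was running on the A10, which is the GPU named in the error. Launching with the environment variable `CUDA_DEVICE_ORDER=PCI_BUS_ID` makes CUDA's device indices follow the same order as nvidia-smi.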
