I have an RTX 4060 Ti 16 GB set up as an eGPU for my laptop, which has a built-in RTX 3050 Mobile. My computer keeps losing GPUs while running tasks. When that happens, running nvidia-smi prints this error:
Unable to determine the device handle for GPU0000:06:00.0: Unknown Error
A reboot fixes it temporarily, but sometimes the problem comes back every few hours, which is very frustrating.
$ lspci | grep -i VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GA107BM [GeForce RTX 3050 Mobile] (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation Device 2805 (rev ff)
The driver version is 535-open.
Output of nvidia-smi when both GPUs are up:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3050 ...    Off | 00000000:01:00.0  On |                  N/A |
| N/A   33C    P8               4W /  60W |   1039MiB /  4096MiB |     11%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4060 Ti     Off | 00000000:06:00.0 Off |                  N/A |
|  0%   34C    P8               6W / 165W |     18MiB / 16380MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
OS: Kubuntu 22.04
Here is the nvidia-bug-report.log:
nvidia-bug-report.log.gz (269.2 KB)
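To pin down exactly when the GPU drops (so the timestamp can be matched against kernel logs), a small watcher along these lines might help. This is only a rough sketch, not something taken from the bug report: the names `find_lost_gpus`, `poll_once`, and `watch`, the 30-second interval, and the assumption that the error text always matches the line quoted above are all illustrative.

```python
import re
import subprocess
import time
from datetime import datetime

# Matches the error line quoted in this post, e.g.
# "Unable to determine the device handle for GPU0000:06:00.0: Unknown Error"
# Assumption: other driver versions may word this differently.
LOST_GPU_RE = re.compile(
    r"Unable to determine the device handle for GPU\s*"
    r"([0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])"
)


def find_lost_gpus(smi_output: str) -> list[str]:
    """Return the PCI bus IDs that nvidia-smi reported as unreachable."""
    return LOST_GPU_RE.findall(smi_output)


def poll_once() -> list[str]:
    """Run nvidia-smi once and return any lost bus IDs found in its output."""
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    # The error may land on stdout or stderr, so scan both.
    return find_lost_gpus(result.stdout + result.stderr)


def watch(interval_s: float = 30.0) -> None:
    """Poll until a GPU is reported lost, then print a timestamp and stop."""
    while True:
        lost = poll_once()
        if lost:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} lost GPU(s): {', '.join(lost)}")
            return
        time.sleep(interval_s)


if __name__ == "__main__":
    watch()
```

Left running in a terminal, it prints one timestamped line the first time the error appears; that time can then be compared with the output of `dmesg -T` or `journalctl -k` around the same moment.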