Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error after executing nvidia-smi

Our company runs an AI model on the NVIDIA GeForce RTX 3060.

We have deployed a lot of units in the field, but 6 of them recently started showing “Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error” after executing nvidia-smi.

This also stops our camera and affects our camera module. After a restart the unit works fine for a day, but within 24 hours the error comes back. We have also seen units work fine for a month after the restart, or two weeks, or a week, but they all end up at 1 day or even 3 hours of working fine before the device has to be rebooted again.

I tried reinstalling the driver, installing a newer driver, and even reinstalling the kernel (headers and generic), but none of it helped.

I checked the temperature on the unit, as suggested in other topics I found, but the unit is currently at 65 degrees and goes to a max of 85 during peak times.

Not sure why this is coming up, but it seems to be happening more and more on both new and older units (all with similar parts and configurations), and I’ve been stressing over this for the past few months with no results.

Current system:
OS: Debian ~20.04.1-Ubuntu SMP x86_64 GNU/Linux
Kernel: 5.15.0-45-generic x86_64
CPU: Intel(R) Core™ i5-10400 CPU @ 2.90GHz
GPU: NVIDIA GeForce RTX 3060
Nvidia driver: 515.105.01 (Usually 515.45.01)
CUDA version: 11.7

nvidia-debugdump -l
Found 1 NVIDIA devices
Device ID: 0
Device name: NVIDIA GeForce RTX 3060
GPU internal ID: GPU-867d3882-6beb-011c-c7cd-aa82c55e1b3e

Log File

nvidia-bug-report.log (1.4 MB)

nvidia-debugdump -z -D
nvmlInit succeeded
Using ALL devices
Dumping all components.
nvdZip_Open(dump.zip) for writing succeeded
System: Dumping component: system_info.
GetCaptureBufferSize succeeded, bufSize: 0x139
GetCaptureBuffer succeeded, bufSize: 0xff
nvdZip_AddFile succeeded
internal_dumpSystemComponent() succeeded
System: Dumping component: error_data.
GetCaptureBufferSize succeeded, bufSize: 0x146
GetCaptureBuffer succeeded, bufSize: 0x10c
nvdZip_AddFile succeeded
internal_dumpSystemComponent() succeeded
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
Device: NVIDIA GeForce RTX 3060 : 0: Dumping component: debug_buffers.
GetCaptureBufferSize succeeded, bufSize: 0x22
ERROR: GetCaptureBuffer failed, Unknown Error, bufSize: 0x22
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
Device: NVIDIA GeForce RTX 3060 : 0: Dumping component: rm.
GetCaptureBufferSize succeeded, bufSize: 0x5783
ERROR: GetCaptureBuffer failed, Unknown Error, bufSize: 0x5783
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
nvdZip_Close() succeeded
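
To compare dumps between units, the ERROR lines above can be tallied with a short script. This is only a rough sketch: the regular expression and the `summarize_errors` helper are made up for illustration and are not part of any NVIDIA tool. Note that 0x3e7 is decimal 999, the same value as NVML_ERROR_UNKNOWN in nvml.h, which would fit the "Unknown Error" text; that nvidia-debugdump reuses NVML codes here is an assumption.

```python
import re
from collections import Counter

# Matches lines such as:
#   ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
# Lines without a "return code" field are ignored.
ERROR_RE = re.compile(
    r"^ERROR: (\w+)(?:\(\))? failed.*?return code: (0x[0-9a-fA-F]+)"
)

# A few lines copied from the dump output above.
SAMPLE = """\
internal_dumpSystemComponent() succeeded
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
ERROR: GetCaptureBuffer failed, Unknown Error, bufSize: 0x22
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
nvdZip_Close() succeeded
"""


def summarize_errors(log_text):
    """Count (function name, numeric return code) pairs on ERROR lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if match:
            counts[(match.group(1), int(match.group(2), 16))] += 1
    return counts


if __name__ == "__main__":
    for (func, code), n in sorted(summarize_errors(SAMPLE).items()):
        print(f"{func}: return code {code} (0x{code:x}) x{n}")
```

Saving the console output of each unit to a file and passing its contents to `summarize_errors` makes it easier to see whether a healthy unit and a failing unit differ only in the GPU-side components.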

The issue is somewhat related to: Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error after executing nvidia-smi

The GPU is turned off; I’d suspect a PSU issue.

I got another debugdump and some errors have cleared. Where did you notice that the GPU is turned off?

I just want to see if I can cross-check the info with another unit that we have live. There’s also another unit that I’ve recovered from the field which had a similar issue, so I want to see if a PSU swap will indeed solve this. Any ideas for other checks that I can do?

nvidia-debugdump -z -D
nvmlInit succeeded
Using ALL devices
Dumping all components.
nvdZip_Open(dump.zip) for writing succeeded
System: Dumping component: system_info.
GetCaptureBufferSize succeeded, bufSize: 0x139
GetCaptureBuffer succeeded, bufSize: 0xff
nvdZip_AddFile succeeded
internal_dumpSystemComponent() succeeded
System: Dumping component: error_data.
GetCaptureBufferSize succeeded, bufSize: 0x1b8f
GetCaptureBuffer succeeded, bufSize: 0x18f1
nvdZip_AddFile succeeded
internal_dumpSystemComponent() succeeded
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
nvdZip_Close() succeeded

The PCI config space of the device is all 0xff, meaning the device is turned off.
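
The same all-0xff check can be run directly on a live unit to cross-check it against a failing one. The following is a minimal sketch, assuming a Linux system that exposes the raw config space at /sys/bus/pci/devices/<address>/config; the helper names are hypothetical, and the address is the one from the error message.

```python
from pathlib import Path

# PCI address taken from the error message; adjust for other units.
GPU_BDF = "0000:01:00.0"


def config_space_all_ff(data: bytes) -> bool:
    """Return True if every byte read from config space is 0xff."""
    return len(data) > 0 and all(b == 0xFF for b in data)


def read_config_space(bdf: str, sysfs_root: str = "/sys/bus/pci/devices") -> bytes:
    """Read the raw PCI config space that Linux exposes through sysfs."""
    return (Path(sysfs_root) / bdf / "config").read_bytes()


if __name__ == "__main__":
    try:
        data = read_config_space(GPU_BDF)
    except OSError as exc:
        print(f"could not read config space for {GPU_BDF}: {exc}")
    else:
        if config_space_all_ff(data):
            print(f"{GPU_BDF}: config space is all 0xff, device is not responding")
        else:
            # The first two bytes of config space hold the vendor ID (little-endian).
            vendor_id = int.from_bytes(data[0:2], "little")
            print(f"{GPU_BDF}: device responds, vendor ID 0x{vendor_id:04x}")
```

On a healthy unit the vendor ID should read 0x10de (NVIDIA). Running this from a cron job or watchdog and logging the timestamp of the first all-0xff read gives a data point to compare before and after a PSU swap.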