Our company runs an AI model on NVIDIA GeForce RTX 3060 GPUs.
We have deployed many units in the field, but 6 of them recently started failing with the following error after running nvidia-smi: “Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error”
When this happens, our camera module stops working as well. Restarting the unit clears the error, but it returns within 24 hours. On some units the first recurrence came a month, two weeks, or a week after the restart, but all of them have since degraded to running fine for only a day, or even just 3 hours, before the device has to be rebooted again.
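To pin down how long each unit runs before the error returns, a periodic health check could record a timestamped status line. The following is only a minimal sketch: the log path `gpu_health.log` and the function names are hypothetical, and it assumes the error text quoted above appears in the output of nvidia-smi when the GPU is lost.

```python
import subprocess
from datetime import datetime, timezone

# Error text copied from the nvidia-smi failure quoted above.
ERROR_MARKER = "Unable to determine the device handle"


def gpu_lost(smi_output: str) -> bool:
    """Return True if the nvidia-smi output contains the device-handle error."""
    return ERROR_MARKER in smi_output


def check_once(log_path: str = "gpu_health.log") -> bool:
    """Run nvidia-smi once, append a timestamped status line, return True if healthy.

    log_path is a hypothetical location; adjust it for the unit.
    """
    try:
        proc = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
        output = proc.stdout + proc.stderr
        healthy = proc.returncode == 0 and not gpu_lost(output)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # A missing binary or a hung call is also treated as unhealthy.
        healthy = False
    stamp = datetime.now(timezone.utc).isoformat()
    with open(log_path, "a") as log:
        log.write(f"{stamp} {'OK' if healthy else 'FAIL'}\n")
    return healthy
```

Calling `check_once()` every few minutes (for example from cron) would give the time between the last OK and the first FAIL on each unit, which can then be compared against the kernel log for the same moment.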
I tried reinstalling the driver, installing a newer driver, and even reinstalling the kernel (headers and generic packages), but none of this helped.
I checked the temperature on the unit, as suggested in other topics I found, but the unit is currently at 65 degrees and reaches a maximum of 85 during peak times.
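To rule temperature in or out, the readings could be logged continuously rather than checked by hand. The sketch below is one possible way to do that, assuming the `temperature.gpu` query field of nvidia-smi with `csv,noheader,nounits` output (one integer per GPU per line); the function names are hypothetical.

```python
import subprocess


def parse_temperatures(csv_text: str) -> list[int]:
    """Parse nvidia-smi temperature output, one integer reading per line.

    Lines that are not plain integers (for example "[N/A]") are skipped.
    """
    temps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if line.isdigit():
            temps.append(int(line))
    return temps


def read_temperatures() -> list[int]:
    """Query the current GPU temperature(s) through nvidia-smi."""
    proc = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=temperature.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return parse_temperatures(proc.stdout)
```

Logging the result of `read_temperatures()` alongside a timestamp would show whether the failures line up with the 85 degree peaks or also occur at the usual 65.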
I'm not sure why this is happening, but it seems to occur more and more often on both new and older units (all with similar parts and configurations). I have been working on this for the past few months with no results.
Current system:
OS: Debian ~20.04.1-Ubuntu SMP x86_64 GNU/Linux
Kernel: 5.15.0-45-generic x86_64
CPU: Intel(R) Core™ i5-10400 CPU @ 2.90GHz
GPU: NVIDIA GeForce RTX 3060
Nvidia driver: 515.105.01 (Usually 515.45.01)
CUDA version: 11.7
nvidia-debugdump -l
Found 1 NVIDIA devices
Device ID: 0
Device name: NVIDIA GeForce RTX 3060
GPU internal ID: GPU-867d3882-6beb-011c-c7cd-aa82c55e1b3e
Log File
nvidia-bug-report.log (1.4 MB)
nvidia-debugdump -z -D
nvmlInit succeeded
Using ALL devices
Dumping all components.
nvdZip_Open(dump.zip) for writing succeeded
System: Dumping component: system_info.
GetCaptureBufferSize succeeded, bufSize: 0x139
GetCaptureBuffer succeeded, bufSize: 0xff
nvdZip_AddFile succeeded
internal_dumpSystemComponent() succeeded
System: Dumping component: error_data.
GetCaptureBufferSize succeeded, bufSize: 0x146
GetCaptureBuffer succeeded, bufSize: 0x10c
nvdZip_AddFile succeeded
internal_dumpSystemComponent() succeeded
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
Device: NVIDIA GeForce RTX 3060 : 0: Dumping component: debug_buffers.
GetCaptureBufferSize succeeded, bufSize: 0x22
ERROR: GetCaptureBuffer failed, Unknown Error, bufSize: 0x22
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
Device: NVIDIA GeForce RTX 3060 : 0: Dumping component: rm.
GetCaptureBufferSize succeeded, bufSize: 0x5783
ERROR: GetCaptureBuffer failed, Unknown Error, bufSize: 0x5783
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
nvdZip_Close() succeeded
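For reference, the return code 0x3e7 in the output above is 999 in decimal, which is the value of NVML_ERROR_UNKNOWN in nvml.h. Whether nvidia-debugdump passes NVML status codes through unchanged here is an assumption on my part. A small sketch that pulls the failed calls and their decoded codes out of such a dump (the function name is hypothetical):

```python
import re

# Value of NVML_ERROR_UNKNOWN in nvml.h; assumed to be what 0x3e7 refers to.
NVML_ERROR_UNKNOWN = 999

# Matches lines such as:
#   ERROR: internal_getDumpBuffer failed, return code: 0x3e7
#   ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
RETURN_CODE = re.compile(
    r"ERROR: (\w+)(?:\(\))? failed, return code: (0x[0-9a-fA-F]+)"
)


def failed_calls(dump_output: str) -> list[tuple[str, int]]:
    """Return (function name, decimal return code) for each failed call."""
    results = []
    for line in dump_output.splitlines():
        match = RETURN_CODE.search(line)
        if match:
            results.append((match.group(1), int(match.group(2), 16)))
    return results
```

Run against the dump above, every failed call decodes to the same value, so the tool reports only a generic unknown error for the GPU components rather than anything more specific.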
The issue appears to be related to this topic: Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error after executing nvidia-smi