The GPU stops responding after some time, and nvidia-smi reports that no device is found. It looks like the GPU cannot wake up from a suspended state.
Steps to reproduce:
- Make sure the GPU is not used by any process (nvidia-smi takes a few seconds to execute)
- Run nvidia-smi in a loop
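The loop from the steps above can be sketched in Python as follows. This is only an illustration, not the exact script used for the report: the helper name `run_until_failure` and its parameters are hypothetical, and it assumes that nvidia-smi exits with a non-zero code once it can no longer find the device.

```python
import subprocess
import time


def run_until_failure(cmd, delay_s=0.0, max_runs=None):
    """Run cmd repeatedly until it fails.

    Returns (runs, returncode) for the first failing run, or (runs, 0)
    if max_runs is reached without any failure.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        runs += 1
        # Output is discarded; only the exit code matters here.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return runs, result.returncode
        time.sleep(delay_s)
    return runs, 0


# Example (assumes nvidia-smi is on PATH and returns non-zero on failure):
# runs, code = run_until_failure(["nvidia-smi"])
# print(f"nvidia-smi failed on run {runs} with exit code {code}")
```

With `max_runs=None` the loop runs until the first failure, which matches the few-minutes-to-few-hours wait described here.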
The problem occurs after a random period of time, from a few minutes to a few hours. It happens on a specific system:
CPU-Module: COMExpress T6 compact
Linux kernel: 5.15.26
Nvidia driver: 470.63.01
With compact graphics cards. Two models were tested:
- Gainward RTX 3060 12 GB Pegasus
- MSI RTX 3060 12 GB AERO ITX OC
Full-size cards are most likely not affected (unable to reproduce on an Asus 3060 Ti).
On another system the compact cards mentioned above most likely work fine (unable to reproduce with the same two cards there).
Attached archive contains:
- cpuinfo.log: /proc/cpuinfo dump
- passX.before_failure.zip: archive with dumps before failure
- passX.after_failure.zip: archive with dumps after failure
Each archive contains:
- acpi.zip: acpidump output
- lspci.log: lspci -vvv dump
- dmesg.log: dmesg dump
Additional observations:
- The problem does not happen if any process is using the GPU
- The problem does not happen when the system is completely idle
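For anyone who wants to gather comparable dumps on their own system, here is a rough Python sketch. It is not the script used to produce the attached archive: the helper `collect_dumps`, the `DUMP_COMMANDS` table and the output directory name are hypothetical. The acpidump output is left out because the archive stores it as a separate zip.

```python
import subprocess
from pathlib import Path


def collect_dumps(out_dir, commands):
    """Run each command and save its stdout to out_dir/<name>.

    Returns the list of written paths. A command that is not installed
    raises FileNotFoundError.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, cmd in commands.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        path = out / name
        path.write_text(result.stdout)
        written.append(path)
    return written


# File names follow the attached archive; some commands may need root.
DUMP_COMMANDS = {
    "lspci.log": ["lspci", "-vvv"],
    "dmesg.log": ["dmesg"],
    "cpuinfo.log": ["cat", "/proc/cpuinfo"],
}

# Example (hypothetical directory name):
# collect_dumps("pass1.before_failure", DUMP_COMMANDS)
```

Running it once before and once after the failure gives two directories that can be compared file by file.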
report_data.zip (13.2 MB)
This post is probably related to the problem (similar behavior, but a completely different setup).