GPU stops responding after some time and nvidia-smi reports that no device was found

Problem:
The GPU stops responding after some time, and nvidia-smi then reports that no device was found. It looks like the GPU could not wake up from a suspended state.

Steps to reproduce:

  1. Make sure the GPU is not used by any process (nvidia-smi then takes a few seconds to execute)
  2. Run nvidia-smi in a loop
  3. Wait
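
For anyone trying to reproduce this, the steps above can be sketched as a small Python loop. This is only an illustration, not the exact script used for the report: the function names, the 10-second interval and the 60-second timeout are assumptions, and the loop treats any non-zero exit code from nvidia-smi as the failure.

```python
import subprocess
import time


def run_nvidia_smi(timeout_s=60):
    """Run nvidia-smi once and return (exit_code, combined_output)."""
    try:
        proc = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Placeholder code for "the call hung"; not a real nvidia-smi exit code.
        return 124, "nvidia-smi timed out"
    except FileNotFoundError:
        return 127, "nvidia-smi not found in PATH"
    return proc.returncode, proc.stdout


def poll_until_failure(run=run_nvidia_smi, interval_s=10.0,
                       max_polls=None, sleep=time.sleep):
    """Call `run` repeatedly until it reports a non-zero exit code.

    Returns (poll_count, output) for the first failing call, or None if
    max_polls is reached without a failure.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        code, output = run()
        if code != 0:
            return polls, output
        sleep(interval_s)
    return None
```

Calling `poll_until_failure()` with the defaults keeps polling until the first failing call and returns its output, which can then be saved next to the other dumps. The `run` and `sleep` parameters exist only so the loop can be exercised without a GPU.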

The problem occurs after a random period of time, from a few minutes to a few hours. It happens on one specific system:

CPU-Module: COMExpress T6 compact
CPU: i7-1185GRE
Linux kernel: 5.15.26
Nvidia driver: 470.63.01

With compact graphics cards. Two models were tested:

  • Gainward RTX 3060 12 GB Pegasus
  • MSI RTX 3060 12 GB AERO ITX OC

Full-size cards are most likely not affected (unable to reproduce on an Asus 3060 Ti).

On another system, the compact cards mentioned above most likely work fine (unable to reproduce with the same two cards on another system).

Attached archive contains:

  • cpuinfo.log: /proc/cpuinfo dump
  • passX.before_failure.zip: archive with dumps before failure
  • passX.after_failure.zip: archive with dumps after failure

Each archive contains:

  • acpi.zip: acpidump output
  • nvidia-bug-report.log.gz
  • lspci.log: lspci -vvv dump
  • dmesg.log: dmesg dump
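
A rough sketch of how dumps like these could be collected in one go, for anyone who wants to compare before/after states on their own system. The helper names and the `acpidump.log` file name are hypothetical (the attached archives store the acpidump output as acpi.zip), and `dmesg` and `acpidump` usually need root. The nvidia-bug-report.log.gz file is produced separately by the driver's nvidia-bug-report.sh script and is not covered here.

```python
import subprocess
from pathlib import Path

# Output file name -> command that produces its content.
DUMP_COMMANDS = {
    "lspci.log": ["lspci", "-vvv"],
    "dmesg.log": ["dmesg"],
    "acpidump.log": ["acpidump"],  # hypothetical file name
}


def capture(cmd):
    """Run a command and return its combined stdout/stderr as text."""
    try:
        proc = subprocess.run(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
    except FileNotFoundError:
        return "command not found: " + cmd[0] + "\n"
    return proc.stdout


def collect_dumps(out_dir, commands=DUMP_COMMANDS, capture=capture):
    """Write one file per command into out_dir and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, cmd in commands.items():
        path = out / name
        path.write_text(capture(cmd))
        written.append(path)
    return written
```

Running `collect_dumps("before_failure")` once while the GPU still responds and `collect_dumps("after_failure")` once it has disappeared gives two directories that can be diffed file by file.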

Found relations:

  • The problem does not happen if any process is using the GPU
  • The problem does not happen when the system is completely idle

report_data.zip (13.2 MB)

This post is probably related to the problem (similar behavior, but a completely different setup).

Please enable nvidia-persistenced to start on boot, make sure it is continuously running, and check whether that resolves the issue.
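
One way to check that the daemon is actually doing its job is to ask nvidia-smi for the persistence mode of each GPU. The sketch below assumes the `persistence_mode` field of `nvidia-smi --query-gpu` prints `Enabled` or `Disabled` per GPU; the function names are made up for illustration. On systemd distributions the daemon is typically enabled with `systemctl enable --now nvidia-persistenced`, assuming the driver package installed a unit with that name.

```python
import subprocess


def parse_persistence_modes(csv_output):
    """Return one bool per GPU line: True where the line reads 'Enabled'."""
    return [
        line.strip() == "Enabled"
        for line in csv_output.splitlines()
        if line.strip()
    ]


def query_persistence_modes():
    """Query nvidia-smi for the persistence mode of every visible GPU."""
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=persistence_mode", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=True,  # raises CalledProcessError if nvidia-smi fails
    )
    return parse_persistence_modes(proc.stdout)
```

If `query_persistence_modes()` returns `False` for a GPU while the daemon is supposed to be running, the daemon is probably not active for that device.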

It’s a workaround, but it does not solve the problem that the GPU may not be initialized correctly.

It’s not a workaround but the supported configuration.
You’re running
turn gpu on-off-on-off-on-off-on-off-on-off-on-off-on-off-on-off-on-off-on-off-on-off…
whoa, it breaks. Congratulations.