RTX 3090: GPU has fallen off the bus (only Linux, on Windows everything is fine)

Dear nVidia-Community,

Since I installed Linux Mint last summer next to my Windows installation on my desktop PC, I have been using it mostly as my daily driver, and I would be pretty happy with it if it were not for one thing: I have experienced freezes from the beginning. They only occur on Mint, not on the Windows 11 installed alongside it. Sometimes I get through a complete day without a freeze, sometimes it happens 2-3 times a day.
When the freeze occurs, the system doesn’t react to anything: no switching with CTRL+ALT+Fx, no CTRL+ALT+DEL… I have to switch the PC off completely via the power button. If music is playing in the background when the freeze occurs (sometimes it is, sometimes it isn’t), it keeps playing in an endless loop of a ~1 second portion of the song.

I connected from another machine to my desktop PC via SSH and left “dmesg --follow” running. At the moment the desktop dies again, I see the following output from dmesg:

[16257.752527] NVRM: GPU at PCI:0000:02:00: GPU-a47ca1a7-7995-cc50-707f-d504155773f0
[16257.752540] NVRM: Xid (PCI:0000:02:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[16257.752544] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
[16257.802517] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
[16299.025739] userif-3: sent link down event.
[16299.025751] userif-3: sent link up event.

Running nvidia-bug-report.sh before power-cycling the system is not possible because the SSH session also dies a few seconds later.
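
Since the bug report script cannot be run in time, one option is to pull the Xid lines out of whatever kernel log survives. The following is only a minimal sketch, not an NVIDIA tool: xid_events is a hypothetical helper name, and it assumes the log lines have the same shape as the dmesg output above.

```shell
# Print "<xid-code> <pci-address>" for every NVRM Xid line read from stdin.
# Hypothetical helper; assumes the line format shown in the dmesg output above.
xid_events() {
    sed -n 's/.*NVRM: Xid (PCI:\([^)]*\)): \([0-9]*\),.*/\2 \1/p'
}
```

On the second machine it could be fed from the SSH stream, for example "ssh desktop dmesg --follow | tee gpu.log | xid_events" (desktop is a placeholder host name, and dmesg may need sudo). After the reboot, "journalctl -k -b -1 | xid_events" might also work, assuming the journal is persistent and the lines were written to disk before the freeze.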

There is no obvious activity that triggers the freeze; it occurs in several situations, including:

  • switching from one Window to the other, e.g. from Thunderbird to Chromium
  • scrolling through a website, e.g. through Amazon on Firefox
  • working in an RDP-session

So nothing where the system would be under load. On the contrary: sometimes I can play a Steam game for six hours or render a movie in HandBrake without problems, with the fans working pretty hard, and everything seems fine.
Freezes seem to happen mostly in “banal” situations.

What I have already tried during the past months:

  • Updating UEFI
  • uninstalling the NVIDIA driver and using the nouveau driver (freezes also occurred, so I reinstalled it)
  • using NVIDIA 525 and 470 instead of 535 (freezes occur with all 3, so I’m using 535 again now)
  • switching to 6.5 Kernel (makes no difference, but I’m currently still using it)
  • Checking GPU Temperature via NVIDIA Settings Application and other Temperatures with “sensors” in Terminal (everything fine)

I found some threads in this forum regarding the same issue where replies go in the direction of “Hardware issue” or “power supply problem”.
But I don’t see how it could be a hardware problem, because on Windows the issue does not occur at all, only on Linux.

Any help on how to debug and solve this issue would be very much appreciated!

Thanks in advance for your assistance and best regards
Ben

Current config (if you need more information please let me know):

inxi -Fxzd


System:
  Kernel: 6.5.0-14-generic x86_64 bits: 64 compiler: N/A
    Desktop: Cinnamon 6.0.4 Distro: Linux Mint 21.3 Virginia
    base: Ubuntu 22.04 jammy
Machine:
  Type: Desktop System: Alienware product: Alienware Aurora R12 v: 1.1.23
    serial: <superuser required>
  Mobo: Alienware model: 0P0JWX v: A00 serial: <superuser required>
    UEFI: Alienware v: 1.1.23 date: 11/08/2023
CPU:
  Info: 8-core model: 11th Gen Intel Core i9-11900KF bits: 64 type: MT MCP
    arch: Rocket Lake rev: 1 cache: L1: 640 KiB L2: 4 MiB L3: 16 MiB
  Speed (MHz): avg: 827 high: 933 min/max: 800/5100:5300 cores: 1: 800
    2: 800 3: 800 4: 925 5: 800 6: 800 7: 800 8: 800 9: 933 10: 923 11: 800
    12: 800 13: 862 14: 800 15: 800 16: 800 bogomips: 112128
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: NVIDIA GA102 [GeForce RTX 3090] vendor: Dell driver: nvidia
    v: 535.146.02 bus-ID: 02:00.0
  Device-2: Microsoft LifeCam HD-5000 type: USB
    driver: snd-usb-audio,uvcvideo bus-ID: 1-4.1:4
  Display: x11 server: X.Org v: 1.21.1.4 driver: X: loaded: nvidia
    unloaded: fbdev,modesetting,nouveau,vesa gpu: nvidia
    resolution: 3840x1600~60Hz
  OpenGL: renderer: NVIDIA GeForce RTX 3090/PCIe/SSE2
    v: 4.6.0 NVIDIA 535.146.02 direct render: Yes
Audio:
  Device-1: Intel vendor: Dell driver: snd_hda_intel v: kernel
    bus-ID: 00:1f.3
  Device-2: NVIDIA GA102 High Definition Audio vendor: Dell
    driver: snd_hda_intel v: kernel bus-ID: 02:00.1
  Device-3: Microsoft LifeCam HD-5000 type: USB
    driver: snd-usb-audio,uvcvideo bus-ID: 1-4.1:4
  Device-4: GN Netcom Jabra Link 380 type: USB
    driver: jabra,snd-usb-audio,usbhid bus-ID: 1-4.4.1:12
  Sound Server-1: ALSA v: k6.5.0-14-generic running: yes
  Sound Server-2: PulseAudio v: 15.99.1 running: yes
  Sound Server-3: PipeWire v: 0.3.48 running: yes
Network:
  Device-1: Intel Comet Lake PCH CNVi WiFi vendor: Rivet Networks
    driver: iwlwifi v: kernel bus-ID: 00:14.3
  IF: wlo1 state: down mac: <filter>
  Device-2: Realtek Killer E3000 2.5GbE vendor: Dell driver: r8169
    v: kernel port: 3000 bus-ID: 04:00.0
  IF: enp4s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
  IF-ID-1: vmnet1 state: unknown speed: N/A duplex: N/A mac: <filter>
  IF-ID-2: vmnet8 state: unknown speed: N/A duplex: N/A mac: <filter>
Bluetooth:
  Device-1: Intel AX201 Bluetooth type: USB driver: btusb v: 0.8
    bus-ID: 1-14:9
  Report: hciconfig ID: hci0 rfk-id: 0 state: up address: <filter>
    bt-v: 3.0 lmp-v: 5.2
Drives:
  Local Storage: total: 14.6 TiB used: 7.11 TiB (48.7%)
  ID-1: /dev/nvme0n1 vendor: Samsung model: PM981a NVMe 2048GB
    size: 1.86 TiB temp: 37.9 C
  ID-2: /dev/sda vendor: Seagate model: ST2000DM008-2FR102 size: 1.82 TiB
  ID-3: /dev/sdb vendor: Samsung model: SSD 870 QVO 8TB size: 7.28 TiB
  ID-4: /dev/sdc vendor: Samsung model: SSD 870 QVO 4TB size: 3.64 TiB
  Message: No optical or floppy data found.
Partition:
  ID-1: / size: 3.58 TiB used: 441.55 GiB (12.0%) fs: ext4 dev: /dev/sdc3
  ID-2: /boot/efi size: 146 MiB used: 92 MiB (63.0%) fs: vfat
    dev: /dev/nvme0n1p1
Swap:
  ID-1: swap-1 type: file size: 2 GiB used: 0 KiB (0.0%) file: /swapfile
Sensors:
  System Temperatures: cpu: 48.0 C pch: 49.0 C mobo: N/A gpu: nvidia
    temp: 62 C
  Fan Speeds (RPM): N/A gpu: nvidia fan: 38%
Info:
  Processes: 387 Uptime: 36m Memory: 125.47 GiB used: 3.78 GiB (3.0%)
  Init: systemd runlevel: 5 Compilers: gcc: 11.4.0 Packages: 2934 Shell: Bash
  v: 5.1.16 inxi: 3.3.13

Please try reseating the GPU in its slot and make sure the power connectors are properly seated (remove and reconnect them).

Thanks a lot for your answer! I did that and unfortunately it did not change anything.

But just out of curiosity: how could it have changed anything when the card works perfectly fine when I boot Windows instead of Linux?

Linux has more aggressive clocking, so flaws in the power supply/connection surface earlier.
You might check whether limiting the clocks makes things more stable:
nvidia-smi -lgc 300,1200
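
For completeness, a small sketch of how the lock could be applied and undone again. The -lgc and -rgc options are the nvidia-smi switches for locking and resetting GPU clocks; the function names and the NVSMI variable are only illustrative assumptions, added so the commands can be dry-run with echo first.

```shell
# NVSMI is a hypothetical override hook: set NVSMI=echo for a dry run,
# or NVSMI="sudo nvidia-smi" since changing clocks needs root.
NVSMI="${NVSMI:-nvidia-smi}"

# Lock the graphics clock to the range <min>,<max> (MHz).
lock_gpu_clocks() {
    min="$1"; max="$2"
    if [ "$min" -gt "$max" ]; then
        echo "min clock ($min) must not exceed max clock ($max)" >&2
        return 1
    fi
    $NVSMI -lgc "$min,$max"
}

# Return to the driver's default clock behaviour.
reset_gpu_clocks() {
    $NVSMI -rgc
}
```

With NVSMI=echo, "lock_gpu_clocks 300 1200" just prints the arguments it would pass; without it, the call is equivalent to the command above.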

Again thanks a lot for your help, I really appreciate that!
Forgive me another stupid question… what exactly did those two values change?
I performed a nvidia-smi -q -d CLOCK before


==============NVSMI LOG==============

Timestamp                                 : Mon Feb 19 12:42:06 2024
Driver Version                            : 535.154.05
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:02:00.0
   Clocks
       Graphics                          : 210 MHz
       SM                                : 210 MHz
       Memory                            : 405 MHz
       Video                             : 555 MHz
   Applications Clocks
       Graphics                          : N/A
       Memory                            : N/A
   Default Applications Clocks
       Graphics                          : N/A
       Memory                            : N/A
   Deferred Clocks
       Memory                            : N/A
   Max Clocks
       Graphics                          : 2100 MHz
       SM                                : 2100 MHz
       Memory                            : 9751 MHz
       Video                             : 1950 MHz
   Max Customer Boost Clocks
       Graphics                          : N/A
   SM Clock Samples
       Duration                          : Not Found
       Number of Samples                 : Not Found
       Max                               : Not Found
       Min                               : Not Found
       Avg                               : Not Found
   Memory Clock Samples
       Duration                          : Not Found
       Number of Samples                 : Not Found
       Max                               : Not Found
       Min                               : Not Found
       Avg                               : Not Found
   Clock Policy
       Auto Boost                        : N/A
       Auto Boost Default                : N/A

and after

==============NVSMI LOG==============

Timestamp                                 : Mon Feb 19 12:44:06 2024
Driver Version                            : 535.154.05
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:02:00.0
    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 9751 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    SM Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Memory Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A

So the 300 seems to be Clocks\Graphics, which was 210 MHz and is now 300 MHz.
But what is the 1200? I don’t see it in the “after” output.

And another question: does this setting need to be set after every reboot?

Sorry for asking stupid questions but as you might have guessed I have never messed with those settings before :-)

It’s the maximum clock to be used when under load. Can be used to avoid boost situations to identify possible power issues.

Yes. Though it shouldn’t be used on a regular basis, only to identify issues.
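
To see whether the 1200 MHz ceiling is respected, the current graphics clock would have to be read while the GPU is under load; at idle the log only shows the 300 MHz floor, as in the “after” output above. A rough sketch for picking that value out of the log (current_graphics_clock is a hypothetical helper name, and it assumes the log layout shown above):

```shell
# Read "nvidia-smi -q -d CLOCK" output on stdin and print the current
# graphics clock in MHz (the Graphics entry under the plain "Clocks" header).
current_graphics_clock() {
    awk '
        /^ *Clocks *$/ { in_clocks = 1; next }            # skip to the "Clocks" section
        in_clocks && /Graphics/ { print $(NF-1); exit }   # value sits before "MHz"
    '
}
```

Usage would be "nvidia-smi -q -d CLOCK | current_graphics_clock", run once at idle and once while a game or render is running.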

Unfortunately the freezes still occur…

But it seems that when I set the PowerMizer setting in NVIDIA Settings to “Prefer Maximum Performance”, everything is fine. At least, I haven’t seen a freeze with that setting yet. Unfortunately I have to set it manually after every reboot. Is it possible to make it the default?

You would need to save the config and then create an autostart item running nvidia-settings --load-config-only after login.

There is an easier way… I created an autostart item in Mint with 60 secs delay:
/usr/bin/nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"

Works fine and since setting it I haven’t had a crash for nearly a week now so I am mildly optimistic that this solved the freezes (though I still find it strange).
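
For anyone who prefers a file over the Startup Applications dialog, here is a sketch of what such an autostart entry could look like. The function name and the file name nvidia-powermizer.desktop are made up for illustration, and the X-GNOME-Autostart-Delay key is an assumption about how Cinnamon stores the 60-second delay, so please check it against an entry created through the dialog.

```shell
# Write an XDG autostart entry that sets PowerMizer to
# "Prefer Maximum Performance" after login.
# File name and delay key are illustrative assumptions.
write_powermizer_autostart() {
    dir="${1:-$HOME/.config/autostart}"
    mkdir -p "$dir"
    cat > "$dir/nvidia-powermizer.desktop" <<'EOF'
[Desktop Entry]
Type=Application
Name=NVIDIA PowerMizer: Prefer Maximum Performance
Exec=/usr/bin/nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"
X-GNOME-Autostart-Delay=60
EOF
}
```

Calling write_powermizer_autostart without arguments writes into ~/.config/autostart; passing a directory writes the file there instead, which is handy for inspecting it first.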