RTX 3070 Ti falls off the bus on Razer Blade 15 2022

hi all,

I recently installed Linux Mint 21 MATE on a new Razer Blade 15 (2022) laptop.
I installed the package linux-oem-22.04 in order to easily upgrade from kernel 5.15 to 5.17 and solve my wireless issue.
I then installed the latest proprietary NVIDIA driver on it, nvidia-driver-525, precisely version 525.60.11.
Right after installing the NVIDIA driver, I had to wrestle with this nasty “out of memory” error at boot time:

and was able to solve it by following comment 25.
I was then able to successfully boot a fresh 5.17 kernel with the NVIDIA driver.

The problem is that a few minutes after booting up the laptop, the GPU dies.
This happens with the PRIME profile set to on-demand (the default), in which case I can thankfully still use the desktop although any nvidia-related command errors out, and also with the PRIME profile set to nvidia, in which case the system freezes completely and needs a hard reboot.
When trying to connect to an external monitor, the problem happens as soon as I plug in the HDMI cable.

The system logs report the nasty “GPU has fallen off the bus” error, which is often described to be related to power supply issues or thermals.
Power supply should not be the problem since this is an embedded laptop from a reputable brand, not a self-assembled hack job of a desktop with a poor PSU.
Thermals are not to blame either, as this consistently happens a few minutes (say, five) after booting, without any usage whatsoever (temperature around 40 C), definitely not after a heavy computational or gaming session.
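
One way to see exactly when the driver logged the error is to filter the kernel log for that phrase. This is only a minimal sketch: the function name gpu_bus_errors is made up for illustration, and it assumes the log message contains the “fallen off the bus” wording quoted above.

```shell
# Print only the log lines that mention the GPU dropping off the bus.
# Reads log text on stdin, so it works with journalctl, dmesg, or a saved file.
gpu_bus_errors() {
    grep -i 'fallen off the bus'
}

# Example (needs read access to the kernel log):
#   journalctl -k -b | gpu_bus_errors
#   sudo dmesg | gpu_bus_errors
```

The timestamps on the matching lines can then be compared against the boot time to confirm the "few minutes after boot" pattern.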

I read that one could try and set the persistence mode on the GPU to avoid an automatic switch-off by typing:

sudo nvidia-smi -pm 1

and that such command is deprecated and that one should instead enable the systemctl service named nvidia-persistenced.
In my case, the service was already enabled and running even as I was having these issues.
I noticed that the service itself was running with parameter --no-persistence-mode, so I figured that might be the problem and modified the service file to run with --persistence-mode, instead.
That had no effect on the error, and the GPU still “falls off the bus” after a few minutes.
Finally, since I am running with PRIME profile on-demand, I can see that X is successfully loaded on the GPU by running nvidia-smi right as I get the desktop, but before the GPU dies out.
In other words, it’s not like the GPU gets switched off because nothing is using it, say, after having completed some CUDA computations – X is using it!

I also tried, without success:

  • installing linux-oem-22.04b to load kernel 6.0.0;
  • boot option nvidia-drm.modeset=0;
  • boot option pcie_aspm=off;
  • boot options pci=check_enable_amd_mmconf and idle=nomwait;
  • adjusting the clocks with nvidia-smi -lgc 300,1750.
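
When trying boot options like these, it may be worth confirming that each one actually reached the running kernel, since an edit to the GRUB configuration that was never applied leaves the old command line in place. A small sketch, assuming the parameters are space-separated as in /proc/cmdline; the function name has_kernel_param is hypothetical.

```shell
# Succeed if the given parameter appears as a whole word in a kernel
# command line. The command line can be passed as the second argument;
# if it is omitted, /proc/cmdline of the running kernel is read.
has_kernel_param() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   has_kernel_param pcie_aspm=off && echo "pcie_aspm=off is active"
```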

I have read all the “GPU has fallen off the bus” threads I could find, but found no solution.
Any help is appreciated, and I am happy to share any logs with you knowledgeable gurus. Cheers!


nvidia logs at boot, before crash:

$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3070 Ti Laptop GPU (UUID: GPU-d7e3314f-0671-9225-6b48-39bfc97fc3c7)

$ nvidia-smi 
Thu Dec  8 10:13:20 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   43C    P8    10W /  N/A |      5MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1839      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

after crash:

$ nvidia-smi 
Unable to determine the device handle for GPU0000:01:00.0: Unknown Error

further info, after crash:

System:
  Kernel: 5.17.0-1021-oem x86_64 bits: 64 compiler: gcc v: 11.3.0
    Desktop: MATE 1.26.0 info: mate-panel wm: marco 1.26.0 vt: 7
    dm: LightDM 1.30.0 Distro: Linux Mint 21 Vanessa base: Ubuntu 22.04 jammy
Machine:
  Type: Laptop System: Razer product: Blade 15 (2022) - RZ09-0421 v: 8.04
    serial: <superuser required> Chassis: type: 10 serial: <superuser required>
  Mobo: Razer model: CH580 v: 4 serial: <superuser required> UEFI: Razer
    v: 1.08 date: 02/16/2022
CPU:
  Info: 14-core (6-mt/8-st) model: 12th Gen Intel Core i7-12800H bits: 64
    type: MST AMCP smt: enabled arch: Alder Lake rev: 3 cache: L1: 1.2 MiB
    L2: 11.5 MiB L3: 24 MiB
  Speed (MHz): avg: 534 high: 699 min/max: 400/4800:3700 cores: 1: 510
    2: 441 3: 499 4: 548 5: 552 6: 681 7: 490 8: 467 9: 469 10: 447 11: 435
    12: 445 13: 615 14: 633 15: 608 16: 530 17: 552 18: 496 19: 580 20: 699
    bogomips: 112127
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: Intel Alder Lake-P Integrated Graphics vendor: Razer USA
    driver: i915 v: kernel ports: active: eDP-1 empty: none bus-ID: 00:02.0
    chip-ID: 8086:46a6 class-ID: 0300
  Device-2: NVIDIA GA104 [Geforce RTX 3070 Ti Laptop GPU] driver: nvidia
    v: 525.60.11 pcie: speed: Unknown lanes: 63 ports: active: none
    empty: DP-1, DP-2, DP-3, HDMI-A-1 bus-ID: 01:00.0 chip-ID: 10de:24a0
    class-ID: 0300
  Device-3: IMC Networks Integrated RGB Camera type: USB driver: uvcvideo
    bus-ID: 1-2:2 chip-ID: 13d3:5279 class-ID: 0e02 serial: <filter>
  Display: x11 server: X.Org v: 1.21.1.3 compositor: marco v: 1.26.0
    driver: X: loaded: modesetting,nvidia unloaded: fbdev,nouveau,vesa
    gpu: i915 display-ID: :0.0 screens: 1
  Screen-1: 0 s-res: 1920x1080 s-dpi: 98 s-size: 499x280mm (19.6x11.0")
    s-diag: 572mm (22.5")
  Monitor-1: eDP-1 model: TL156VDXP02-0 res: 1920x1080 hz: 60 dpi: 142
    size: 344x194mm (13.5x7.6") diag: 395mm (15.5") modes: 1920x1080
  OpenGL: renderer: Mesa Intel Graphics (ADL GT2) v: 4.6 Mesa 22.0.5
    direct render: Yes

nvidia debug log @ transfer.sh/JphnSL/nvidia-bug-report.log.gz
(internal upload feature was yielding an error)

Hi, I am running Fedora 37 on a Razer Blade 15 2022.
So far the only solution for me is to open nvidia-settings and, under PowerMizer, change Preferred Mode to Prefer Maximum Performance. As long as it stays there it won't crash. Not the best solution but it works for now.
Also, nouveau with the boot parameter nouveau.runpm=0 works.

Please check if downgrading the driver to 515/520 works.
If not, please try setting kernel parameter
intel_idle.max_cstate=1
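
On Ubuntu-based systems such as Mint, a kernel parameter like this is usually added to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub, after which GRUB is updated. The sketch below assumes that layout (a single double-quoted line). The function name add_grub_param is hypothetical, and it only prints the modified file so that the result can be reviewed before anything is changed.

```shell
# Print a GRUB defaults file with one parameter appended to
# GRUB_CMDLINE_LINUX_DEFAULT, unless the parameter is already there.
# The file itself is not modified.
add_grub_param() {
    awk -v p="$1" '
        /^GRUB_CMDLINE_LINUX_DEFAULT="/ {
            # Keep only the text between the double quotes.
            value = $0
            sub(/^GRUB_CMDLINE_LINUX_DEFAULT="/, "", value)
            sub(/"[ \t]*$/, "", value)
            # Append only if the parameter is not already a whole word.
            if (index(" " value " ", " " p " ") == 0)
                value = (value == "" ? p : value " " p)
            print "GRUB_CMDLINE_LINUX_DEFAULT=\"" value "\""
            next
        }
        { print }
    ' "${2:-/etc/default/grub}"
}

# Example:
#   add_grub_param intel_idle.max_cstate=1 > /tmp/grub.new
```

After checking the output, the new file would be copied over /etc/default/grub with root privileges, followed by sudo update-grub and a reboot.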

My Linux Mint Driver Manager does not show 520 as an option.
I tried installing it manually but APT just wanted to install nvidia-driver-520 without uninstalling the rest, so I did not do that.
Regarding 515, I was reading that there are bugs specifically related to 3xxx cards so I will try it only if everything else fails.

I tried, but nothing changed.
Same working GPU as soon as I boot the system, same “falls off the bus” error after a few minutes of idle time.

Wow, thanks! That seems to be a workaround as the GPU does not fall off the bus right away!
I will use it like this and try out an external monitor soon to verify that it works alright.

Do you have further suggestions?
As a first test, it does not seem that the PowerMizer setting survives a reboot.
I also tried from the command line to type:

nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"

but when I opened Nvidia settings immediately after that, the PowerMizer setting was still on “Auto”.
Any thoughts?

I will need to exploit the GPU, if it stays on the bus, so nouveau is blacklisted by my nvidia driver.
Are you allowed to run nouveau for X and still load the nvidia driver for CUDA computations?

Also, @BadWolf84, the laptop was sleeping alright before I set the PowerMizer mode, but now it’s waking from sleep and showing me the classic (non-blinking) typing underscore and no desktop.
Did you have to do anything to resume from sleep on Fedora?

Will this kind of bug eventually be fixed in a future version of the NVIDIA drivers?
I would imagine most people would not spend this much time trying to fix it.
Thanks for the replies! I have now got to a barely functioning environment, but I would still like to get your thoughts on the reasons for the issue and any possible proper fix.

Works for me. I have an external monitor on the HDMI port.

I saved the configuration file via nvidia-settings Configuration → Save Current Configuration, which writes “~/.nvidia-settings-rc”. This survived a reboot.

Unfortunately no. Sleep is not working for me right now. It immediately wakes up again.

I don't really know. In my opinion no, because the driver for X is never loaded. There might be a solution if you start a second X in the background with the nvidia driver and then feed its output to the first one. That's a solution if you want to mimic real Optimus like in Windows, where the screen is driven by the Intel driver and only certain apps use nvidia. So I figure it might also work for nouveau / nvidia.

In the Arch Wiki I read about some Razer Blades having a faulty ACPI DSDT. I suspect it's the same for this machine. Unfortunately I haven't found the time to dig deeper into this rabbit hole.
If I find the time I will investigate this further because I am really interested in a proper fix.

I checked again and I can confirm that the PowerMizer setting does not end up in your nvidia RC file, even if you explicitly “Save Current Configuration”.
Are you sure yours does? Would you mind sharing it?

Indeed, many folks around the web were asking how to automate it.
I ended up following this comment, which suggests putting a .desktop file in your autostart directory so that it is executed once your X session is alive:

[Desktop Entry]
Type=Application
Name=Autoset Nvidia to Performance Mode
Exec="nvidia-settings" "-a" "[gpu:0]/GpuPowerMizerMode=1"

I have not rebooted yet but I don’t see how that could fail.
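
For anyone who wants to script this step, here is a minimal sketch that writes the entry above into the per-user autostart directory. The function name, the file name nvidia-powermizer.desktop, and the default target of ~/.config/autostart (or $XDG_CONFIG_HOME/autostart if set) are assumptions for illustration, not something prescribed in the thread.

```shell
# Write an autostart entry that sets PowerMizer to mode 1
# ("Prefer Maximum Performance") when the desktop session starts.
# An alternative target directory can be passed as the first argument.
install_powermizer_autostart() {
    dir=${1:-${XDG_CONFIG_HOME:-$HOME/.config}/autostart}
    mkdir -p "$dir" || return 1
    cat > "$dir/nvidia-powermizer.desktop" <<'EOF'
[Desktop Entry]
Type=Application
Name=Autoset Nvidia to Performance Mode
Exec="nvidia-settings" "-a" "[gpu:0]/GpuPowerMizerMode=1"
EOF
}

# Example:
#   install_powermizer_autostart
```

Whether the setting is applied early enough to prevent the crash would still need to be checked after a reboot.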

#
# /home/badwolf/.nvidia-settings-rc
#
# Configuration file for nvidia-settings - the NVIDIA Settings utility
# Generated on Tue Dec 13 09:23:37 2022
#

# ConfigProperties:

RcFileLocale = C
DisplayStatusBar = Yes
SliderTextEntries = Yes
IncludeDisplayNameInConfigFile = No
UpdateRulesOnProfileNameChange = Yes
Timer = PowerMizer_Monitor_(GPU_0),Yes,1000
Timer = Thermal_Monitor_(GPU_0),Yes,1000
Timer = Memory_Used_(GPU_0),Yes,3000

# Attributes:

[DPY:HDMI-1-0]/RedBrightness=0.000000
[DPY:HDMI-1-0]/GreenBrightness=0.000000
[DPY:HDMI-1-0]/BlueBrightness=0.000000
[DPY:HDMI-1-0]/RedContrast=0.000000
[DPY:HDMI-1-0]/GreenContrast=0.000000
[DPY:HDMI-1-0]/BlueContrast=0.000000
[DPY:HDMI-1-0]/RedGamma=1.000000
[DPY:HDMI-1-0]/GreenGamma=1.000000
[DPY:HDMI-1-0]/BlueGamma=1.000000
[DPY:HDMI-1-0]/Dithering=0
[DPY:HDMI-1-0]/DitheringMode=0
[DPY:HDMI-1-0]/DitheringDepth=0
[DPY:HDMI-1-0]/DigitalVibrance=0
[DPY:HDMI-1-0]/ColorSpace=0
[DPY:HDMI-1-0]/ColorRange=0
[DPY:HDMI-1-0]/SynchronousPaletteUpdates=0
[GPU:0]/GPUPowerMizerMode=1

This is my ~/.nvidia-settings-rc

I checked the rpm and it seems there is a desktop file included:
/etc/xdg/autostart/nvidia-settings-user.desktop

[Desktop Entry]
Type=Application
Exec=nvidia-settings -l
Icon=nvidia-settings
Hidden=false
NoDisplay=false
Name[en_GB]=nvidia-settings
Name=nvidia-settings
Comment[en_GB]=Load user settings
Comment=Load user settings
X-GNOME-Autostart-Delay=30
X-GNOME-Autostart-enabled=true

I am not sure if the assign command alone will work. I would also execute nvidia-settings -l so it's loaded.
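
Loading the RC file only helps if the PowerMizer line is actually in it, so a quick check before relying on nvidia-settings -l may save a reboot. This is a sketch that assumes the attribute is written as in the RC file shown above ([GPU:0]/GPUPowerMizerMode=...); the function name rc_has_powermizer is hypothetical.

```shell
# Succeed if the given nvidia-settings RC file contains a PowerMizer
# mode line for GPU 0. The match is case-insensitive.
# Defaults to ~/.nvidia-settings-rc when no file is given.
rc_has_powermizer() {
    grep -qi '^\[gpu:0\]/gpupowermizermode=' "${1:-$HOME/.nvidia-settings-rc}"
}

# Example:
#   rc_has_powermizer || echo "PowerMizer mode is not saved in the RC file"
```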

That is weird, my NVIDIA settings definitely does not export this line.
Are we running the same version?

$ nvidia-settings --version

nvidia-settings:  version 510.47.03

Oh yes, I see those on my system, too.
It seems they are the standard .desktop files created by the NVIDIA installation.

$ cd /etc/xdg/autostart && grep Exec nvidia*
nvidia-prime.desktop:Exec=/usr/lib/nvidia-prime-applet/nvidia-prime
nvidia-settings-autostart.desktop:Exec=sh -c '/usr/bin/nvidia-settings --load-config-only'

which indeed include the loading from the RC setting.
In any case, the nvidia-settings --assign line does not need an RC file and works directly with the running daemon (or whatever that is).

In any case, this issue is extremely weird.
Does it have to do with the GPU clocks?
In another thread here I read that Windows has stricter limits on what the GPU can do, whereas Linux does not limit anything, and maybe that is why the GPU falls off the bus.

Or, on the other hand, it might be due to the specs of my Razer Blade, which is equipped with an RTX 3070 Ti and, most importantly, an internal FullHD display with 360 Hz refresh rate.
Linux only sees the 60 Hz setting, which might (?) be causing some instability.
Then again, even when I was able to connect the external monitor as soon as I booted the system, and was able to disable the internal monitor, the “GPU falls off the bus” error was just a minute or two away, thus disabling the internal monitor was not removing the source of the instability.

Yes I use a newer version: 520.56.06

It must have something to do with the power management states.
So there may be some ACPI error. I will investigate if I find the time.

60 Hz should not be a problem for stability.

what driver are you using?
keep in mind that apparently there is a mismatch between the driver’s version and the one for nvidia-settings:

# I am running driver 525

$ nvidia-smi 
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
+-----------------------------------------------------------------------------+
[...]

# but

$ nvidia-settings --version

nvidia-settings:  version 510.47.03
  The NVIDIA Settings tool.
[...]

just to confirm that we are talking about the same things!
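
To compare the two versions without reading the banners by eye, the version strings can be pulled out of each tool's output. This is a sketch that assumes the output formats shown above; the function names driver_version and settings_version are made up for illustration.

```shell
# Extract the driver version from nvidia-smi banner text on stdin.
driver_version() {
    sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p'
}

# Extract the tool version from "nvidia-settings --version" text on stdin.
settings_version() {
    sed -n 's/^nvidia-settings: *version \([0-9.]*\).*/\1/p'
}

# Example (needs the NVIDIA tools installed):
#   [ "$(nvidia-smi | driver_version)" = "$(nvidia-settings --version | settings_version)" ] \
#       || echo "driver and nvidia-settings versions differ"
```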

60 Hz should? or should not?
or did you mean to type 360 Hz?
I am confused. :)

I am running driver 520.56.06

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+

nvidia-settings:  version 520.56.06
The NVIDIA Settings tool.

I edited my mistake. I meant to write 60Hz is not a problem. =)

Which cpu model is built into the notebook?

huh, in my case driver version and nvidia-settings version are different… oh, well.

okay, that makes more sense.

Here it is:

CPU: 12th Gen Intel i7-12800H (20) @ 4.800GHz 

When I have some time, I will try running an earlier nvidia driver, e.g. 470, and report back here.

That is quite a new CPU. Please try upgrading to the latest kernel using the liquorix PPA, if you have not already tried.

I’m currently running 5.17.0 thanks to package linux-oem-22.04, which solved my wifi issue.
Interestingly, I also tried 6.0.0 through linux-oem-22.04b, but that broke the wireless just like kernel 5.15 …

Is the liquorix PPA going to provide a different kernel? I’m not familiar with it.

The liquorix kernel also provides kernel 6.0, though it might just be a firmware issue with your wifi.

https://bugzilla.kernel.org/show_bug.cgi?id=156341

In this thread they talk about PCI power saving and how there is some stuff in the ACPI tables that needed to be done for Windows but that the Linux kernel has problems with. I suspect that our problem has a similar cause.

I see, thanks, I’ll skim through that once I have some time.

I tried driver 470 without success – the same happens.
Actually, it’s even worse, since at boot time I don’t see most options in the nvidia-settings tool, e.g. the PowerMizer settings… they are just not there, as if the GPU was not communicating properly, although nvidia-smi works and shows the X process, too.
Until, a few minutes later, the GPU falls off the bus as usual.

Regarding the suspend issue, I think it is now working correctly after having added mem_sleep_default=deep to my GRUB_CMDLINE_LINUX_DEFAULT and updating grub.
I have not tested thoroughly but it suspended and resumed from sleep successfully a handful of times, so fingers crossed that it will work reliably.
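
To check after a reboot that the parameter took effect, one can read /sys/power/mem_sleep, where the kernel lists the available suspend modes and marks the active one with square brackets. A minimal sketch; the function name active_sleep_mode is hypothetical.

```shell
# Print the active suspend mode, i.e. the entry in square brackets.
# The mode list can be passed as an argument; if it is omitted,
# /sys/power/mem_sleep of the running system is read.
active_sleep_mode() {
    modes=${1-$(cat /sys/power/mem_sleep)}
    printf '%s\n' "$modes" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Example:
#   active_sleep_mode
```

With mem_sleep_default=deep applied, this should print deep, provided the firmware offers that mode.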