Ubuntu fails to boot with internal GPU but boots properly when external GPU is connected

So I have an internal GeForce RTX 2070 Super and an external 3090.

Ubuntu boots when the eGPU is connected but not without it. Also, after it boots, nvidia-smi frequently shows only the external GPU. At other times it initially shows both, but after some use the internal GPU eventually drops off.

I’m running Ubuntu 23.10 on kernel 6.5.0-26-generic, with NVIDIA driver version 545.29.06 and CUDA toolkit 12.3.

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Sure.
nvidia-bug-report.log (3.9 MB)

[  490.878573] NVRM: GPU at PCI:0000:01:00: GPU-a6097a7b-d7c8-2215-7aaa-f73c623b1d6e
[  490.878579] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[  490.878582] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[  490.878590] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
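For anyone wanting to pull these events out of a long dmesg or nvidia-bug-report.log, here is a minimal sketch that extracts the Xid lines. The regex is modelled only on the line format shown above, and the names `XID_RE` and `parse_xid_events` are made up for illustration; other driver versions may format these lines differently.

```python
import re

# Pattern based on the Xid line quoted above, e.g.
# [  490.878579] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', ...
# Assumption: the kernel timestamp prefix is present (plain dmesg output).
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid "
    r"\(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<rest>.*)"
)


def parse_xid_events(log_text):
    """Return one dict per NVRM Xid line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match is None:
            continue  # not an Xid line
        events.append(
            {
                "timestamp": float(match.group("ts")),  # seconds since boot
                "pci": match.group("pci"),              # PCI address as logged
                "xid": int(match.group("xid")),         # Xid error code
                "detail": match.group("rest").strip(),  # remainder of the message
            }
        )
    return events


if __name__ == "__main__":
    sample = (
        "[  490.878579] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', "
        "name=<unknown>, GPU has fallen off the bus."
    )
    for event in parse_xid_events(sample):
        print(event["timestamp"], event["pci"], event["xid"], event["detail"])
```

Feeding it the text of `dmesg` (or the uncompressed bug report) should show when each Xid 79 occurred and on which PCI address, which may help correlate the drop-offs with load or suspend/resume.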

Might be a kernel/BIOS issue, or the GPU is broken. Please check for a BIOS update first. Does it work with Windows?

BIOS is updated.

I’ve experimented with various kernels (e.g. 5.15, 6.2) and various Ubuntu releases (20.04, 22.10). The problems started when updating to 5.15, IIRC (while still on Ubuntu 20.04). Initially, I was getting frequent, unrecoverable system freezes. Now I’m getting this.

But when the internal GPU drops off the bus, the system becomes very unstable: a lot of the time I can’t even use the external GPU (even though it’s available according to nvidia-smi), and if I log out, the system becomes fully unresponsive.

Nope, I don’t have Windows.

I’ve run various hardware tests on the GPU, but none of them gave any indication that it is physically broken.

@generix any other ideas? Btw, I’ve noticed that at idle my internal GPU is at 45+ degrees Celsius, mostly hovering around 50. Seems too high to me.

But on the other hand, AFAIK it has been like that almost since I bought this laptop.
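To put numbers on both the idle temperature and the moment a GPU disappears, something like the following rough sketch could be run periodically. It relies on nvidia-smi's `--query-gpu` and `--format=csv,noheader,nounits` options; the function names are made up, and the bus ID in the sample is invented, since the eGPU's real address is not given in this thread. Treating `01:00.0` as the internal GPU is an assumption based on the log above.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY_FIELDS = ["index", "name", "pci.bus_id", "temperature.gpu"]


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(QUERY_FIELDS):
            continue  # skip blank lines and anything that is not a data row
        index, name, bus_id, temp = (field.strip() for field in record)
        if not index.isdigit():
            continue  # defensive: not a GPU row
        rows.append(
            {
                "index": int(index),
                "name": name,
                "bus_id": bus_id,
                # Temperature can be unavailable, so fall back to None.
                "temp_c": int(temp) if temp.isdigit() else None,
            }
        )
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (empty if it fails)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_csv(result.stdout)


def missing_gpus(rows, expected_bus_suffixes):
    """Return the expected bus-ID suffixes that no reported GPU matches."""
    seen = [row["bus_id"].lower() for row in rows]
    return [
        suffix
        for suffix in expected_bus_suffixes
        if not any(bus_id.endswith(suffix.lower()) for bus_id in seen)
    ]
```

Calling `missing_gpus(query_gpus(), ["01:00.0"])` every few seconds and logging the result along with `temp_c` would show whether the drop-off follows a temperature rise or happens at idle.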

It’s a ThinkPad T15g Gen 1, which is covered by the Lenovo Linux program, so it’s probably best to harass Lenovo support with this.