340.107 legacy driver and Kernel 4.20. Kernel crash when running any OpenGL application.

Prognoz · March 22, 2019, 10:16pm

I’m using a GTX 260M with openSUSE Tumbleweed, and I haven’t been able to use the desktop since I updated the kernel (from 4.8.8 to 4.20).
I tried different kernels, from 4.12 up to 5.0, without success. From 4.12 to 4.18 I was not able to build the driver due to this error:
“Cannot generate ORC metadata for CONFIG_UNWINDER_ORC=y, please install libelf-dev, libelf-devel or elfutils-libelf-devel”. I installed libelf-dev (also the 32-bit package) but it does not seem to help. I also hit another issue after building the driver: I was getting “module: nvidia: Unknown rela relocation: 4” when trying to load the kernel module, but I fixed that by downgrading the binutils package after reading https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908568.
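
To confirm that the ORC requirement applies to a given kernel, one option is to look for the setting in that kernel's config. This is only a minimal sketch: `orc_unwinder_enabled` is a hypothetical helper, and the `/boot/config-$(uname -r)` path is an assumption about where the distro ships the config.

```shell
# Succeeds if the kernel config read from stdin enables the ORC unwinder.
orc_unwinder_enabled() {
    grep -q '^CONFIG_UNWINDER_ORC=y'
}

# Usage (the config path is an assumption; some systems expose /proc/config.gz instead):
#   orc_unwinder_enabled < "/boot/config-$(uname -r)" \
#       && echo "ORC unwinder enabled: libelf development headers are needed to build modules"
```

The error message lists three package names because they differ between distros. libelf-dev is the Debian-style name, so on openSUSE the matching package is most likely libelf-devel, which is worth double-checking.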

So I tried newer kernels until it compiled, but I’m getting crashes on 4.20 and 5.0 (with the driver patched to make it compile on the latest kernel).

I’m able to use the driver without any issues on the 4.8.8 kernel (but I have another, unrelated issue that blocks me from staying on that kernel version).

nvidia-bug-report.sh crashes Linux, but at least in safe mode I was able to recover the crash report.
I was also able to capture the kernel errors from when the system crashes.

Any help will be appreciated.
nvidia-260m-crash.log (12.1 KB)
nvidia-bug-report.log.gz (60.4 KB)

generix · March 23, 2019, 2:53am

Check this out:
https://devtalk.nvidia.com/default/topic/1046661/linux/-solved-linux-opensuse-leap-15-0-64bit-unable-to-link-o-files-compiled-by-the-nvidia-installer/1

Prognoz · March 23, 2019, 3:07am

Thanks for trying to help. Sadly I don’t have that issue, but it made sense.

generix · March 23, 2019, 6:09pm

You’re overriding the gcc check and using gcc 7.4 to compile the modules for a kernel compiled with gcc 8.3.1. That will never really work.
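
A rough way to spot this kind of mismatch before building is to compare the compiler recorded in the kernel banner with the local gcc. The sketch below assumes the `gcc version X.Y.Z` wording that 4.x and 5.0 kernels print in `/proc/version`. Both helper names are hypothetical and not part of the NVIDIA installer.

```shell
# Extract the gcc version recorded in a /proc/version-style banner read from stdin.
# Assumes the "gcc version X.Y.Z" wording used by 4.x/5.0 kernels.
kernel_gcc_version() {
    sed -n 's/.*gcc version \([0-9][0-9.]*\).*/\1/p'
}

# Succeeds if both version strings share the same major number.
same_gcc_major() {
    [ "${1%%.*}" = "${2%%.*}" ]
}

# Usage on the affected machine:
#   built_with=$(kernel_gcc_version < /proc/version)
#   local_gcc=$(gcc -dumpversion)
#   same_gcc_major "$built_with" "$local_gcc" \
#       || echo "mismatch: kernel built with gcc $built_with, local gcc is $local_gcc"
```

Only the major version is compared because `gcc -dumpversion` prints just the major number on some distro builds.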

Prognoz · March 23, 2019, 7:00pm

Hi, I uploaded the crash log from one of the many tests I tried. I actually built the module with GCC 8.3.1 first too, but I had the exact same issue. Anyway, here’s my crash log with the module built with GCC 8.3.1.
nvidia-260m-crash.log (15.2 KB)
nvidia-bug.report.log.gz (71.9 KB)

generix · March 24, 2019, 3:53pm

Ok, looking at the crashdump, at the beginning, there’s this:

Mar 23 15:49:36 spartan-nb kernel: pciehp 0000:00:01.0:pcie004: Slot(16): Card not present
Mar 23 15:49:36 spartan-nb kernel: pciehp 0000:00:01.0:pcie004: Slot(16): Card present

meaning the PCIe hotplug driver detects a removal and an instant re-add of the GPU. The pciehp driver was rewritten for the 4.19 kernel, so that seems to have introduced a bug in your case. Please try to disable it using the kernel parameter
pci=nopciehp
and if that helps, report a bug with your distro’s bug-tracker.
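
To check whether a boot shows the same presence flapping, the kernel log can be filtered for these pciehp messages. This is a minimal sketch matching the wording seen in the log above. Other kernel versions may phrase the message differently, and `pciehp_presence_events` is a hypothetical helper name.

```shell
# Print pciehp presence-change messages from a kernel log read on stdin.
pciehp_presence_events() {
    grep -E 'pciehp .*Slot\([0-9]+\): Card (not )?present'
}

# Usage (either source of kernel messages should do):
#   journalctl -k -b | pciehp_presence_events
#   dmesg | pciehp_presence_events
```

A "Card not present" line followed by "Card present" for the same slot within the same second is the pattern described here.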

Prognoz · March 24, 2019, 7:06pm

Thanks for the help again.
I tried disabling PCIe hotplug, but the parameter does not seem to work for me.

Mar 24 15:37:27 spartan-nb kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-4.20.13-1.gfb7c4a5-default root=UUID=15c39a59-8b62-4828-9bd7-e0b99609dc7b  resume=/dev/disk/by-id/ata-ST9320421AS_5TJ0PSCY-part5 splash=silent quiet pci=nopciehp showopts vga=792 nomodeset
Mar 24 15:37:27 spartan-nb kernel: PCI: Unknown option `nopciehp'

This is from my boot log. I googled to see if the parameter is wrong, but I’m not finding much information about it.

Thanks.

generix · March 24, 2019, 7:38pm

Sorry, reading the whole story, it turned out that this parameter was proposed but never actually implemented. So I don’t know an easy way to disable hotplug without building a custom kernel. I don’t know whether the PCIe slot capability can be manipulated. Please post the output of:
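
As a general check when trying `pci=` parameters, the boot log can be scanned for options the kernel rejected. The sketch below keys on the `PCI: Unknown option` message quoted earlier in the thread. `rejected_pci_options` is a hypothetical helper name.

```shell
# Print the names of pci= options the kernel reported as unknown (log on stdin).
rejected_pci_options() {
    sed -n "s/.*PCI: Unknown option \`\([^']*\)'.*/\1/p"
}

# Usage:
#   journalctl -k -b | rejected_pci_options
```

No output means no `pci=` option was rejected in that boot. That alone does not prove a given option had the intended effect.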

sudo lspci -vv -s 0000:00:01.0

Prognoz · March 24, 2019, 7:55pm

No problem. For the moment I can continue using kernel 4.8.8, as I managed to fix the unrelated problems that made me upgrade the kernel (I was having issues with Qt5 because of a change that breaks Qt on old kernels: https://superuser.com/questions/1347723/arch-on-wsl-libqt5core-so-5-not-found-despite-being-installed ). Having an old kernel is not ideal, but I can deal with that.

Here’s the output of the command:

00:01.0 PCI bridge: Intel Corporation Mobile 4 Series Chipset PCI Express Graphics Port (rev 07) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 24
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 0000c000-0000cfff 
        Memory behind bridge: fa000000-fdefffff 
        Prefetchable memory behind bridge: 00000000d0000000-00000000dfffffff 
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA+ VGA16+ MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [88] Subsystem: ASUSTeK Computer Inc. Device 19a7
        Capabilities: [80] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee0300c  Data: 41d1
        Capabilities: [a0] Express (v1) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #2, Speed 2.5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <256ns, L1 <4us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM L0s Enabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (ok), Width x1 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise-
                        Slot #16, PowerLimit 75.000W; Interlock- NoCompl+
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet+ CmdCplt- HPIrq+ LinkChg-
                        Control: AttnInd Off, PwrInd On, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
                        Changed: MRL- PresDet- LinkState-
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
                RootCap: CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed+ WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [140 v1] Root Complex Link
                Desc:   PortNumber=02 ComponentID=01 EltType=Config
                Link0:  Desc:   TargetPort=00 TargetComponent=01 AssocRCRB- LinkType=MemMapped LinkValid+
                        Addr:   00000000fed19000
        Kernel driver in use: pcieport
        Kernel modules: shpchp

Thanks!

generix · March 24, 2019, 8:06pm

Looks like the BIOS indeed incorrectly sets the HotPlug capability (+); it’s a notebook after all. That still doesn’t explain why the new pciehp driver freaks out. Maybe also check for a BIOS update, and cross-check whether the hotplug bit is also set when using the 4.8 kernel.
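
For the cross-check between kernels, the flag can be pulled out of the `lspci -vv` output so the two boots are easy to compare. This is a minimal sketch that assumes the `SltCap:` line layout shown in the output above. `slot_hotplug_flag` is a hypothetical helper name.

```shell
# Print the HotPlug flag (HotPlug+ or HotPlug-) from lspci -vv output on stdin.
slot_hotplug_flag() {
    sed -n 's/.*SltCap:.*\(HotPlug[+-]\).*/\1/p'
}

# Usage, once per kernel:
#   sudo lspci -vv -s 0000:00:01.0 | slot_hotplug_flag
```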

generix · March 24, 2019, 8:13pm

On a newer kernel, you could try an unbind right at boot, as root:

echo "0000:00:01.0:pcie004" > /sys/bus/pci_express/drivers/pciehp/unbind
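
The exact service-device name can differ between kernels (the 4.20 log earlier in this thread shows `pcie004`), so it may be safer to look it up than to type it by hand. The sketch below lists whatever is bound to pciehp for a given port. `pciehp_services` is a hypothetical helper, and it assumes the driver directory contains one entry per bound device.

```shell
# List port-service devices bound to pciehp for one PCI port.
# $1: PCI address of the port, $2: driver directory (defaults to the sysfs path above).
pciehp_services() {
    port=$1
    dir=${2:-/sys/bus/pci_express/drivers/pciehp}
    for entry in "$dir/$port":pcie*; do
        # An unmatched glob stays literal, so skip entries that do not exist.
        [ -e "$entry" ] && basename "$entry"
    done
    return 0
}

# Usage as root:
#   for svc in $(pciehp_services 0000:00:01.0); do
#       echo "$svc" > /sys/bus/pci_express/drivers/pciehp/unbind
#   done
```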

Prognoz · March 25, 2019, 12:13am

I checked on 4.8.8 and hotplug is also enabled for the GPU’s PCIe port. I also checked for a BIOS update; there was one, but it didn’t seem to include any related fix. I updated it anyway.

I tried this, disabling hotplug on the GPU’s port. The system still crashes, but it produces a different error, as you might imagine. I’ll attach the crash log that I was able to capture, but this time it isn’t clear what’s going on.

Thanks again.
m260.crash.log (4.7 KB)

generix · March 25, 2019, 9:47am

So it seems like something is turning the slot off and on again, and the pciehp driver was just reacting to this. Really kind of weird.
