340.107 legacy driver and Kernel 4.20. Kernel crash when running any OpenGL application.

I’m using a GTX 260M with openSUSE Tumbleweed, and I haven’t been able to use the desktop since I updated the kernel (from 4.8.8 to 4.20).
I tried different kernels from 4.12 up to 5.0 without success. From 4.12 to 4.18 I was not able to build the driver due to this error:
“Cannot generate ORC metadata for CONFIG_UNWINDER_ORC=y, please install libelf-dev, libelf-devel or elfutils-libelf-devel”. I installed libelf-dev (also the 32-bit package), but it did not seem to help. I also ran into another problem with the driver build: I was getting “module: nvidia: Unknown rela relocation: 4” when trying to load the kernel module, but I fixed that by downgrading the binutils package after reading [url]https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908568[/url].

So I tried newer kernels until the driver compiled, but I’m getting crashes on 4.20 and 5.0 (with the driver patched to make it compile on the latest kernel).

I’m able to use the driver without any issues on the 4.8.8 kernel (but I have another, unrelated issue there that blocks me from continuing to use that kernel version).

nvidia-bug-report.sh crashes Linux, but at least in safe mode I was able to recover the crash report.
I was also able to capture the kernel errors from when the system crashes.

Any help will be appreciated.
nvidia-260m-crash.log (12.1 KB)
nvidia-bug-report.log.gz (60.4 KB)

Check this out:
[url]https://devtalk.nvidia.com/default/topic/1046661/linux/-solved-linux-opensuse-leap-15-0-64bit-unable-to-link-o-files-compiled-by-the-nvidia-installer/1[/url]

Thanks for trying to help. Sadly I don’t have that issue, though the suggestion made sense.

You’re overriding the gcc check and using gcc 7.4 to compile the modules for a kernel that was compiled with gcc 8.3.1. That will never really work.
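A quick way to spot this kind of mismatch is to compare the compiler recorded in /proc/version with the one on the PATH. The following is only a sketch: the helper names are made up for illustration, and it assumes the "gcc version X.Y.Z" wording that kernels of this era print in /proc/version.

```shell
#!/bin/sh
# Hypothetical helper: extract the compiler version from a
# /proc/version-style line on stdin. Assumes the line contains
# the text "gcc version X.Y.Z".
kernel_gcc_version() {
    sed -n 's/.*gcc version \([0-9][0-9.]*\).*/\1/p'
}

# Hypothetical helper: succeeds when two version strings share
# the same major number (the part before the first dot).
same_major() {
    [ "${1%%.*}" = "${2%%.*}" ]
}

# Typical use on the affected machine (left commented out here):
#   built_with=$(kernel_gcc_version < /proc/version)
#   same_major "$built_with" "$(gcc -dumpversion)" || echo "compiler mismatch"
```

Only the major version is compared, because some distributions configure `gcc -dumpversion` to print just the major number.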

Hi, I uploaded the crash log from one of the many tests I tried. I actually built the module with GCC 8.3.1 first, but I had the exact same issue. Anyway, here’s my crash log with the module built with GCC 8.3.1.
nvidia-260m-crash.log (15.2 KB)
nvidia-bug.report.log.gz (71.9 KB)

Ok, looking at the crashdump, at the beginning, there’s this:

Mar 23 15:49:36 spartan-nb kernel: pciehp 0000:00:01.0:pcie004: Slot(16): Card not present
Mar 23 15:49:36 spartan-nb kernel: pciehp 0000:00:01.0:pcie004: Slot(16): Card present

meaning the PCIe hotplug driver detects a removal and an immediate re-insertion of the GPU. The pciehp driver was rewritten for the 4.19 kernel, so that seems to have introduced a bug in your case. Please try to disable it using the kernel parameter
pci=nopciehp
and if that helps, report a bug with your distro’s bug-tracker.
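
To see how often the slot flaps, the presence events can be counted in the kernel log. This is a rough sketch: the function name is invented here, and the pattern is modelled only on the two log lines quoted above, so other kernels may word the message differently.

```shell
#!/bin/sh
# Hypothetical helper: count pciehp "Card present" / "Card not present"
# events in kernel log text read from stdin.
count_pciehp_events() {
    grep -cE 'pciehp .*Slot\([0-9-]+\): Card (not )?present'
}

# Typical use on the affected machine (left commented out here):
#   journalctl -k -b | count_pciehp_events
```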

Thanks for the help again.
I tried disabling PCIe hotplug, but the parameter does not seem to work for me.

Mar 24 15:37:27 spartan-nb kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-4.20.13-1.gfb7c4a5-default root=UUID=15c39a59-8b62-4828-9bd7-e0b99609dc7b  resume=/dev/disk/by-id/ata-ST9320421AS_5TJ0PSCY-part5 splash=silent quiet pci=nopciehp showopts vga=792 nomodeset
Mar 24 15:37:27 spartan-nb kernel: PCI: Unknown option `nopciehp'

This is from my boot log. I googled to see whether the parameter is wrong, but I’m not finding much information about it.

Thanks.

Sorry, after reading the whole story it turns out that this parameter was proposed but never actually implemented. So I don’t really know an easy way to disable hotplug without building a custom kernel. I don’t know whether the PCIe slot capability can be manipulated. Please post the output of:

sudo lspci -vv -s 0000:00:01.0

No problem. For the moment I can continue using kernel 4.8.8, as I managed to fix the unrelated problems that made me upgrade the kernel (I was having issues with Qt5 because of a change that breaks Qt on old kernels: https://superuser.com/questions/1347723/arch-on-wsl-libqt5core-so-5-not-found-despite-being-installed ). Running an old kernel is not ideal, but I can deal with that.

Here’s the output of the command:

00:01.0 PCI bridge: Intel Corporation Mobile 4 Series Chipset PCI Express Graphics Port (rev 07) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 24
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 0000c000-0000cfff 
        Memory behind bridge: fa000000-fdefffff 
        Prefetchable memory behind bridge: 00000000d0000000-00000000dfffffff 
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA+ VGA16+ MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [88] Subsystem: ASUSTeK Computer Inc. Device 19a7
        Capabilities: [80] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee0300c  Data: 41d1
        Capabilities: [a0] Express (v1) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #2, Speed 2.5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <256ns, L1 <4us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM L0s Enabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (ok), Width x1 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise-
                        Slot #16, PowerLimit 75.000W; Interlock- NoCompl+
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet+ CmdCplt- HPIrq+ LinkChg-
                        Control: AttnInd Off, PwrInd On, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
                        Changed: MRL- PresDet- LinkState-
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
                RootCap: CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed+ WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [140 v1] Root Complex Link
                Desc:   PortNumber=02 ComponentID=01 EltType=Config
                Link0:  Desc:   TargetPort=00 TargetComponent=01 AssocRCRB- LinkType=MemMapped LinkValid+
                        Addr:   00000000fed19000
        Kernel driver in use: pcieport
        Kernel modules: shpchp

Thanks!

It does look like the BIOS incorrectly sets the HotPlug capability (+); it’s a notebook, after all. That still doesn’t explain why the new pciehp driver freaks out. Maybe also check for a BIOS update, and cross-check whether the hotplug bit is also set when using the 4.8 kernel.
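
For the cross-check between kernels, the flag can be pulled out of the lspci output so the two results are easy to compare. A minimal sketch, assuming the SltCap line has the same layout as in the output above; the function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -vv` output on stdin and print the
# sign after "HotPlug" on the SltCap line ("+" = capable, "-" = not).
slot_hotplug_flag() {
    sed -n 's/.*SltCap:.*HotPlug\([+-]\).*/\1/p'
}

# Typical use on the affected machine (left commented out here):
#   sudo lspci -vv -s 0000:00:01.0 | slot_hotplug_flag
```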

On the newer kernel, you could try an unbind right after boot, as root:

echo "0000:00:01.0:pcie04" > /sys/bus/pci_express/drivers/pciehp/unbind
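
The suffix of the port service device may differ between kernels; the log earlier in this thread shows it as pcie004 rather than pcie04. Listing what is actually bound to pciehp first avoids guessing. This is only a sketch with a made-up function name; the default directory is the one used in the unbind command above.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the port service devices
# bound to the pciehp driver. The directory can be overridden by the
# first argument; the default is the sysfs path from this thread.
pciehp_bound_devices() {
    dir=${1:-/sys/bus/pci_express/drivers/pciehp}
    for entry in "$dir"/*:pcie*; do
        # Skip the unexpanded pattern when nothing is bound.
        if [ -e "$entry" ]; then
            basename "$entry"
        fi
    done
}

# Typical use on the affected machine (left commented out here):
#   pciehp_bound_devices
```

Whatever name it prints is the string to write to the unbind file.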

I checked on 4.8.8, and hotplug is also enabled there for the PCIe port of the GPU. I also checked for a BIOS update; there was one, but it didn’t seem to include any related fix. I installed it anyway.

I tried this, disabling hotplug on the GPU’s port. The system still crashes, but, as you might expect, it produces a different error. I’ll attach the crash log that I was able to capture. This time, though, it isn’t clear what’s going on.

Thanks again.
m260.crash.log (4.7 KB)

So it seems like something is turning the slot off and on again and the pcie-hp driver was just reacting to this. Really kind of weird.