PCI driver kernel module load problem

NVIDIA Jetson TX2
L4T 28.2.1 [ JetPack 3.3 or 3.2.1 ]
Board: t186ref
Ubuntu 16.04 LTS
Kernel Version: 4.4.38-tegra

I’m having some issues loading a PCI kernel module during startup on a Jetson TX2. When our system boots, the kernel has 3 attempts to load, and the third time is successful.
I have located the problem to be due to our PCI driver. The module init function gets called, however, after the call to pci_register_driver(), the probe() function does not. Instead there are lots of errors reported, eventually culminating in the message “Fixing recursive fault but reboot is needed!”. This repeats, then on the third attempt the module load is successful (probe() function called) and the kernel load completes.

Some of the kernel log, immediately after the pci_register_driver() call is as follows :

[    7.756180] xhci-tegra 3530000.xhci: Firmware timestamp: 2017-12-07 10:50:08 UTC, Version: 55.09 release
[    7.769348] xhci-tegra 3530000.xhci: xHCI Host Controller
[    7.775566] xhci-tegra 3530000.xhci: new USB bus registered, assigned bus number 1
[    7.785485] xhci-tegra 3530000.xhci: hcc params 0x0184fd25 hci version 0x100 quirks 0x00010810
[    7.795725] xhci-tegra 3530000.xhci: irq 60, io mem 0x03530000
[    7.802469] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[    7.810166] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    7.810171] usb usb1: Product: xHCI Host Controller
[    7.810173] usb usb1: Manufacturer: Linux 4.4.38-tegra xhci-hcd
[    7.810175] usb usb1: SerialNumber: 3530000.xhci
[    7.810623] hub 1-0:1.0: USB hub found
[    7.810641] hub 1-0:1.0: 4 ports detected
[    7.826478] dhd_module_init in
[    7.826509] tegra_net_perf_init: cannot get wifi sclk
[    7.826581] found wifi platform device bcmdhd_wlan
[    7.827957] wifi_platform_get_country_code_map: could not get country_code_map
[    7.827959] wifi_plat_dev_drv_probe:platform country code map is not available
[    7.827972] Power-up adapter 'DHD generic adapter'
[    7.827977] wifi_platform_set_power = 1
[    7.834340] xhci-tegra 3530000.xhci: xHCI Host Controller
[    7.834347] xhci-tegra 3530000.xhci: new USB bus registered, assigned bus number 2
[    7.834460] usb usb2: New USB device found, idVendor=1d6b, idProduct=0003
[    7.834462] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    7.834464] usb usb2: Product: xHCI Host Controller
[    7.834466] usb usb2: Manufacturer: Linux 4.4.38-tegra xhci-hcd
[    7.834467] usb usb2: SerialNumber: 3530000.xhci
[    7.834777] hub 2-0:1.0: USB hub found
[    7.834799] hub 2-0:1.0: 3 ports detected
[    7.835144] tegra-xotg xotg: otg: host 3530000.xhci registered
[    7.934622] xhci-tegra 3530000.xhci: entering ELPG
[    7.937239] xhci-tegra 3530000.xhci: entering ELPG done
[    8.028372] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[    8.030628] wifi_platform_bus_enumerate device present 1
[    8.045897] tegra-pcie 10003000.pcie-controller: link 2 down, ignoring
[    8.046298] tegra-pcie 10003000.pcie-controller: PCI host bridge to bus 0000:00
[    8.046343] pci_bus 0000:00: root bus resource [mem 0x50100000-0x57ffffff]
[    8.046348] pci_bus 0000:00: root bus resource [mem 0x58000000-0x7fffffff pref]
[    8.046373] pci_bus 0000:00: root bus resource [bus 00-ff]
[    8.046376] pci_bus 0000:00: root bus resource [io  0x1000-0xffff]
[    8.046408] pci 0000:00:01.0: [10de:10e5] type 01 class 0x060400
[    8.046513] pci 0000:00:01.0: PME# supported from D0 D1 D2 D3hot D3cold
[    8.046657] iommu: Adding device 0000:00:01.0 to group 56
[    8.046759] pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    8.046892] pci 0000:01:00.0: [1172:e003] type 00 class 0x058000
[    8.046939] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x000001ff 64bit pref]
[    8.046960] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x0000ffff 64bit pref]
[    8.046977] pci 0000:01:00.0: reg 0x20: [mem 0x00000000-0x000fffff 64bit pref]
[    8.047169] iommu: Adding device 0000:01:00.0 to group 57
[    8.047218] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    8.047257] pci 0000:00:01.0: BAR 9: assigned [mem 0x58000000-0x581fffff 64bit pref]
[    8.047264] pci 0000:01:00.0: BAR 4: assigned [mem 0x58000000-0x580fffff 64bit pref]
[    8.047281] pci 0000:01:00.0: BAR 2: assigned [mem 0x58100000-0x5810ffff 64bit pref]
[    8.047297] pci 0000:01:00.0: BAR 0: assigned [mem 0x58110000-0x581101ff 64bit pref]
[    8.047313] pci 0000:00:01.0: PCI bridge to [bus 01]
[    8.047323] pci 0000:00:01.0:   bridge window [mem 0x58000000-0x581fffff 64bit pref]
[    8.048543] pcieport 0000:00:01.0: enabling device (0000 -> 0002)
[    8.048650] pcieport 0000:00:01.0: Signaling PME through PCIe PME interrupt
[    8.048652] pci 0000:01:00.0: Signaling PME through PCIe PME interrupt
[    8.048658] pcie_pme 0000:00:01.0:pcie01: service driver pcie_pme loaded
[    8.048764] aer 0000:00:01.0:pcie02: service driver aer loaded
[    8.048893] Bad mode in Synchronous Abort handler detected, code 0x86000007 -- IABT (current EL)
[    8.048898] Internal error: Oops - bad mode: 0 [#1] PREEMPT SMP
[    8.048911] Modules linked in: bcmdhd(+) gpu_beamformer(O) pci_tegra uio_pdrv_genirq uio bluedroid_pm
[    8.048918] CPU: 1 PID: 15 Comm: kworker/1:0 Tainted: G           O    4.4.38-tegra #23
[    8.048919] Hardware name: quill (DT)
[    8.048932] Workqueue: events pcie_delayed_detect [pci_tegra]
[    8.048935] task: ffffffc1ecdef080 ti: ffffffc1ece1c000 task.ti: ffffffc1ece1c000
[    8.048937] PC is at 0xffffffbffc0dc000
[    8.048946] LR is at pci_device_probe+0x98/0x10c
[    8.048949] pc : [<ffffffbffc0dc000>] lr : [<ffffffc0003900d0>] pstate: 60000045
[    8.048951] sp : ffffffc1ece1fa80
[    8.048956] x29: ffffffc1ece1fa80 x28: 0000000000000100 
[    8.048959] x27: ffffffc1e4d42200 x26: ffffffbffc0c2db8 
[    8.048962] x25: 0000000000000017 x24: ffffffbffc0d94a8 
[    8.048966] x23: ffffffbffc0d93a8 x22: ffffffbffc0d9410 
[    8.048969] x21: 0000000000000000 x20: ffffffc1dc886800 
[    8.048972] x19: ffffffc1dc886890 x18: 0000000000000000 
[    8.048975] x17: 0000007fa6bb9df8 x16: ffffffc000b62a60 
[    8.048979] x15: 0a01000a02020501 x14: 0ffffffffffffffe 
[    8.048983] x13: ffffffff00000000 x12: 0000000000000030 
[    8.048986] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f 
[    8.048989] x9 : feff716475687163 x8 : ffffffffffffffff 
[    8.048992] x7 : ffffffc1e5b8c798 x6 : ffffffc1e5b8c798 
[    8.048996] x5 : ffffffc1e5b8c799 x4 : 0000000000000000 
[    8.048999] x3 : ffffffc1ece1f9c0 x2 : ffffffbffc0dc000 
[    8.049002] x1 : ffffffbffc0d94a8 x0 : ffffffc1dc886800 
[    8.049003] 
[    8.049006] Process kworker/1:0 (pid: 15, stack limit = 0xffffffc1ece1c020)
[    8.049007] Call trace:
[    8.049032] [<ffffffbffc0dc000>] 0xffffffbffc0dc000
[    8.049038] [<ffffffc000575344>] driver_probe_device+0xc8/0x408
[    8.049041] [<ffffffc0005757dc>] __device_attach_driver+0x8c/0xb0
[    8.049045] [<ffffffc000573458>] bus_for_each_drv+0x54/0x94
[    8.049048] [<ffffffc00057516c>] __device_attach+0xc0/0x150
[    8.049050] [<ffffffc00057520c>] device_attach+0x10/0x18
[    8.049055] [<ffffffc000385bf8>] pci_bus_add_device+0x3c/0x60
[    8.049057] [<ffffffc000386014>] pci_bus_add_devices+0x44/0x90
[    8.049059] [<ffffffc000386040>] pci_bus_add_devices+0x70/0x90
[    8.049065] [<ffffffc0000941bc>] pci_common_init_dev+0x2d4/0x394
[    8.049072] [<ffffffbffc0bde54>] pcie_delayed_detect+0x894/0xa80 [pci_tegra]
[    8.049076] [<ffffffc0000bc2d0>] process_one_work+0x154/0x434
[    8.049079] [<ffffffc0000bc6e4>] worker_thread+0x134/0x40c
[    8.049082] [<ffffffc0000c1f30>] kthread+0xe0/0xf4
[    8.049089] [<ffffffc000084f90>] ret_from_fork+0x10/0x40
[    8.049324] wifi_platform_bus_enumerate device present 0
[    8.049457] ---[ end trace 0394ab4bd3dfd138 ]---
[    8.051204] Mass Storage Function, version: 2009/09/11
[    8.051208] LUN: removable file: (no medium)
[    8.051380] ------------[ cut here ]------------
[    8.051383] WARNING: at ffffffc0000a9074 [verbose debug info unavailable]
[    8.051394] Modules linked in: bcmdhd(+) gpu_beamformer(O) pci_tegra uio_pdrv_genirq uio bluedroid_pm
[    8.051395] 
[    8.051401] CPU: 1 PID: 15 Comm: kworker/1:0 Tainted: G      D    O    4.4.38-tegra #23
[    8.051402] Hardware name: quill (DT)
[    8.051415] task: ffffffc1ecdef080 ti: ffffffc1ece1c000 task.ti: ffffffc1ece1c000
[    8.051423] PC is at __local_bh_enable_ip+0x68/0xb8
[    8.051429] LR is at _raw_spin_unlock_bh+0x20/0x28
[    8.051431] pc : [<ffffffc0000a9074>] lr : [<ffffffc000b57418>] pstate: 400001c5
[    8.051432] sp : ffffffc1ece1f780
[    8.051437] x29: ffffffc1ece1f780 x28: ffffffc1ece1c000 
[    8.051440] x27: ffffffc1e4d42200 x26: ffffffc1ecdef080 
[    8.051443] x25: ffffffc1ecdef080 x24: 00000000000001c0 
[    8.051446] x23: 0000000000000001 x22: 0000000000000000 
[    8.051449] x21: ffffffc000f4bd48 x20: ffffffc1ecdef080 
[    8.051452] x19: ffffffc001456858 x18: 0000000000000000 
[    8.051455] x17: 0000007fa6bb9df8 x16: ffffffc000b62a60 
[    8.051458] x15: 0a01000a02020501 x14: 6563653163666666 
[    8.051461] x13: 666666203a69742e x12: 0000000000000000 
[    8.051464] x11: 0101010101010101 x10: ffffffffffff7f7f 
[    8.051468] x9 : fefefec0db8461ff x8 : ffffffc1ecdef8a0 
[    8.051471] x7 : 09432c3334313738 x6 : ffffffc001466000 
[    8.051474] x5 : ffffffc00126d560 x4 : ffffffc001252050 
[    8.051477] x3 : ffffffc00126d418 x2 : 0000000000000000 
[    8.051480] x1 : 0000000000000201 x0 : ffffffc0013d2000 
[    8.051481] 
[    8.051753] ---[ end trace 0394ab4bd3dfd139 ]---
[    8.051754] Call trace:
[    8.051758] [<ffffffc0000a9074>] __local_bh_enable_ip+0x68/0xb8
[    8.051782] [<ffffffc000b57418>] _raw_spin_unlock_bh+0x20/0x28
[    8.051788] [<ffffffc00012c81c>] cgroup_exit+0x58/0xe4
[    8.051791] [<ffffffc0000a7880>] do_exit+0x2a4/0x9a8
[    8.051796] [<ffffffc000089bd8>] bug_handler.part.3+0x0/0x7c
[    8.051801] [<ffffffc000089c9c>] arm64_notify_die+0x1c/0x58
[    8.051804] [<ffffffc000089ef0>] bad_mode+0x84/0x90
[    8.051809] [<ffffffc0003900d0>] pci_device_probe+0x98/0x10c
[    8.051813] [<ffffffc000575344>] driver_probe_device+0xc8/0x408
[    8.051816] [<ffffffc0005757dc>] __device_attach_driver+0x8c/0xb0
[    8.051820] [<ffffffc000573458>] bus_for_each_drv+0x54/0x94
[    8.051823] [<ffffffc00057516c>] __device_attach+0xc0/0x150
[    8.051825] [<ffffffc00057520c>] device_attach+0x10/0x18
[    8.051829] [<ffffffc000385bf8>] pci_bus_add_device+0x3c/0x60
[    8.051831] [<ffffffc000386014>] pci_bus_add_devices+0x44/0x90
[    8.051834] [<ffffffc000386040>] pci_bus_add_devices+0x70/0x90
[    8.051839] [<ffffffc0000941bc>] pci_common_init_dev+0x2d4/0x394
[    8.051847] [<ffffffbffc0bde54>] pcie_delayed_detect+0x894/0xa80 [pci_tegra]
[    8.051850] [<ffffffc0000bc2d0>] process_one_work+0x154/0x434
[    8.051852] [<ffffffc0000bc6e4>] worker_thread+0x134/0x40c
[    8.051856] [<ffffffc0000c1f30>] kthread+0xe0/0xf4
[    8.051860] [<ffffffc000084f90>] ret_from_fork+0x10/0x40
[    8.051938] Unable to handle kernel paging request at virtual address ffffffffffffffd8
[    8.051945] pgd = ffffffc07d453000
[    8.051959] [ffffffffffffffd8] *pgd=00000000fee5a003, *pud=00000000fee5a003, *pmd=0000000000000000
[    8.051964] Internal error: Oops: 96000005 [#2] PREEMPT SMP
[    8.051989] Modules linked in: bcmdhd(+) gpu_beamformer(O) pci_tegra uio_pdrv_genirq uio bluedroid_pm
[    8.051995] CPU: 1 PID: 15 Comm: kworker/1:0 Tainted: G      D W  O    4.4.38-tegra #23
[    8.051999] Hardware name: quill (DT)
[    8.052013] task: ffffffc1ecdef080 ti: ffffffc1ece1c000 task.ti: ffffffc1ece1c000
[    8.052018] PC is at kthread_data+0x4/0xc
[    8.052021] LR is at wq_worker_sleeping+0x10/0xc4
[    8.052026] pc : [<ffffffc0000c25f8>] lr : [<ffffffc0000bd190>] pstate: 600000c5
[    8.052029] sp : ffffffc1ece1f750
[    8.052042] x29: ffffffc1ece1f750 x28: ffffffc1ece1c000 
[    8.052052] x27: ffffffc1e4d42200 x26: ffffffc001252000 
[    8.052062] x25: 0000000000000001 x24: ffffffc000b53aa0 
[    8.052072] x23: 0000000000000000 x22: ffffffc1ecdef610 
[    8.052082] x21: ffffffc001223000 x20: ffffffc1ecdef080 
[    8.052092] x19: ffffffc1f66fe580 x18: ffffffc000bd5120 
[    8.052102] x17: 000000000000000e x16: 0000000000000007 
[    8.052112] x15: 0000000000000001 x14: 0000000000000007 
[    8.052121] x13: 000000000000000e x12: 0000000000000013 
[    8.052131] x11: 000000000000001a x10: 000000000044a850 
[    8.052140] x9 : 000000000044a850 x8 : 0000000000000400 
[    8.052150] x7 : 0000000000000000 x6 : 000000000284a84f 
[    8.052160] x5 : ffffffc1f66fe580 x4 : ffffffc1f66fef60 
[    8.052170] x3 : 000000000001af3b x2 : ffffffc1f66dd000 
[    8.052177] x1 : 0000000000000001 x0 : 0000000000000000 
[    8.052178] 
[    8.052180] Process kworker/1:0 (pid: 15, stack limit = 0xffffffc1ece1c020)
[    8.052181] Call trace:
[    8.052184] [<ffffffc0000c25f8>] kthread_data+0x4/0xc
[    8.052187] [<ffffffc000b536d0>] __schedule+0x348/0x6d4
[    8.052190] [<ffffffc000b53aa0>] schedule+0x44/0xa8
[    8.052192] [<ffffffc0000a7b94>] do_exit+0x5b8/0x9a8
[    8.052195] [<ffffffc000089bd8>] bug_handler.part.3+0x0/0x7c
[    8.052198] [<ffffffc000089c9c>] arm64_notify_die+0x1c/0x58
[    8.052200] [<ffffffc000089ef0>] bad_mode+0x84/0x90
[    8.052203] [<ffffffc0003900d0>] pci_device_probe+0x98/0x10c
[    8.052206] [<ffffffc000575344>] driver_probe_device+0xc8/0x408
[    8.052208] [<ffffffc0005757dc>] __device_attach_driver+0x8c/0xb0
[    8.052213] [<ffffffc000573458>] bus_for_each_drv+0x54/0x94
[    8.052216] [<ffffffc00057516c>] __device_attach+0xc0/0x150
[    8.052234] [<ffffffc00057520c>] device_attach+0x10/0x18
[    8.052237] [<ffffffc000385bf8>] pci_bus_add_device+0x3c/0x60
[    8.052239] [<ffffffc000386014>] pci_bus_add_devices+0x44/0x90
[    8.052241] [<ffffffc000386040>] pci_bus_add_devices+0x70/0x90
[    8.052244] [<ffffffc0000941bc>] pci_common_init_dev+0x2d4/0x394
[    8.052250] [<ffffffbffc0bde54>] pcie_delayed_detect+0x894/0xa80 [pci_tegra]
[    8.052252] [<ffffffc0000bc2d0>] process_one_work+0x154/0x434
[    8.052254] [<ffffffc0000bc6e4>] worker_thread+0x134/0x40c
[    8.052257] [<ffffffc0000c1f30>] kthread+0xe0/0xf4
[    8.052259] [<ffffffc000084f90>] ret_from_fork+0x10/0x40
[    8.052392] ---[ end trace 0394ab4bd3dfd13a ]---
[    8.054399] F1 signature read @0x18000000=0x17214354
[    8.054400] Fixing recursive fault but reboot is needed!
[   36.074307] Watchdog detected hard LOCKUP on cpu 3

I discovered that if the module load is delayed for 15s, or so, there are no issues, however I think this problem is something that should be understood and resolved.

Any help you may provide is greatly appreciated.
Thanks.

I can’t answer your question, but this might get things started.

First, make sure any PCI function you use as a module can actually be in module format. A large part of the PCIe support requires being integrated and not module format. If you’ve written your own driver, and if you’ve written this to be in the form of a module, then try writing it to instead integrated into the kernel and not load as a module.

Second, the stock kernel configuration disables late/delayed probing. People with something like an FPGA require time for the FPGA to boot; either you have to ensure that the FPGA is fully booted before the Jetson starts, or else that you’ve taken effort to enable late/delayed probing (otherwise PCIe may have some of its function disabled…something which would save power if nothing is on PCI).

You might want to describe what is in module format, what is integrated into the kernel, and the nature of anything on PCIe. Other than that I probably can’t add more.

Thanks for your helpful response. It may be better to build the PCIe driver statically as part of the kernel. I didn’t write the driver and am not familiar with PCIe drivers and so currently don’t know enough yet to describe further the module and static components. All I know is that the probe function was not being called due to the above errors, but will investigate further.

I have tried (quick) searches for ‘delayed kernel module config’ and ‘PCI drivers as loadable modules’ but nothing obvious has appeared.

Would it be possible to provide a link explaining how to enable late probing in the kernel? Is this the same as deferred probing?

Also, I would like to understand more the reasoning behind the problems building PCI drivers as kernel modules, so if you have any useful references explaining these issues this would also be helpful to me.

Thanks

Specific changes for late PCIe probing can depend on which release. Someone else will need to give you the specific details, but likely this is also the same as a reference to deferred probing. Keep in mind that if you don’t have PCIe hardware which needs time to boot before providing a valid response, then you don’t need to do this.

The need to integrate a driver into the kernel (versus as a module) is more likely to be useful, and certainly wouldn’t hurt (other than learning curve). Integrating as a non-module might save you some effort guessing at whether module format is even allowed for the specific driver, but perhaps you don’t have the source code…don’t know.

To elaborate a bit on module versus integrated, consider that if some kernel function is built in to the kernel Image, then the moment the bootloader places the kernel into RAM, then the driver is also guaranteed to be present. In the case of a module this content is loaded at some later time…and at that later time the kernel has to read the module from an ext4 filesystem. If for some reason there is an attempt to load a driver on ext4 prior to ext4 itself being loaded, then you have an impossible-to-solve condition.

Additionally, loading a module implies placing the loaded content at some offset address, registering the address, and then doing direct branch instructions to go there. If for some reason the driver did not know it was going to be relocated, then it may not have added the code to support operating as a module. Any init style function would fail.

And of course if the hardware being used is not fully booted/operational at the time of init, then superficially this would be indistinguishable from failure due to not being able to init the driver.

Note that each Linux system has a command to find a version: “uname -r”. This consists of the base kernel source version, plus the CONFIG_LOCALVERSION at the time the kernel is compiled. For example, if the kernel source were “4.9.108”, and if ‘CONFIG_LOCALVERSION="-tegra"’, then “uname -r” would show as “4.9.108-tegra”. When that kernel runs, and when instructed to load modules, then the module search location will be a directory tree here:

/lib/modules/$(uname -r)/

If modules are not there, then you will get 100% module load failure. If the modules are there, but the ext4 driver is itself a module, then you will also get 100% module load failure (don’t worry, ext4 is always integrated…I’m just giving an example). One reason why the modules would not be there is if you forgot to set CONFIG_LOCALVERSION to “-tegra” at the time of compile…then the existing modules would end up being searched for at “/lib/modules/4.9.108/” (example version) instead of “/lib/modules/4.9.108-tegra/”.

My suggestion is that you give a detailed description of the PCIe device…whether this device might need to boot (e.g., FPGA), versus whether it should be available immediately after power on (such as a network card). Also describe the driver…whether it was designed by you or someone else, and if you know, then whether the driver supports module format. Then someone can give a better answer.

Here is an example of late probing, but this is extremely unlikely to contain the exact changes you would need for your current system (this is on an older TX1, not a newer TX2):
https://devtalk.nvidia.com/default/topic/997845/jetson-tx1/force-rescan-of-pcie-bus-/post/5102006/#5102006