Jetpack-4.5.1 silent panic when booting, with kernel patch

Hello,

I have flashed a Jetson TX2 on a devkit with jetpack-4.5.1 using sdkmanager. The vanilla installation seems to work. I also have a custom daughter-board that I can plug into the devkit, and I have created a device tree (DT) containing the description of both the devkit and the daughter-board.
I used to do exactly the same with jetpack-4.3, and that worked perfectly, but now with jetpack-4.5.1, Linux fails to start in the early boot phase:

Retrieving file: /boot/dtbfile
260918 bytes read in 33 ms (7.5 MiB/s)
## Flattened Device Tree blob at 88400000
   Booting using the fdt blob at 0x88400000
ERROR: reserving fdt memory region failed (addr=0 size=0)
ERROR: reserving fdt memory region failed (addr=0 size=0)
ERROR: reserving fdt memory region failed (addr=0 size=0)
   Using Device Tree in place at 0000000088400000, end 0000000088442b35
copying carveout for /host1x@13e00000/display-hub@15200000/display@15200000...
copying carveout for /host1x@13e00000/display-hub@15200000/display@15210000...
copying carveout for /host1x@13e00000/display-hub@15200000/display@15220000...

Starting kernel ...

[    0.000000] Booting Linux on physical CPU 0x100
[    0.000000] Linux version 4.9.201-jp451-0.macq~.jp451.fix.boot.crash-gc636435 (jenkinsbld@docker-macq-build-ubuntu18.04-64) (gcc version 7.5.0 (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) ) #1 SMP PREEMPT Tue Jun 8 08:40:16 CEST 2021
[    0.000000] Boot CPU: AArch64 Processor [411fd073]
[    0.000000] OF: fdt:memory scan node memory@80000000, reg size 80,
[    0.000000] OF: fdt: - 80000000 ,  70000000
[    0.000000] OF: fdt: - f0200000 ,  185600000
[    0.000000] OF: fdt: - 275e00000 ,  200000
[    0.000000] OF: fdt: - 276600000 ,  200000
[    0.000000] OF: fdt: - 277000000 ,  200000
[    0.000000] earlycon: uart8250 at MMIO32 0x0000000003100000 (options '')
[    0.000000] bootconsole [uart8250] enabled
[    0.000000] OF: fdt:Reserved memory: failed to reserve memory for node 'fb2_carveout': base 0x0000000000000000, size 0 MiB
[    0.000000] OF: fdt:Reserved memory: failed to reserve memory for node 'fb2_carveout': base 0x0000000000000000, size 0 MiB
[    0.000000] OF: fdt:Reserved memory: failed to reserve memory for node 'fb1_carveout': base 0x0000000000000000, size 0 MiB
[    0.000000] OF: fdt:Reserved memory: failed to reserve memory for node 'fb1_carveout': base 0x0000000000000000, size 0 MiB
[    0.000000] OF: fdt:Reserved memory: failed to reserve memory for node 'fb0_carveout': base 0x0000000000000000, size 0 MiB
[    0.000000] OF: fdt:Reserved memory: failed to reserve memory for node 'fb0_carveout': base 0x0000000000000000, size 0 MiB
[    0.000000] OF: reserved mem: initialized node vpr-carveout, compatible id nvidia,vpr-carveout
[    0.000000] OF: reserved mem: initialized node ramoops_carveout, compatible id nvidia,ramoops
[    0.000000] cma: Reserved 64 MiB at 0x00000000fc000000
[    0.000000] psci: probing for conduit method from DT.
[    0.000000] psci: PSCIv1.0 detected in firmware.
[    0.000000] psci: Using standard PSCI v0.2 function IDs
[    0.000000] psci: MIGRATE_INFO_TYPE not supported.
[    0.000000] psci: SMC Calling Convention v1.1
[    0.000000] percpu: Embedded 24 pages/cpu s58200 r8192 d31912 u98304
[    0.000000] Speculative Store Bypass Disable mitigation not required
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 2022968
[    0.000000] Kernel command line: console=ttyS0,115200 androidboot.presilicon=true firmware_class.path=/etc/firmware root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyS0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 isolcpus=1-2  video=tegrafb no_console_suspend=1 earlycon=uart8250,mmio32,0x3100000 nvdumper_reserved=0x2772e0000 gpt rootfs.slot_suffix= usbcore.old_scheme_first=1 tegraid=18.1.2.0.0 maxcpus=6 boot.slot_suffix= boot.ratchetvalues=0.2031647.1 vpr_resize bl_prof_dataptr=0x10000@0x275840000 sdhci_tegra.en_boot_part_access=1
[    0.000000] log_buf_len individual max cpu contribution: 32768 bytes
[    0.000000] log_buf_len total cpu_extra contributions: 163840 bytes
[    0.000000] log_buf_len min size: 262144 bytes
[    0.000000] log_buf_len: 524288 bytes
[    0.000000] early log buf free: 258616(98%)
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[    0.000000] Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes)
[    0.000000] Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes)
[    0.000000] Memory: 7292184K/8220672K available (15166K kernel code, 2692K rwdata, 5936K rodata, 2688K init, 851K bss, 174824K reserved, 753664K cma-reserved)
[    0.000000] Virtual kernel memory layout:
[    0.000000]     modules : 0xffffff8000000000 - 0xffffff8008000000   (   128 MB)
[    0.000000]     vmalloc : 0xffffff8008000000 - 0xffffffbebfff0000   (   250 GB)
[    0.000000]       .text : 0xffffff8008080000 - 0xffffff8008f50000   ( 15168 KB)
[    0.000000]     .rodata : 0xffffff8008f50000 - 0xffffff8009520000   (  5952 KB)
[    0.000000]       .init : 0xffffff8009520000 - 0xffffff80097c0000   (  2688 KB)
[    0.000000]       .data : 0xffffff80097c0000 - 0xffffff8009a61008   (  2693 KB)
[    0.000000]        .bss : 0xffffff8009a61008 - 0xffffff8009b35f4c   (   852 KB)
[    0.000000]     fixed   : 0xffffffbefe7fd000 - 0xffffffbefec00000   (  4108 KB)
[    0.000000]     PCI I/O : 0xffffffbefee00000 - 0xffffffbeffe00000   (    16 MB)
[    0.000000]     vmemmap : 0xffffffbf00000000 - 0xffffffc000000000   (     4 GB maximum)
[    0.000000]               0xffffffbf00000000 - 0xffffffbf07dc8000   (   125 MB actual)
[    0.000000]     memory  : 0xffffffc000000000 - 0xffffffc1f7200000   (  8050 MB)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=6, Nodes=1
[    0.000000] Preemptible hierarchical RCU implementation.
[    0.000000]  Build-time adjustment of leaf fanout to 64.
[    0.000000]  RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=6.
[    0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=6
[    0.000000] NR_IRQS:64 nr_irqs:64 0
[    0.000000] GIC: Using split EOI/Deactivate mode
[    0.000000] arm_arch_timer: Architected cp15 timer(s) running at 31.25MHz (phys).
[    0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0xe6a171046, max_idle_ns: 881590405314 ns
[    0.000003] sched_clock: 56 bits at 31MHz, resolution 32ns, wraps every 4398046511088ns
[    0.009414] Console: colour dummy device 80x25
[    0.014074] console [tty0] enabled
[    0.017628] bootconsole [uart8250] disabled

and then it reboots :(

After struggling with the kernel sources, I finally discovered that there is a panic caused by a NULL pointer dereference, but it happens after the bootconsole is disabled and before the regular console has a chance to print anything. I tracked it down to a use-after-free of a dynamically allocated struct in drivers/iommu/arm-smmu.c.

More details:
arm_smmu_device_dt_probe allocates a ‘struct arm_smmu_device’ using devm_kzalloc and keeps a reference to it in ‘smmu_handle’. If arm_smmu_device_dt_probe returns an error for any reason, all the memory allocated with the devm_ family of functions is freed, including the struct pointed to by ‘smmu_handle’, at the following point:

[    5.227613] [<ffffff800822ab68>] kfree+0x2d0/0x2d8
[    5.232617] [<ffffff800882cee0>] release_nodes+0x138/0x208
[    5.238338] [<ffffff800882d454>] devres_release_all+0x3c/0x60
[    5.244336] [<ffffff8008827ee0>] driver_probe_device+0x2b0/0x450
[    5.250607] [<ffffff8008828250>] __device_attach_driver+0xa8/0x148
[    5.257059] [<ffffff8008825888>] bus_for_each_drv+0x58/0xa8
[    5.262878] [<ffffff8008827a8c>] __device_attach+0xbc/0x138
[    5.268704] [<ffffff800882836c>] device_initial_probe+0x24/0x30
[    5.274887] [<ffffff8008826bb4>] bus_probe_device+0x9c/0xa8
[    5.280705] [<ffffff8008824218>] device_add+0x3d0/0x5d8
[    5.286161] [<ffffff8008c448d0>] of_device_add+0x40/0x50
[    5.291712] [<ffffff8008c450ac>] of_platform_device_create_pdata+0x9c/0x100
[    5.298989] [<ffffff8008c45148>] of_platform_device_create+0x38/0x48
[    5.305629] [<ffffff800956ebb8>] arm_smmu_of_setup+0xdc/0x118
[    5.311625] [<ffffff800956e7b0>] of_iommu_init+0x48/0x90
[    5.317172] [<ffffff8008083bfc>] do_one_initcall+0x104/0x148
[    5.323086] [<ffffff8009530d10>] kernel_init_freeable+0x1bc/0x25c
[    5.329456] [<ffffff8008f3e4a0>] kernel_init+0x18/0x108
[    5.334918] [<ffffff80080838a0>] ret_from_fork+0x10/0x30

The freed memory is later allocated to some other kernel driver, and when smmu_handle is used again, one gets this:

[    3.271180] Call trace:
[    3.273735] [<ffffff800870e42c>] arm_smmu_add_device+0x124/0x5e0
[    3.280006] [<ffffff8008706170>] iommu_bus_notifier+0xe8/0x138
[    3.286095] [<ffffff80080dc66c>] notifier_call_chain+0x5c/0xa0
[    3.292184] [<ffffff80080dd12c>] blocking_notifier_call_chain+0x64/0x88
[    3.299092] [<ffffff8008823f8c>] device_add+0x3bc/0x5d8
[    3.304549] [<ffffff8008c44568>] of_device_add+0x40/0x50
[    3.310095] [<ffffff8008c44d44>] of_platform_device_create_pdata+0x9c/0x100
[    3.317363] [<ffffff8008c45034>] of_platform_bus_create+0x104/0x468
[    3.323908] [<ffffff8008c4559c>] of_platform_populate+0x8c/0x140
[    3.330181] [<ffffff80095712bc>] of_platform_default_populate_init+0x68/0x7c
[    3.337541] [<ffffff8008083bf0>] do_one_initcall+0xf8/0x130
[    3.343358] [<ffffff8009520d10>] kernel_init_freeable+0x1bc/0x25c
[    3.349723] [<ffffff8008f3df70>] kernel_init+0x18/0x108
[    3.355178] [<ffffff80080838a0>] ret_from_fork+0x10/0x30
[    3.360738] ---[ end trace 795dde86e029b986 ]---
[    3.369631] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    3.369631]
[    3.379187] SMP: stopping secondary CPUs
[    3.387232] Rebooting in 5 seconds..

Here is a possible patch to avoid the silent panic :

index 35735329bde4..3fdf5baca54c 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -2771,7 +2771,29 @@  arm_smmu_device_dt_probe(struct platform_device *pdev)
        if (tegra_platform_is_unit_fpga())
                return -ENODEV;

-       smmu = devm_kzalloc(dev, sizeof(*smmu), GFP_KERNEL);
+       smmu = kzalloc(sizeof(*smmu), GFP_KERNEL);
        if (!smmu) {
                dev_err(dev, "failed to allocate arm_smmu_device\n");
                return -ENOMEM;

I don’t know if the smmu driver works really well thereafter, but at least the kernel does not crash and we get some informative messages.
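
To make the lifetime problem described above easier to follow, here is a minimal userspace sketch in C. It is not kernel code: every name in it (fake_smmu, fake_device, fake_devm_zalloc, fake_devres_release_all, fake_smmu_handle and so on) is hypothetical and only imitates the roles of devm_kzalloc, devres_release_all and smmu_handle. A static slot with an "alive" flag replaces real allocation so that looking at a released object stays safe in the demo. It also shows one possible alternative to the patch, assuming the later users of the handle check it for NULL: unpublish the handle on the probe error path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct arm_smmu_device. */
struct fake_smmu {
    int alive;  /* 1 while the "allocation" is valid, 0 once released */
};

/* Hypothetical device that owns at most one managed allocation. */
struct fake_device {
    struct fake_smmu *managed;
};

/* Static backing store, so inspecting a "freed" object is safe here. */
static struct fake_smmu fake_slot;

/* Global handle, playing the role that smmu_handle plays in the post. */
static struct fake_smmu *fake_smmu_handle;

/* Rough analogue of devm_kzalloc(): zeroed memory owned by the device. */
static struct fake_smmu *fake_devm_zalloc(struct fake_device *dev)
{
    assert(dev != NULL);
    memset(&fake_slot, 0, sizeof(fake_slot));
    fake_slot.alive = 1;
    dev->managed = &fake_slot;
    return dev->managed;
}

/* Rough analogue of devres_release_all(): drop what the device owns. */
static void fake_devres_release_all(struct fake_device *dev)
{
    if (dev->managed) {
        dev->managed->alive = 0;
        dev->managed = NULL;
    }
}

/* Probe that publishes the global handle and then optionally fails. */
static int fake_probe(struct fake_device *dev, int fail, int clear_on_error)
{
    fake_smmu_handle = fake_devm_zalloc(dev);
    if (fail) {
        if (clear_on_error)
            fake_smmu_handle = NULL;  /* alternative fix: unpublish */
        return -1;
    }
    return 0;
}

/* Imitates the driver core: a failed probe releases managed resources. */
static int fake_driver_probe_device(struct fake_device *dev, int fail,
                                    int clear_on_error)
{
    int ret = fake_probe(dev, fail, clear_on_error);

    if (ret)
        fake_devres_release_all(dev);
    return ret;
}

/* Returns 1 if the global handle still points at released memory. */
static int fake_handle_is_dangling(void)
{
    return fake_smmu_handle != NULL && !fake_smmu_handle->alive;
}
```

With a zero-initialized fake_device, calling fake_driver_probe_device(&dev, 1, 0) leaves the handle dangling, which is the situation described above. Calling it with clear_on_error set to 1 leaves the handle NULL instead, so a later user can detect the failed probe, provided it checks for NULL before dereferencing.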

May I know how you integrated JetPack-4.5.1? Are you using an OTA update?

I did not use OTA.

It is on a devkit I have locally.

Previously I had installed jetpack-4.3 on it, using sdkmanager, and then I added an entry in /boot/extlinux/extlinux.conf, specifying my dtb and kernel, which were based on the tegra-l4t-r32.3.1 sources, with some patches to describe my daughter-board and add drivers for its components. That worked perfectly.

I have now done exactly the same steps with jetpack-4.5.1: I installed the full original jetpack-4.5.1 on my devkit using sdkmanager and added an entry in /boot/extlinux/extlinux.conf with my new dtb and kernel, which I compiled after rebasing all my sources in the kernel/ and hardware/ directories onto tegra-l4t-r32.5.1 using the following command:

git rebase --onto=tegra-l4t-r32.5.1 tegra-l4t-r32.3.1 mybranch

That’s all

hello phdm,

just to double-check that you’re syncing the r32.5.1 sources correctly,
please check whether your git log is identical to my code-line.
for example,
/kernel/nvidia$ git log --oneline

6dc57fe thermal: continuous: add custom dt entry support
3e3d6b4 drivers: therm_fan_est: Add crit_temp update sysfs node
1cddff3 pwm: fan: Add support for always on fan
879397a thermal: pwm-fan: Add support for tmargin
c86a584 video: tegra: host: Reorder poll event creation to avoid potential race
...

Yes, I have those commits in my branch :

6dc57fec3 (tag: tegra-l4t-r32.5.1, tag: tegra-l4t-r32.5, upstream/l4t/l4t-r32.5) thermal: continuous: add custom dt entry support
3e3d6b491 drivers: therm_fan_est: Add crit_temp update sysfs node
1cddff3fa pwm: fan: Add support for always on fan
879397afa thermal: pwm-fan: Add support for tmargin
c86a58410 video: tegra: host: Reorder poll event creation to avoid potential race

hello phdm,

we recently found that there’s an issue with using the FDT entry to specify the device tree blob.
have you already ruled that out by loading the dtb via the kernel-dtb partition for testing?
thanks

Sorry, I think we are in the wrong thread. For the ‘nvpmodel -m 0’ problem, the DT is loaded correctly.

Here, for the silent panic in the Linux boot phase, the bug I spotted in the kernel sources is triggered when the DT is not loaded correctly, but it could also be triggered by a missing property in the DT.

hello phdm,

have you completed the pinmux spreadsheet customization and used the generated cfg file to perform a full flash, and did the boot hang start after that?

No, I flashed the default TX2 jetpack-4.5.1 on the devkit with sdkmanager. The boot hang is caused by a combination of u-boot not correctly loading the FDT file and the bug in the smmu driver. The bug in the smmu driver can be worked around in drivers/iommu/arm-smmu.c, like I proposed above, or fixed in many other better ways, I assume.

Hi @JerryChang, I’m seeing something similar, which might be the same issue, but on the TX2 NX. If the FDT entry in extlinux.conf points to the standard, unmodified device tree for this device (tegra186-p3636-0001-p3509-0000-a01.dtb),
booting hangs at the last line of this log:

Retrieving file: /boot/tegra186-p3636-0001-p3509-0000-a01.dtb
191624 bytes read in 49 ms (3.7 MiB/s)
## Flattened Device Tree blob at 82400000
   Booting using the fdt blob at 0x82400000
ERROR: reserving fdt memory region failed (addr=0 size=0)
ERROR: reserving fdt memory region failed (addr=0 size=0)
ERROR: reserving fdt memory region failed (addr=0 size=0)
   Using Device Tree in place at 0000000082400000, end 0000000082431c87
copying carveout for /host1x@13e00000/display-hub@15200000/display@15200000...
copying carveout for /host1x@13e00000/display-hub@15200000/display@15210000...
copying carveout for /host1x@13e00000/display-hub@15200000/display@15220000...
DT property /chosen/nvidia,bluetooth-mac missing in source; can't copy
DT property /chosen/nvidia,wifi-mac missing in source; can't copy

Starting kernel ...

[    0.000000] Booting Linux on physical CPU 0x100
[    0.000000] Linux version 4.9.201-l4t-r32.5 (oe-user@oe-host) (gcc version 9.3.0 (GCC) ) #1 SMP PREEMPT Thu May 6 13:07:24 UTC 2021
[    0.000000] Boot CPU: AArch64 Processor [411fd073]
[    0.000000] OF: fdt:memory scan node memory@80000000, reg size 80,
[    0.000000] OF: fdt: - 80000000 ,  70000000
[    0.000000] OF: fdt: - f0200000 ,  85600000
[    0.000000] OF: fdt: - 175e00000 ,  200000
[    0.000000] OF: fdt: - 176600000 ,  200000
[    0.000000] OF: fdt: - 177000000 ,  200000
[    0.000000] earlycon: uart8250 at MMIO32 0x0000000003100000 (options '')
[    0.000000] bootconsole [uart8250] enabled
[    0.000000] Found tegra_fbmem: 00800000@96081000
[    0.000000] Found lut_mem: 00002008@9607e000
[    0.000000] OF: fdt:Reserved memory: failed to reserve memory for node 'fb2_carveout': base 0x0000000000000000, size 0 MiB
[    0.000000] OF: fdt:Reserved memory: failed to reserve memory for node 'fb2_carveout': base 0x0000000000000000, size 0 MiB
[    0.000000] OF: fdt:Reserved memory: failed to reserve memory for node 'fb1_carveout': base 0x0000000000000000, size 0 MiB
[    0.000000] OF: fdt:Reserved memory: failed to reserve memory for node 'fb1_carveout': base 0x0000000000000000, size 0 MiB
[    0.000000] OF: reserved mem: initialized node vpr-carveout, compatible id nvidia,vpr-carveout
[    0.000000] OF: reserved mem: initialized node ramoops_carveout, compatible id nvidia,ramoops
[    0.000000] cma: Reserved 64 MiB at 0x00000000fc000000
[    0.000000] psci: probing for conduit method from DT.
[    0.000000] psci: PSCIv1.0 detected in firmware.
[    0.000000] psci: Using standard PSCI v0.2 function IDs
[    0.000000] psci: MIGRATE_INFO_TYPE not supported.
[    0.000000] psci: SMC Calling Convention v1.1

I’m using the 32.5.1 u-boot from nv-tegra.nvidia.com Git - 3rdparty/u-boot.git/commit
U-Boot 2020.04 (Dec 17 2020 - 23:03:28 +0000)

A second, related issue I see is that the “/chosen/bootargs:” element present in fdt_copy_prop_paths causes ANY extra kernel cmdline arguments passed through APPEND in extlinux.conf to be ignored:

https://nv-tegra.nvidia.com/gitweb/?p=3rdparty/u-boot.git;a=blob;f=include/configs/tegra186-common.h;h=42141fddee5bc7f0792bb2ea269f3ca8f4f9824b;hb=6b630d64fd86beec3efb3312581c50ee8e23a05b#l61

so only the arguments from cboot get to the final kernel cmdline.
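
The effect described above can be modelled with a small C sketch. It is not U-Boot code: fake_dt, fake_set_bootargs_from_append, fake_copy_bootargs and fake_boot are hypothetical names, and the order of the two steps (APPEND first, then the property copy) is an assumption chosen only because it reproduces the observed result, where the copied /chosen/bootargs replaces whatever APPEND supplied.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FAKE_ARGS_MAX 256

/* Hypothetical stand-in for a device tree holding only /chosen/bootargs. */
struct fake_dt {
    char bootargs[FAKE_ARGS_MAX];
};

/* Assumed step 1: the APPEND line becomes the kernel DT's bootargs. */
static void fake_set_bootargs_from_append(struct fake_dt *kernel_dt,
                                          const char *append)
{
    assert(kernel_dt != NULL && append != NULL);
    snprintf(kernel_dt->bootargs, sizeof(kernel_dt->bootargs), "%s", append);
}

/* Assumed step 2: a listed property is copied from the bootloader DT,
 * replacing the destination value rather than merging with it. */
static void fake_copy_bootargs(const struct fake_dt *cboot_dt,
                               struct fake_dt *kernel_dt)
{
    snprintf(kernel_dt->bootargs, sizeof(kernel_dt->bootargs), "%s",
             cboot_dt->bootargs);
}

/* Returns 1 if arg occurs in the DT's bootargs string. */
static int fake_has_arg(const struct fake_dt *dt, const char *arg)
{
    return strstr(dt->bootargs, arg) != NULL;
}

/* Models one boot, with or without bootargs in the copy list. */
static void fake_boot(const struct fake_dt *cboot_dt,
                      struct fake_dt *kernel_dt, const char *append,
                      int bootargs_in_copy_list)
{
    fake_set_bootargs_from_append(kernel_dt, append);
    if (bootargs_in_copy_list)
        fake_copy_bootargs(cboot_dt, kernel_dt);
}
```

In this model, booting with bootargs_in_copy_list set to 1 leaves only the bootloader-provided arguments in the kernel DT, and the APPEND arguments are lost. With it set to 0, the APPEND arguments survive.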

Do you know if these two issues will be addressed in u-boot 32.5.1? Thank you

for this issue,
please refer to Jetpack 4.5.1, TX2, BUG : FDT selected file loaded incorrectly by uboot - #16 by JerryChang.
you could apply the u-boot binary from that thread to confirm,
thanks

Thanks for the quick reply @JerryChang , I can confirm that the u-boot in the above thread solves the kernel cmdline extra arguments issue.

Do you know if there are some patches in upstream that fix the args issue and the FDT one, even if they are pending?

Thank you

hello AlexCo,

please check this comment for the details: Jetpack 4.5.1, TX2, BUG : FDT selected file loaded incorrectly by uboot - #14 by TWarren.
this bug fix has already been included in the next public release code-line, so you should expect JetPack-4.6 to include the fix.
thanks