Hi NVIDIA team,
We have been trying to integrate the R35.6.0 JetPack release into our product, which is based on the Jetson Xavier NX, but we are hitting occasional kernel oopses. For example (from the serial console):
[13275.636315] soctherm: OC ALARM 0x00000001
[41816.879799] kernel BUG at mm/slub.c:4118!
[41816.879959] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[41816.880155] Modules linked in: fuse cfg80211 tcp_diag inet_diag veth nfnetlink_acct xt_mark xt_MASQUERADE nf_conntrack_netlink br_netfilter overlay ramoops reed_solomon realtek nfnetlink smsc ip6table_nat iptable_nat nf_nat smsc95xx loop snd_soc_tegra210_iqc snd_soc_tegra186_asrc snd_soc_tegra210_ope snd_soc_tegra186_arad snd_soc_tegra186_dspk snd_soc_tegra210_mvc snd_soc_tegra210_afc snd_soc_tegra210_dmic snd_soc_tegra210_adx snd_soc_tegra210_amx snd_soc_tegra210_i2s snd_soc_tegra210_mixer snd_soc_tegra210_admaif snd_soc_tegra_pcm snd_soc_tegra210_sfc aes_ce_blk crypto_simd cryptd aes_ce_cipher binfmt_misc ghash_ce sha2_ce sha256_arm64 sha1_ce snd_soc_spdif_tx snd_soc_tegra_machine_driver snd_soc_tegra210_adsp snd_soc_tegra_utils snd_soc_simple_card_utils nvadsp userspace_alert tegra_bpmp_thermal snd_soc_tegra210_ahub max77620_thermal tegra210_adma snd_hda_codec_hdmi nv_imx219 snd_hda_tegra snd_hda_codec snd_hda_core spi_tegra114 nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt
[41816.880458] nf_log_ipv4 ina3221 nf_log_common pwm_fan ipt_REJECT nf_reject_ipv4 xt_LOG nvgpu(E) xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip6table_filter ip6_tables nvmap iptable_filter ip_tables x_tables [last unloaded: leds_gpio]
[41816.912120] CPU: 1 PID: 2663 Comm: P[cgroups] Tainted: G E 5.10.216-tegra #1
[41816.920246] Hardware name: NVIDIA NVIDIA Jetson Xavier NX Developer Kit/Jetson, BIOS 202210.5-78a917ec-dirty 09/25/2024
[41816.931538] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
[41816.937581] pc : kfree+0x41c/0x4a0
[41816.940987] lr : cgroup_file_release+0x6c/0xc0
[41816.945185] sp : ffff800029d73c90
[41816.948856] x29: ffff800029d73c90 x28: ffff394da1005880
[41816.954114] x27: 0000000000000000 x26: 0000000000000000
[41816.959882] x25: 0000000000000000 x24: 0000000000000000
[41816.965148] x23: ffffa3e81e374000 x22: ffff394da1005880
[41816.970905] x21: ffff394fa8142600 x20: 0000000000000000
[41816.976161] x19: fffffee53e805080 x18: 0000000000000000
[41816.981757] x17: 0000000000000000 x16: 0000000000000000
[41816.987010] x15: 0000000000000000 x14: 0000000000000000
[41816.992781] x13: 0000000000000000 x12: 0000000000000000
[41816.998034] x11: 0000000000000000 x10: 0000000000000000
[41817.003629] x9 : 0000000000000000 x8 : 0000000000000002
[41817.009142] x7 : 07ffffffffffffff x6 : ffff394da11d7d00
[41817.014420] x5 : ffff394da846e9a0 x4 : ffffa3e81e380b28
[41817.019825] x3 : 0000000000000000 x2 : ffffa3e81c492880
[41817.025158] x1 : fffffee53e8050c8 x0 : fffffee53e8050c8
[41817.030773] Call trace:
[41817.033215] kfree+0x41c/0x4a0
[41817.036363] cgroup_file_release+0x6c/0xc0
[41817.040149] kernfs_fop_release+0xa0/0xc0
[41817.044407] __fput+0x80/0x260
[41817.047382] ____fput+0x24/0x30
[41817.050534] task_work_run+0x88/0xe0
[41817.054036] do_notify_resume+0x24c/0x990
[41817.057975] work_pending+0xc/0x738
[41817.061738] Code: f9400660 3707fae0 a9046bf9 f9002bfb (d4210000)
[41817.067868] ---[ end trace 4690051af44342d3 ]---
[41817.088105] Kernel panic - not syncing: Oops - BUG: Fatal exception
[41817.088310] SMP: stopping secondary CPUs
[41817.088444] Kernel Offset: 0x23e80c2c0000 from 0xffff800010000000
[41817.088664] PHYS_OFFSET: 0xffffc6b380000000
[41817.092182] CPU features: 0x48240002,03802a30
[41817.096725] Memory Limit: none
[41817.111103] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---
[0000.025] W> RATCHET: MB1 binary ratchet value 4 is larger than ratchet level 2 from HW fuses.
[0000.033] I> MB1 (prd-version: 2.6.0.0-t194-41334769-cab45716)
[0000.038] I> Boot-mode: Coldboot
[0000.041] I> Platform: Silicon
[0000.044] I> Chip revision : A02P
[0000.047] I> Bootrom patch version : 15 (correctly patched)
We are running a stock Linux kernel and a nearly stock bootloader (we change the boot logo when we build it):
<username>@<machine-name>:~$ uname -a
Linux ep-1026-xavier 5.10.216-tegra #1 SMP PREEMPT Fri Sep 13 08:55:39 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
<username>@<machine-name>:~$ sudo nvbootctrl dump-slots-info
Current version: 35.6.0
Capsule update status: 1
Current bootloader slot: B
Active bootloader slot: B
num_slots: 2
slot: 0, status: normal
slot: 1, status: normal
Our workload consists of Docker containers that interact with the serial ports, USB, HDMI, and an NVMe drive.
Is this something you have seen before with this release?
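In case it helps with comparing this against other reports, here is a minimal sketch of pulling the call-trace symbols out of a saved console capture. It assumes the serial output is stored as plain text in the arm64 oops format shown above. The function name `extract_call_trace` and the `SAMPLE` snippet are hypothetical helpers for illustration, not part of any NVIDIA or kernel tool.

```python
import re

# One stack frame, e.g. "[41817.033215] kfree+0x41c/0x4a0".
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)


def extract_call_trace(log_text):
    """Return the function names listed under the first 'Call trace:' line."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if not in_trace:
            if "Call trace:" in line:
                in_trace = True
            continue
        match = FRAME_RE.match(line.strip())
        if match is None:
            break  # first non-frame line (such as "Code: ...") ends the trace
        frames.append(match.group("sym"))
    return frames


# Excerpt copied from the oops above, used only as sample input.
SAMPLE = """\
[41817.030773] Call trace:
[41817.033215] kfree+0x41c/0x4a0
[41817.036363] cgroup_file_release+0x6c/0xc0
[41817.040149] kernfs_fop_release+0xa0/0xc0
[41817.044407] __fput+0x80/0x260
[41817.061738] Code: f9400660 3707fae0 a9046bf9 f9002bfb (d4210000)
"""

if __name__ == "__main__":
    # Innermost frame first, matching the order printed by the kernel.
    print(" <- ".join(extract_call_trace(SAMPLE)))
```

Running the same function over several captures would show whether every crash goes through the `cgroup_file_release` path or whether the faulting path varies.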