Orin NX 8GB kernel panic during inference aging test

Hello.

When running batch inference with TensorRT in the environment below, a kernel panic occurs.

  • Jetson Orin NX 8G

  • JetPack 5.1.2

  • dynamic shape model - 8 batch operation

  • custom board

[103613.788356] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000038
[103613.797519] Mem abort info:
[103613.800474] ESR = 0x96000006
[103613.803736] EC = 0x25: DABT (current EL), IL = 32 bits
[103613.809301] SET = 0, FnV = 0
[103613.812527] EA = 0, S1PTW = 0
[103613.815859] Data abort info:
[103613.818913] ISV = 0, ISS = 0x00000006
[103613.822952] CM = 0, WnR = 0
[103613.826103] user pgtable: 4k pages, 48-bit VAs, pgdp=000000013d2d0000
[103613.832875] [0000000000000038] pgd=0000000194e15003, p4d=0000000194e15003, pud=000000013d2d2003, pmd=0000000000000000
[103613.843886] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[103613.849697] Modules linked in: ip6table_filter(E) ip6_tables(E) xt_conntrack(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) xt_addrtype(E) iptable_filter(E) br_netfilter(E) lzo_rle(E) lzo_compress(E) rtc_rv8803(E) ia_apl_resource zram(E) overlay(E) ramoops(E) reed_solomon(E) loop(E) snd_soc_tegra210_iqc(E) snd_soc_tegra186_asrc(E) snd_soc_tegra210_ope(E) snd_soc_tegra186_dspk(E) snd_soc_tegra186_arad(E) snd_soc_tegra210_mvc(E) snd_soc_tegra210_afc(E) snd_soc_tegra210_dmic(E) snd_soc_tegra210_adx(E) snd_soc_tegra210_mixer(E) snd_soc_tegra210_amx(E) snd_soc_tegra210_i2s(E) snd_soc_tegra210_admaif(E) snd_soc_tegra210_sfc(E) snd_soc_tegra_pcm(E) aes_ce_blk(E) crypto_simd(E) cryptd(E) aes_ce_cipher(E) ghash_ce(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) snd_soc_tegra_machine_driver(E) snd_soc_spdif_tx(E) snd_soc_tegra210_adsp(E) snd_soc_tegra_utils(E) snd_soc_simple_card_utils(E)
[103613.849752] snd_soc_tegra210_ahub(E) nvadsp(E) spi_tegra114(E) userspace_alert(E) tegra210_adma(E) snd_hda_codec_hdmi(E) tegra_bpmp_thermal(E) snd_hda_tegra(E) snd_hda_codec(E) snd_hda_core(E) r8168(E) nvidia(OE) binfmt_misc(E) ina3221(E) pwm_fan(E) nvgpu(E) nvmap(E) ip_tables(E) x_tables(E) [last unloaded: mtd]
[103613.968389] CPU: 4 PID: 4093291 Comm: ObjectDetector Tainted: G OE 5.10.120-tegra #2
[103613.977745] Hardware name: Unknown NVIDIA Orin NX Developer Kit/NVIDIA Orin NX Developer Kit, BIOS 4.1-33958178 08/01/2023
[103613.995418] pstate: 20400009 (nzCv daif +PAN -UAO -TCO BTYPE=--)
[103613.995418] pc : nvmap_free_handle_from_fd+0xd0/0x240 [nvmap]
[103614.001403] lr : nvmap_free_handle_from_fd+0x80/0x240 [nvmap]
[103614.007385] sp : ffff80004235bd60
[103614.010864] x29: ffff80004235bd60 x28: ffff16a88e78ac40
[103614.016390] x27: 0000000000000000 x26: 0000000000000000
[103614.021915] x25: 0000000000000000 x24: 0000000000000000
[103614.027442] x23: 0000000000000000 x22: 0000000000000000
[103614.032975] x21: 0000000080000279 x20: ffff16a8b85af980
[103614.038513] x19: ffff16a79c653400 x18: 0000000000000000
[103614.044041] x17: 0000000000000000 x16: ffffcf90bfcff430
[103614.049565] x15: 0000000000000000 x14: 0000000000000000
[103614.055090] x13: 0000000000000000 x12: 0000000000000000
[103614.060686] x11: ffff80004235bce8 x10: 0000000000000002
[103614.066232] x9 : 0000000000000001 x8 : 0000000000000238
[103614.071766] x7 : 0000000147358000 x6 : 0000000000000018
[103614.077291] x5 : ffffcf90809a4800 x4 : 0000000000000003
[103614.082819] x3 : 0000000000000002 x2 : ffff16a8af360000
[103614.088346] x1 : 0000000000000003 x0 : 0000000000000000
[103614.093880] Call trace:
[103614.096476] nvmap_free_handle_from_fd+0xd0/0x240 [nvmap]
[103614.102087] nvmap_ioctl_free+0x44/0x90 [nvmap]
[103614.106825] __traceiter_refcount_free_handle+0x2784/0x6010 [nvmap]
[103614.113343] __arm64_sys_ioctl+0xac/0xf0
[103614.117453] el0_svc_common.constprop.0+0x80/0x1d0
[103614.122442] do_el0_svc+0x38/0xb0
[103614.125923] el0_svc+0x1c/0x30
[103614.129128] el0_sync_handler+0xa8/0xb0
[103614.133150] el0_sync+0x16c/0x180
[103614.136632] Code: b40000c2 f9400440 f00000a5 912000a5 (f9401c04)
[103614.142969] ---[ end trace 4523cf6e7a9f4e27 ]---
[103614.153640] Kernel panic - not syncing: Oops: Fatal exception
[103614.159629] SMP: stopping secondary CPUs
[103614.163744] Kernel Offset: 0x4f90afcc0000 from 0xffff800010000000
[103614.170073] PHYS_OFFSET: 0xffffe95980000000
[103614.174446] CPU features: 0x08040006,4a80aa38
[103614.178998] Memory Limit: none
[103614.188011] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
WARNING @ [platform/drivers/mailbox/ivc_link_provider/mail_imo.c]: mail imo TX timeout
failed to send thermal trip message to cpu!
WARNING @ [platform/drivers/mailbox/ivc_link_provider/mail_imo.c]: mail imo TX timeout
failed to send thermal trip message to cpu!
WARNING @ [platform/drivers/mailbox/ivc_link_provider/mail_imo.c]: mail imo TX timeout
failed to send thermal trip message to cpu!
WARNING @ [platform/drivers/mailbox/ivc_link_provider/mail_imo.c]: mail imo TX timeout
failed to send thermal trip message to cpu!
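
For readers parsing the oops, the fields can be cross-checked by hand. The following is a minimal sketch that splits the `ESR` value and the faulting instruction word from the `Code:` line into their bit fields, assuming the standard AArch64 encodings. The helper names `decode_esr` and `decode_ldr64_uimm` are made up for illustration and are not part of any NVIDIA or kernel tool.

```python
def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR_ELx value into the fields the kernel prints."""
    iss = esr & 0x1FFFFFF              # instruction specific syndrome, bits [24:0]
    return {
        "EC": (esr >> 26) & 0x3F,      # exception class, bits [31:26]
        "IL": (esr >> 25) & 0x1,       # instruction length, bit 25 (1 = 32-bit)
        "ISS": iss,
        "WnR": (iss >> 6) & 0x1,       # data abort: 0 = read, 1 = write
        "DFSC": iss & 0x3F,            # data fault status code
    }


def decode_ldr64_uimm(insn: int) -> tuple:
    """Decode a 64-bit LDR (immediate, unsigned offset) into (Rt, Rn, byte offset)."""
    if insn & 0xFFC00000 != 0xF9400000:
        raise ValueError("not a 64-bit LDR with unsigned immediate offset")
    offset = ((insn >> 10) & 0xFFF) * 8   # imm12 is scaled by the 8-byte access size
    rn = (insn >> 5) & 0x1F               # base register
    rt = insn & 0x1F                      # destination register
    return rt, rn, offset


# Values taken from the log above.
print({k: hex(v) for k, v in decode_esr(0x96000006).items()})
print(decode_ldr64_uimm(0xF9401C04))
```

Read this way, the fault is a data abort on a read with `DFSC = 0x06` (a level 2 translation fault, consistent with `pmd=0000000000000000`), and the faulting instruction is a load from `[x0 + 0x38]` with `x0 = 0`. That matches the reported fault address `0000000000000038`, so it looks like a structure field being read through a NULL pointer inside `nvmap_free_handle_from_fd`.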

The panic occurs at random, anywhere between 10 and 24 hours into the aging test.

Under the same conditions, aging with a static batch-1 model running in 8 separate threads did not produce a kernel panic.

The problem appeared only after we switched to dynamic batching, so please let us know whether this is a known issue or whether there are specific points we should debug.
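
One check we could run on our side during aging is to log the number of open file descriptors of the inference process. Because the crash is in `nvmap_free_handle_from_fd`, a difference in fd behavior between the dynamic-batch and static-batch runs might be a useful data point. This is only a sketch under that unconfirmed assumption. The function names, the sampling interval, and the `proc_root` parameter are hypothetical, and reading `/proc/<pid>/fd` requires the same user or root.

```python
import os
import time


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of entries under <proc_root>/<pid>/fd."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def log_fd_counts(pid, interval_s=60.0, samples=3, proc_root="/proc"):
    """Sample the fd count `samples` times, printing and returning each value."""
    counts = []
    for i in range(samples):
        count = count_open_fds(pid, proc_root)
        counts.append(count)
        print(f"{time.strftime('%H:%M:%S')} pid={pid} open_fds={count}")
        if i < samples - 1:
            time.sleep(interval_s)   # wait between samples, not after the last one
    return counts
```

For example, `log_fd_counts(pid, interval_s=60.0, samples=600)` with the PID of the `ObjectDetector` process would cover ten hours. A steadily changing count would not prove a driver bug, but it would help narrow down where to look.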

Please note that upgrading JetPack is not possible in our situation.

Thank you.

Hi,

Could you test if this issue can also be reproduced on the devkit?
Thanks.