Module: Xavier 32G
System version:
Software part of jetson-stats 4.2.8 - (c) 2024, Raffaello Bonghi
Model: Jetson-AGX - Jetpack 5.1.3 [L4T 35.5.0]
NV Power Mode[0]: MAXN
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
- 699-level Part Number: 699-82888-0004-400 R.0
- P-Number: p2888-0004
- Module: NVIDIA Jetson AGX Xavier (32 GB ram)
- SoC: tegra194
- CUDA Arch BIN: 7.2
- Codename: Galen
Platform:
- Machine: aarch64
- System: Linux
- Distribution: Ubuntu 20.04 Focal Fossa
- Release: 5.10.192
- Python: 3.8.10
jtop:
- Version: 4.2.8
- Service: Inactive
Libraries:
- CUDA: 11.4.315
- cuDNN: 8.6.0.166
- TensorRT: 8.5.2.2
- VPI: 2.4.8
- Vulkan: 1.3.204
- OpenCV: 4.5.4 - with CUDA: NO
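Problem description:
The system was running normally, with no obvious CPU or memory anomalies, but an occasional kernel panic causes the system to reboot.
Logs are as follows: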
099_panic.tar.gz (92.9 KB)
Hi,
Is there any specific driver running to trigger this error?
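Hello, thanks for the reply. No special drivers are running on the system (assuming CAN does not count as a special driver). In the panic logs I captured, the stack traces printed across multiple panics are all quite similar. Could this be caused by a bug in JP 5.1.3, for example in memory reclaim management? Also, the programs we run put a fairly high load on the system.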
Hi,
It looks like there is always an error from "gst-plugin-scan[278814]: unhandled exception:" when you hit this issue.
You will probably need to provide a method to reproduce this on an NV devkit. Otherwise we may not be able to look into it.
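Hello, similar panics have also occurred on our other Xavier units, and the gst-plugin-scan error does not appear in those logs. Also, can gst-plugin-scan be uninstalled or disabled?
Hello, can any useful information be extracted from the panic logs? Have other customers run into similar problems? The panic logs from several of our domain controllers all look similar to this one: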
[ 176.556292] Unable to handle kernel paging request at virtual address fffffffffdfbbf3f
[ 176.556548] Mem abort info:
[ 176.556618] ESR = 0x96000021
[ 176.556704] EC = 0x25: DABT (current EL), IL = 32 bits
[ 176.556820] SET = 0, FnV = 0
[ 176.556906] EA = 0, S1PTW = 0
[ 176.556974] Data abort info:
[ 176.557035] ISV = 0, ISS = 0x00000021
[ 176.557111] CM = 0, WnR = 0
[ 176.557178] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000007bf510000
[ 176.557304] [fffffffffdfbbf3f] pgd=000000087fe4f003, p4d=000000087fe4f003, pud=0000000000000000
[ 176.557484] Internal error: Oops: 96000021 [#1] PREEMPT SMP
[ 176.557592] Modules linked in: xt_nat xt_tcpudp veth fuse xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter br_netfilter switch mttcan can_dev can_raw can lzo_rle lzo_compress zram overlay ramoops reed_solomon mv88e6xxx dsa_core loop camera ox03c10 imx390 nvgpu snd_soc_tegra186_asrc snd_soc_tegra210_ope snd_soc_tegra186_dspk snd_soc_tegra186_arad snd_soc_tegra210_mvc snd_soc_tegra210_iqc snd_soc_tegra210_afc snd_soc_tegra210_dmic snd_soc_tegra210_i2s snd_soc_tegra210_adx snd_soc_tegra210_amx snd_soc_tegra210_admaif snd_soc_tegra_pcm snd_soc_tegra210_sfc snd_soc_tegra210_mixer aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce sha2_ce snd_soc_tegra_machine_driver snd_soc_tegra210_adsp sha256_arm64 sha1_ce snd_soc_tegra_utils binfmt_misc pwm_fan snd_soc_simple_card_utils snd_soc_spdif_tx nvadsp max77620_thermal ina3221 snd_soc_tegra210_ahub nct1008 tegra_bpmp_thermal tegra210_adma
[ 176.557801] nvethernet userspace_alert nvmap snd_soc_sgtl5000 spi_tegra114 gw5200 max9295 max96712 max9296 max2008X spidev ip_tables x_tables [last unloaded: mtd]
[ 176.623151] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G OE 5.10.192 #3
[ 176.630759] Hardware name: Jetson-AGX (DT)
[ 176.634960] pstate: 00400089 (nzcv daIf +PAN -UAO -TCO BTYPE=--)
[ 176.641005] pc : percpu_ref_get_many+0x90/0xb0
[ 176.645482] lr : percpu_ref_get_many+0x88/0xb0
[ 176.649663] sp : ffff80001002bc60
[ 176.653076] x29: ffff80001002bc60 x28: 00000000fffffff8
[ 176.658588] x27: ffff69fad328fa80 x26: ffff69f8c00d5900
[ 176.664357] x25: ffff80001002bdb0 x24: ffffc144777cf000
[ 176.669870] x23: 0000020000200000 x22: 0000000000000000
[ 176.675381] x21: 00000000000000c8 x20: fffffffffdfbbf3f
[ 176.680896] x19: 0000000000000001 x18: 0000000000000000
[ 176.686427] x17: 0000000000000000 x16: 0000000000000000
[ 176.691920] x15: 0000000000000000 x14: 0000000000000000
[ 176.697435] x13: 0000000000000001 x12: 0000000000000500
[ 176.702772] x11: 0000000000000001 x10: 0000000000000001
[ 176.708282] x9 : 0000070c7ca44a48 x8 : ffff69f8c02a3b00
[ 176.713794] x7 : ffff000000000000 x6 : 0000000055555556
[ 176.719049] x5 : 0000000000000001 x4 : 0000000000000007
[ 176.724734] x3 : ffffffffffffffff x2 : ffffc14475a00390
[ 176.729811] x1 : ffff69f8c02a3b00 x0 : 0000000000000001
[ 176.735408] Call trace:
[ 176.737612] percpu_ref_get_many+0x90/0xb0
[ 176.741633] refill_obj_stock+0x64/0xf0
[ 176.745569] obj_cgroup_uncharge+0x2c/0x40
[ 176.749593] memcg_slab_free_hook+0xf4/0x2a0
[ 176.754155] kmem_cache_free+0x108/0x430
[ 176.757644] put_cred_rcu+0xb8/0x150
[ 176.761576] rcu_core+0x288/0xa10
[ 176.764581] rcu_core_si+0x18/0x20
[ 176.768052] __do_softirq+0x140/0x3e8
[ 176.771475] irq_exit+0xc0/0xe0
[ 176.774448] __handle_domain_irq+0x74/0xd0
[ 176.778647] efi_header_end+0xb0/0xf0
[ 176.782403] el1_irq+0xd0/0x180
[ 176.785120] cpuidle_enter_state+0xb8/0x410
[ 176.789321] cpuidle_enter+0x40/0x60
[ 176.792821] call_cpuidle+0x44/0x80
[ 176.796596] do_idle+0x208/0x270
[ 176.799732] cpu_startup_entry+0x30/0x60
[ 176.803577] secondary_start_kernel+0x15c/0x180
[ 176.807962] Code: f9400694 97fff32f 72001c1f 54000060 (f833029f)
[ 176.814431] ---[ end trace ff79a4e3ad57fae7 ]---
[ 176.829220] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 176.829467] SMP: stopping secondary CPUs
[ 176.829600] Kernel Offset: 0x414465840000 from 0xffff800010000000
[ 176.836029] PHYS_OFFSET: 0xffff960840000000
[ 176.840234] CPU features: 0x48240002,03802a30
[ 176.844601] Memory Limit: none
[ 176.855942] Rebooting in 30 seconds..
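As a reading aid for the oops above, the ESR value it prints (0x96000021) can be split into its fields. The sketch below is a minimal Python decoder, assuming the standard Arm bit layout for data aborts; the fault-code table is intentionally partial, and the 8-byte access size used in the alignment check is an assumption, not something stated in the log. It only restates what the log already reports (EC = 0x25, ISS = 0x21, WnR = 0) and names the fault status code; it is not a diagnosis.

```python
# Sketch: decode the ESR value printed in the oops ("ESR = 0x96000021").
# Field positions follow the Arm layout for data aborts. The DFSC table is
# deliberately partial and only covers a few common codes.

DFSC_NAMES = {
    0x04: "translation fault, level 0",
    0x05: "translation fault, level 1",
    0x06: "translation fault, level 2",
    0x07: "translation fault, level 3",
    0x21: "alignment fault",
}


def decode_esr(esr: int) -> dict:
    """Split an ESR value into the fields the kernel prints for a data abort."""
    iss = esr & 0x1FFFFFF              # ISS: bits [24:0]
    return {
        "EC": (esr >> 26) & 0x3F,      # exception class: bits [31:26]
        "IL": (esr >> 25) & 0x1,       # instruction length: bit 25
        "ISS": iss,
        "WnR": (iss >> 6) & 0x1,       # write-not-read: bit 6
        "DFSC": iss & 0x3F,            # data fault status code: bits [5:0]
    }


def is_aligned(address: int, size: int = 8) -> bool:
    """True if the address is a multiple of the access size in bytes.

    The default of 8 bytes is an assumption made for this sketch.
    """
    return address % size == 0


info = decode_esr(0x96000021)
print({name: hex(value) for name, value in info.items()})
print(DFSC_NAMES.get(info["DFSC"], "not in this partial table"))

# Faulting virtual addresses copied from the two oops logs in this thread.
for addr in (0xFFFFFFFFFDFBBF3F, 0x007B23676F6C2D79):
    print(hex(addr), "8-byte aligned:", is_aligned(addr))
```

With this layout, status code 0x21 corresponds to an alignment fault rather than a translation fault, and neither faulting address is 8-byte aligned, which may be worth mentioning when reporting the issue.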
On another of our Xavier units, one panic produced a stack trace similar to the log above, but no gst-plugin-scan exception was logged at that time:
[ 20.920981] mttcan c310000.mttcan can0: bitrate error 0.2%
[ 20.937258] mttcan c320000.mttcan can1: bitrate error 0.2%
[ 22.607580] Bridge firewalling registered
[ 28.894232] nvidia: loading out-of-tree module taints kernel.
[ 28.896714] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 343.214580] nvmap_alloc_handle: PID 16968: drivers_camera_: WARNING: All NvMap Allocations must have a tag to identify the subsystem allocating memory.Please pass the tag to the API call NvRmMemHanldeAllocAttr() or relevant.
[ 344.170640] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 131072, datatype 1e
[ 9003.700995] Unable to handle kernel paging request at virtual address 007b23676f6c2d79
[ 9003.701208] Mem abort info:
[ 9003.701299] ESR = 0x96000021
[ 9003.701389] EC = 0x25: DABT (current EL), IL = 32 bits
[ 9003.701508] SET = 0, FnV = 0
[ 9003.701577] EA = 0, S1PTW = 0
[ 9003.701644] Data abort info:
[ 9003.701704] ISV = 0, ISS = 0x00000021
[ 9003.701787] CM = 0, WnR = 0
[ 9003.701853] [007b23676f6c2d79] address between user and kernel address ranges
[ 9003.702020] Internal error: Oops: 96000021 [#1] PREEMPT SMP
[ 9003.702140] Modules linked in: xt_nat fuse xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter br_netfilter switch mttcan can_dev can_raw can lzo_rle lzo_compress zram overlay ramoops reed_solomon mv88e6xxx dsa_core loop camera ox03c10 imx390 nvgpu snd_soc_tegra186_asrc snd_soc_tegra210_ope snd_soc_tegra186_dspk snd_soc_tegra210_iqc snd_soc_tegra186_arad snd_soc_tegra210_mvc snd_soc_tegra210_afc snd_soc_tegra210_dmic snd_soc_tegra210_adx snd_soc_tegra210_amx snd_soc_tegra210_mixer snd_soc_tegra210_i2s snd_soc_tegra210_sfc snd_soc_tegra210_admaif snd_soc_tegra_pcm aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce sha2_ce sha256_arm64 sha1_ce snd_soc_tegra210_adsp snd_soc_tegra_machine_driver binfmt_misc snd_soc_tegra_utils snd_soc_simple_card_utils pwm_fan snd_soc_spdif_tx nct1008 ina3221 nvadsp max77620_thermal snd_soc_tegra210_ahub nvethernet tegra_bpmp_thermal nvmap
[ 9003.702363] userspace_alert tegra210_adma snd_soc_sgtl5000 spi_tegra114 gw5200 max9295 max96712 max9296 max2008X spidev ip_tables x_tables [last unloaded: mtd]
[ 9003.759317] CPU: 4 PID: 34 Comm: ksoftirqd/4 Tainted: G OE 5.10.192 #3
[ 9003.767431] Hardware name: Jetson-AGX (DT)
[ 9003.771371] pstate: 00c00089 (nzcv daIf +PAN +UAO -TCO BTYPE=--)
[ 9003.777439] pc : percpu_ref_get_many+0x90/0xb0
[ 9003.782130] lr : percpu_ref_get_many+0x88/0xb0
[ 9003.786326] sp : ffff8000101fbb10
[ 9003.790005] x29: ffff8000101fbb10 x28: 00000000fffffff8
[ 9003.795518] x27: ffff34471a06cf80 x26: ffff3445000d5000
[ 9003.801037] x25: ffff8000101fbc60 x24: ffffd3d7090af000
[ 9003.806277] x23: 0000020000200000 x22: 0000000000000000
[ 9003.812050] x21: 00000000000000c8 x20: 227b23676f6c2d79
[ 9003.817560] x19: 0000000000000001 x18: 0000000000000000
[ 9003.822828] x17: 0000000000000000 x16: ffffd3d70715360c
[ 9003.828328] x15: 0000000000000000 x14: 0000000000000000
[ 9003.834100] x13: fffffed115bd7008 x12: ffffd3d7090aec08
[ 9003.839213] x11: 0000000000177700 x10: 0000000000000640
[ 9003.844692] x9 : ffff6075772dc000 x8 : ffff344500348ec0
[ 9003.850463] x7 : ffff000000000000 x6 : 0000000055555556
[ 9003.855981] x5 : 0000000000000001 x4 : 0000000000000007
[ 9003.861140] x3 : ffffffffffffffff x2 : ffffd3d7072e0390
[ 9003.866478] x1 : ffff344500348ec0 x0 : 0000000000000001
[ 9003.872104] Call trace:
[ 9003.874275] percpu_ref_get_many+0x90/0xb0
[ 9003.878314] refill_obj_stock+0x64/0xf0
[ 9003.882497] obj_cgroup_uncharge+0x2c/0x40
[ 9003.886514] memcg_slab_free_hook+0xf4/0x2a0
[ 9003.890565] kmem_cache_free+0x108/0x430
[ 9003.894573] __d_free+0x2c/0x40
[ 9003.897483] rcu_core+0x288/0xa10
[ 9003.900694] rcu_core_si+0x18/0x20
[ 9003.904375] __do_softirq+0x140/0x3e8
[ 9003.907873] run_ksoftirqd+0x50/0x60
[ 9003.911375] smpboot_thread_fn+0x1c4/0x280
[ 9003.915309] kthread+0x148/0x170
[ 9003.918813] ret_from_fork+0x10/0x18
[ 9003.922049] Code: f9400694 97fff32f 72001c1f 54000060 (f833029f)
[ 9003.928431] ---[ end trace f36d791e75292573 ]---
[ 9003.945261] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 9003.945471] SMP: stopping secondary CPUs
[ 9003.945598] Kernel Offset: 0x53d6f7120000 from 0xffff800010000000
[ 9003.949768] PHYS_OFFSET: 0xffffcbbc00000000
[ 9003.953882] CPU features: 0x48240002,03802a30
[ 9003.958168] Memory Limit: none
[ 9003.972907] Rebooting in 30 seconds..
panic_2025-01-08_15_25_13.tar.gz (27.8 KB)
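One detail in the register dumps can be checked mechanically. In both logs x20 holds the bad pointer, and in the second log its value (227b23676f6c2d79) consists entirely of printable ASCII bytes. The sketch below, a small Python helper whose name `pointer_as_text` is made up for illustration, reads a 64-bit register value as little-endian bytes and reports the text if every byte is printable. A pointer that decodes to text is often a hint that string data overwrote a kernel structure, but that interpretation is an assumption here, not a confirmed root cause.

```python
# Sketch: check whether a 64-bit register value from an oops looks like
# ASCII text. pointer_as_text is a hypothetical helper, not a kernel tool.

def pointer_as_text(value: int, width: int = 8):
    """Return the value as an ASCII string if all bytes are printable, else None."""
    raw = value.to_bytes(width, "little")   # arm64 Linux is little-endian
    if all(0x20 <= byte < 0x7F for byte in raw):
        return raw.decode("ascii")
    return None


# x20 values copied from the two register dumps in this thread.
for x20 in (0xFFFFFFFFFDFBBF3F, 0x227B23676F6C2D79):
    print(hex(x20), "->", repr(pointer_as_text(x20)))
```

The first value is not printable text, while the second decodes to a short string fragment. If the same pattern shows up in further panics, searching the applications on the device for strings containing that fragment might help narrow down which component is involved.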
It is not possible to tell what is wrong from these logs alone. You need to provide a method that reproduces the issue on an NV devkit so that we can look into it.
Alternatively, first try to narrow down which application is causing the error on your side.
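OK. Because this is a low-probability issue, it is hard to find a way to reproduce it. We will first disable gst-plugin-scan and try to reproduce it.
One more question: on a few devices we see gnu-related error logs. The nvidia*.so files we use come from the Linux for Tegra package for JP 5.1.3. Do we need to download NVIDIA-kernel-module-source-TempVersion and build the nvidia*.so files ourselves? Under what circumstances would we need to build these .so files ourselves?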
Those are not nvidia*.so files. They are nvidia-xxx.ko kernel modules.
If you update the kernel image, you should update the kernel modules too, to avoid a mismatch. This is a common step whenever a kernel is replaced; it is not specific to the NVIDIA driver.
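OK, thanks! Regarding "update kernel image", I'd like to confirm: do both of the following cases require rebuilding and updating nvidia-xxx.ko?
Case 1: starting from the kernel source of the official JP 5.1.3 release, modifying parts of the device tree, the kernel code, and some build options, or adding some of our own driver modules.
Case 2: upgrading to a newer kernel version.
@WayneWWW Regarding the question above, please confirm whether both cases require building nvidia-xxx.ko.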
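The kernel image, kernel modules, and kernel dtb are basically one set.
After you update the kernel image, you also have to update the kernel modules (.ko); otherwise those kernel modules may fail to probe.
The kernel dtb needs to change less often than the modules. Usually it can be left alone.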
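To check whether the installed modules still match the running kernel after a kernel image update, one option is to compare the kernel release with each module's vermagic string. The Python sketch below assumes the standard `modinfo -F vermagic` command is available on the device; the helper names and the sample strings in the demo are illustrative, not taken from this thread. It is a simplification: the kernel compares the full vermagic string and, when module versioning is enabled, also symbol CRCs, so a matching release is necessary but not sufficient.

```python
# Sketch: compare a module's vermagic with a kernel release string.
# Helper names and the demo strings are illustrative assumptions.
import subprocess


def module_vermagic(module: str) -> str:
    """Return the vermagic string reported by `modinfo -F vermagic <module>`."""
    result = subprocess.run(
        ["modinfo", "-F", "vermagic", module],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def vermagic_matches(kernel_release: str, vermagic: str) -> bool:
    """True if the first token of vermagic equals the kernel release."""
    parts = vermagic.split()
    return bool(parts) and parts[0] == kernel_release


# Demo with hard-coded sample strings (hypothetical values).
sample_vermagic = "5.10.192 SMP preempt mod_unload modversions aarch64"
print(vermagic_matches("5.10.192", sample_vermagic))
print(vermagic_matches("5.10.216", sample_vermagic))

# On a device, the live check would look like this:
#   import platform
#   vermagic_matches(platform.release(), module_vermagic("nvgpu"))
```

Running the live check over every module listed under "Modules linked in" would flag modules built against a different kernel release than the one currently booted.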