I had previously created another thread here: Intermittent Kernel Crash Following Upgrade to Jetpack 5.1.1
I am creating a new thread after we have confirmed this issue occurs on our devkit boards as well as the MIC-730ai device and after consulting with Advantech. Creating this thread per their recommendation.
The issue has been replicated on AGX Xavier Devkits as well as the MIC-730ai’s with both jetpack 5.1 and 5.1.1. I have been unable to replicate this issue with a minimal setup.
We are running a k3s cluster on 5 devices which includes 5 instances of our python based machine learning application (1 on each device), 2 java applications, 2 apps written in golang, a rabbitmq cluster, minio cluster and redis cluster (all in k3s containers). Our pipeline will be running fine for hours and then encounter a kernel panic. This kernel panic does not appear to be associated with either low memory or high cpu usage.
One odd thing I noticed is that “nvgpu_finalize_poweron” is erroring out… but it is unclear why this would even be getting called hours after the device has started using the gpu.
With the upgrade to jetpack 5.1.x we were also forced to upgrade from tensorflow 1.x to 2.x.
I have not yet been able to collect the full kernel message from the devkits (message is truncated due to kernel panic when tailing dmesg). But I have done so on the MIC-730AI’s. Here is one example:
[ 8376.546639] nvgpu: 17000000.gv11b nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ nvgpu_pmu_wait_fw_ack_status+0xbc/0x130 [nvgpu]
[ 8376.547054] nvgpu: 17000000.gv11b nvgpu_pmu_wait_fw_ready:167 [ERR] PMU is not ready yet
[ 8376.547283] nvgpu: 17000000.gv11b lsfm_int_wpr_region:65 [ERR] PMU not ready to process requests
[ 8376.547533] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:107 [ERR] LSF init WPR region failed
[ 8376.547799] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:128 [ERR] LSF Load failed
[ 8376.548010] nvgpu: 17000000.gv11b nvgpu_gr_falcon_load_secure_ctxsw_ucode:727 [ERR] Unable to boot GPCCS
[ 8376.548254] nvgpu: 17000000.gv11b nvgpu_gr_falcon_init_ctxsw:159 [ERR] fail
[ 8376.548424] nvgpu: 17000000.gv11b nvgpu_report_err_to_sdl:66 [ERR] Failed to report an error: hw_unit_id = 0x2, err_id=0x6, ss_err_id = 0x262
[ 8376.548706] nvgpu: 17000000.gv11b gr_init_ctxsw_falcon_support:833 [ERR] FECS context switch init error
[ 8376.549333] nvgpu: 17000000.gv11b nvgpu_finalize_poweron:1010 [ERR] Failed initialization for: g->ops.gr.gr_init_support
[ 8376.588380] nvgpu: 17000000.gv11b nvgpu_gpu_set_deterministic_ch_railgate:1858 [WRN] cannot busy to restore deterministic ch
[ 8376.753465] CPU:0, Error: cbb-noc@2300000, irq=14
[ 8376.753592] **************************************
[ 8376.753717] CPU:0, Error:cbb-noc
[ 8376.753791] Error Logger : 1
[ 8376.753868] ErrLog0 : 0x80030608
[ 8376.753967] Transaction Type : WR - Write, Incrementing
[ 8376.754077] Error Code : TMO
[ 8376.754145] Error Source : Target NIU
[ 8376.754227] Error Description : Target time-out error
[ 8376.754334] Packet header Lock : 0
[ 8376.754417] Packet header Len1 : 3
[ 8376.754496] NOC protocol version : version >= 2.7
[ 8376.754600] ErrLog1 : 0x340013
[ 8376.754670] ErrLog2 : 0x0
[ 8376.754728] RouteId : 0x340013
[ 8376.754797] InitFlow : ccroc_p2ps/I/ccroc_p2ps
[ 8376.754892] Targflow : gpu_p2pm/T/gpu_p2pm
[ 8376.754992] TargSubRange : 0
[ 8376.755060] SeqId : 0
[ 8376.755118] ErrLog3 : 0x810090
[ 8376.755186] ErrLog4 : 0x0
[ 8376.755405] Address accessed : 0x17810090
[ 8376.755708] ErrLog5 : 0x489f850
[ 8376.755940] Non-Modify : 0x1
[ 8376.756168] AXI ID : 0x9
[ 8376.757685] Master ID : CCPLEX
[ 8376.760666] Security Group(GRPSEC): 0x7e
[ 8376.765123] Cache : 0x0 -- Device Non-Bufferable
[ 8376.770201] Protection : 0x2 -- Unprivileged, Non-Secure, Data Access
[ 8376.776784] FALCONSEC : 0x0
[ 8376.779916] Virtual Queuing Channel(VQC): 0x0
[ 8376.784555] **************************************
[ 8376.789859] kernel BUG at drivers/soc/tegra/cbb/tegra194-cbb.c:2057!
[ 8376.795937] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 8376.801524] Modules linked in: veth(E) xt_recent(E) xt_statistic(E) xt_nat(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) xt_mark(E) xt_comment(E) ip6table_filter(E) ip6table_mangle(E) iptable_mangle(E) ip6table_nat(E) ip6_tables(E) ip_vs_rr(E) ip_vs(E) fuse(E) xt_conntrack(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) xt_addrtype(E) iptable_filter(E) lzo_rle(E) lzo_compress(E) zram(E) br_netfilter(E) overlay(E) ramoops(E) reed_solomon(E) loop(E) snd_soc_tegra186_asrc(E) snd_soc_tegra186_dspk(E) snd_soc_tegra210_ope(E) snd_soc_tegra186_arad(E) snd_soc_tegra210_iqc(E) snd_soc_tegra210_mvc(E) snd_soc_tegra210_admaif(E) snd_soc_tegra210_dmic(E) snd_soc_tegra_pcm(E) snd_soc_tegra210_afc(E) snd_soc_tegra210_adx(E) snd_soc_tegra210_amx(E) snd_soc_tegra210_mixer(E) snd_soc_tegra210_i2s(E) snd_soc_tegra210_sfc(E) aes_ce_blk(E) crypto_simd(E) cryptd(E) aes_ce_cipher(E) ghash_ce(E) sha2_ce(E)
[ 8376.801765] sha256_arm64(E) binfmt_misc(E) sha1_ce(E) snd_soc_spdif_tx(E) max77620_thermal(E) snd_soc_tegra210_adsp(E) nct1008(E) tegra_bpmp_thermal(E) snd_soc_tegra_utils(E) bluedroid_pm(E) snd_soc_simple_card_utils(E) snd_soc_tegra210_ahub(E) nvadsp(E) tegra210_adma(E) userspace_alert(E) snd_hda_codec_hdmi(E) snd_hda_tegra(E) snd_hda_codec(E) snd_hda_core(E) spi_tegra114(E) ina3221(E) pwm_fan(E) nvgpu(E) nvmap(E) igb(OE) ip_tables(E) x_tables(E) ahci(E) libahci(E) libata(E) [last unloaded: mtd]
[ 8376.932447] CPU: 0 PID: 129633 Comm: systemd-udevd Tainted: G OE 5.10.104-tegra #2
[ 8376.941348] Hardware name: Unknown Jetson-AGX/Jetson-AGX, BIOS 3.1-32827747 03/19/2023
[ 8376.949235] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[ 8376.955542] pc : tegra194_cbb_err_isr+0x19c/0x1b0
[ 8376.959992] lr : tegra194_cbb_err_isr+0x11c/0x1b0
[ 8376.964985] sp : ffff800010003b40
[ 8376.968385] x29: ffff800010003b40 x28: 0000000000000001
[ 8376.973639] x27: 0000000000000080 x26: ffffbf0033cc93d8
[ 8376.979150] x25: ffffbf003462de38 x24: 0000000000000001
[ 8376.984681] x23: ffffbf0033fb7000 x22: ffffbf003444ef30
[ 8376.990434] x21: 000000000000000e x20: 0000000000000006
[ 8376.995961] x19: ffffbf003444ef20 x18: 0000000000000010
[ 8377.001459] x17: 393934353336323d x16: ffffbf00325d3210
[ 8377.006972] x15: ffff0b5e43fe3f70 x14: ffffffffffffffff
[ 8377.012396] x13: ffff800090003737 x12: ffff80001000373f
[ 8377.017476] x11: 0000000000000038 x10: 0101010101010101
[ 8377.023258] x9 : ffff800010003a50 x8 : 2a2a2a2a2a2a2a2a
[ 8377.028502] x7 : 2a2a2a2a2a2a2a09 x6 : c0000000ffffefff
[ 8377.034014] x5 : ffff0b65bfd1c958 x4 : ffffbf00342d7968
[ 8377.039438] x3 : 0000000000000001 x2 : ffffbf003276e170
[ 8377.045033] x1 : ffff0b5e43fe3a00 x0 : 0000000000010101
[ 8377.050371] Call trace:
[ 8377.052827] tegra194_cbb_err_isr+0x19c/0x1b0
[ 8377.056874] __handle_irq_event_percpu+0x68/0x2a0
[ 8377.061579] handle_irq_event_percpu+0x40/0xa0
[ 8377.066146] handle_irq_event+0x50/0xf0
[ 8377.069886] handle_fasteoi_irq+0xc0/0x170
[ 8377.073670] generic_handle_irq+0x40/0x60
[ 8377.077939] __handle_domain_irq+0x70/0xd0
[ 8377.082138] efi_header_end+0xb0/0xf0
[ 8377.085636] el1_irq+0xd0/0x180
[ 8377.088611] __do_softirq+0xb4/0x3e8
[ 8377.092289] irq_exit+0xc0/0xe0
[ 8377.095021] __handle_domain_irq+0x74/0xd0
[ 8377.099029] efi_header_end+0xb0/0xf0
[ 8377.102960] el1_irq+0xd0/0x180
[ 8377.105681] ptep_clear_flush+0xcc/0xd0
[ 8377.109618] wp_page_copy+0x1d4/0xbe0
[ 8377.113133] do_wp_page+0xa0/0x530
[ 8377.116611] handle_mm_fault+0xa30/0x1000
[ 8377.120652] do_page_fault+0x118/0x400
[ 8377.124487] do_mem_abort+0x54/0xb0
[ 8377.128167] el0_da+0x38/0x50
[ 8377.131055] el0_sync_handler+0x80/0xb0
[ 8377.135163] el0_sync+0x16c/0x180
[ 8377.138322] Code: a9446bf9 a94573fb a8c67bfd d65f03c0 (d4210000)
[ 8377.144618] ---[ end trace 057ee2343e7709db ]---
[ 8377.159765] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
[ 8377.160000] SMP: stopping secondary CPUs
[ 8377.160479] Kernel Offset: 0x3f00225b0000 from 0xffff800010000000
[ 8377.166929] PHYS_OFFSET: 0xfffff4a2c0000000
[ 8377.170889] CPU features: 0x8240002,03802a30
[ 8377.175501] Memory Limit: none
[ 8377.187070] Rebooting in 10 seconds..
And another (which is less common as it shows systemd-udevd as the running process instead of python):
[35758.793055] nvgpu: 17000000.gv11b nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ nvgpu_pmu_wait_fw_ack_status+0xbc/0x130 [nvgpu]
[35758.793447] nvgpu: 17000000.gv11b nvgpu_pmu_wait_fw_ready:167 [ERR] PMU is not ready yet
[35758.802595] nvgpu: 17000000.gv11b lsfm_int_wpr_region:65 [ERR] PMU not ready to process requests
[35758.803030] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:107 [ERR] LSF init WPR region failed
[35758.805463] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:128 [ERR] LSF Load failed
[35758.805752] nvgpu: 17000000.gv11b nvgpu_gr_falcon_load_secure_ctxsw_ucode:727 [ERR] Unable to boot GPCCS
[35758.806136] nvgpu: 17000000.gv11b nvgpu_gr_falcon_init_ctxsw:159 [ERR] fail
[35758.806389] nvgpu: 17000000.gv11b nvgpu_report_err_to_sdl:66 [ERR] Failed to report an error: hw_unit_id = 0x2, err_id=0x6, ss_err_id = 0x262
[35758.806789] nvgpu: 17000000.gv11b gr_init_ctxsw_falcon_support:833 [ERR] FECS context switch init error
[35758.807089] nvgpu: 17000000.gv11b nvgpu_finalize_poweron:1010 [ERR] Failed initialization for: g->ops.gr.gr_init_support
[35758.844682] nvgpu: 17000000.gv11b nvgpu_gpu_set_deterministic_ch_railgate:1858 [WRN] cannot busy to restore deterministic ch
[35759.009931] CPU:0, Error: cbb-noc@2300000, irq=14
[35759.010095] **************************************
[35759.010233] CPU:0, Error:cbb-noc
[35759.010309] Error Logger : 1
[35759.010385] ErrLog0 : 0x80030608
[35759.010460] Transaction Type : WR - Write, Incrementing
[35759.010569] Error Code : TMO
[35759.010640] Error Source : Target NIU
[35759.010723] Error Description : Target time-out error
[35759.010842] Packet header Lock : 0
[35759.010915] Packet header Len1 : 3
[35759.011015] NOC protocol version : version >= 2.7
[35759.011111] ErrLog1 : 0x340011
[35759.011179] ErrLog2 : 0x0
[35759.011237] RouteId : 0x340011
[35759.011306] InitFlow : ccroc_p2ps/I/ccroc_p2ps
[35759.011399] Targflow : gpu_p2pm/T/gpu_p2pm
[35759.011484] TargSubRange : 0
[35759.011576] SeqId : 0
[35759.011638] ErrLog3 : 0x810090
[35759.011706] ErrLog4 : 0x0
[35759.011862] Address accessed : 0x17810090
[35759.012168] ErrLog5 : 0x409f850
[35759.012409] Non-Modify : 0x1
[35759.012648] AXI ID : 0x8
[35759.014150] Master ID : CCPLEX
[35759.017136] Security Group(GRPSEC): 0x7e
[35759.021597] Cache : 0x0 -- Device Non-Bufferable
[35759.026668] Protection : 0x2 -- Unprivileged, Non-Secure, Data Access
[35759.033496] FALCONSEC : 0x0
[35759.036642] Virtual Queuing Channel(VQC): 0x0
[35759.041281] **************************************
[35759.046082] kernel BUG at drivers/soc/tegra/cbb/tegra194-cbb.c:2057!
[35759.052658] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[35759.057734] Modules linked in: veth(E) xt_recent(E) xt_statistic(E) xt_nat(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) ip6table_filter(E) ip6table_mangle(E) xt_comment(E) xt_mark(E) iptable_mangle(E) ip6table_nat(E) ip6_tables(E) ip_vs_rr(E) ip_vs(E) xt_conntrack(E) fuse(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) xt_addrtype(E) iptable_filter(E) lzo_rle(E) lzo_compress(E) zram(E) br_netfilter(E) overlay(E) ramoops(E) reed_solomon(E) loop(E) snd_soc_tegra186_dspk(E) snd_soc_tegra210_iqc(E) snd_soc_tegra186_asrc(E) snd_soc_tegra210_ope(E) snd_soc_tegra186_arad(E) snd_soc_tegra210_afc(E) snd_soc_tegra210_mvc(E) snd_soc_tegra210_dmic(E) snd_soc_tegra210_adx(E) snd_soc_tegra210_amx(E) snd_soc_tegra210_i2s(E) snd_soc_tegra210_admaif(E) snd_soc_tegra210_mixer(E) snd_soc_tegra_pcm(E) snd_soc_tegra210_sfc(E) aes_ce_blk(E) crypto_simd(E) cryptd(E) aes_ce_cipher(E) ghash_ce(E) sha2_ce(E)
[35759.058022] sha256_arm64(E) binfmt_misc(E) sha1_ce(E) snd_soc_spdif_tx(E) snd_soc_tegra210_adsp(E) snd_soc_tegra210_ahub(E) max77620_thermal(E) nct1008(E) snd_soc_tegra_utils(E) tegra_bpmp_thermal(E) snd_soc_simple_card_utils(E) bluedroid_pm(E) nvadsp(E) userspace_alert(E) tegra210_adma(E) snd_hda_codec_hdmi(E) snd_hda_tegra(E) snd_hda_codec(E) snd_hda_core(E) spi_tegra114(E) ina3221(E) pwm_fan(E) nvgpu(E) nvmap(E) igb(OE) ip_tables(E) x_tables(E) ahci(E) libahci(E) libata(E) [last unloaded: mtd]
[35759.188659] CPU: 0 PID: 4985 Comm: python3 Tainted: G OE 5.10.104-tegra #2
[35759.197025] Hardware name: Unknown Jetson-AGX/Jetson-AGX, BIOS 3.1-32827747 03/19/2023
[35759.204911] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[35759.211212] pc : tegra194_cbb_err_isr+0x19c/0x1b0
[35759.215948] lr : tegra194_cbb_err_isr+0x11c/0x1b0
[35759.220649] sp : ffff800010003b40
[35759.224062] x29: ffff800010003b40 x28: 0000000000000001
[35759.229574] x27: 0000000000000080 x26: ffffcf0f42b193d8
[35759.235086] x25: ffffcf0f4347de38 x24: 0000000000000001
[35759.240344] x23: ffffcf0f42e07000 x22: ffffcf0f4329ef30
[35759.246111] x21: 000000000000000e x20: 0000000000000006
[35759.251623] x19: ffffcf0f4329ef20 x18: 0000000000000010
[35759.257136] x17: 0000000000000000 x16: ffffcf0f41423210
[35759.262670] x15: ffff2b64ef3f13f0 x14: ffffffffffffffff
[35759.268162] x13: ffff800090003737 x12: ffff800010003740
[35759.273498] x11: 0000000000000038 x10: 0101010101010101
[35759.279018] x9 : ffff800010003a50 x8 : 2a2a2a2a2a2a2a2a
[35759.284523] x7 : 2a2a2a2a2a2a2a09 x6 : c0000000ffffefff
[35759.289778] x5 : ffff2b6bbfd20958 x4 : ffffcf0f43127968
[35759.295462] x3 : 0000000000000001 x2 : ffffcf0f415be170
[35759.300814] x1 : ffff2b64ef3f0e80 x0 : 0000000000010100
[35759.305879] Call trace:
[35759.308350] tegra194_cbb_err_isr+0x19c/0x1b0
[35759.312884] __handle_irq_event_percpu+0x68/0x2a0
[35759.317429] handle_irq_event_percpu+0x40/0xa0
[35759.321891] handle_irq_event+0x50/0xf0
[35759.325653] handle_fasteoi_irq+0xc0/0x170
[35759.329436] generic_handle_irq+0x40/0x60
[35759.333702] __handle_domain_irq+0x70/0xd0
[35759.337667] efi_header_end+0xb0/0xf0
[35759.341403] el1_irq+0xd0/0x180
[35759.344119] __do_softirq+0xb4/0x3e8
[35759.347816] irq_exit+0xc0/0xe0
[35759.351027] __handle_domain_irq+0x74/0xd0
[35759.354793] efi_header_end+0xb0/0xf0
[35759.358741] el0_irq_naked+0x4c/0x54
[35759.362231] Code: a9446bf9 a94573fb a8c67bfd d65f03c0 (d4210000)
[35759.368363] ---[ end trace 4307a82fce3b412d ]---
[35759.383501] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
[35759.383708] SMP: stopping secondary CPUs
[35759.384205] Kernel Offset: 0x4f0f31400000 from 0xffff800010000000
[35759.390487] PHYS_OFFSET: 0xffffd49cc0000000
[35759.394347] CPU features: 0x8240002,03802a30
[35759.398798] Memory Limit: none
[35759.410248] Rebooting in 10 seconds..
ÿäÿâShutdown state requested 1
Rebooting system ...
And here is what I managed to capture on a AGX Xavier devkit using dmesg over ssh:
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ nvgpu_pmu_wait_fw_ack_status+0xbc/0x130 [nvgpu]
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b nvgpu_pmu_wait_fw_ready:167 [ERR] PMU is not ready yet
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b lsfm_int_wpr_region:65 [ERR] PMU not ready to process requests
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:107 [ERR] LSF init WPR region failed
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:128 [ERR] LSF Load failed
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b nvgpu_gr_falcon_load_secure_ctxsw_ucode:727 [ERR] Unable to boot GPCCS
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b nvgpu_gr_falcon_init_ctxsw:159 [ERR] fail
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b nvgpu_report_err_to_sdl:66 [ERR] Failed to report an error: hw_unit_id = 0x2, err_id=0x6, ss_err_id = 0x262
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b gr_init_ctxsw_falcon_support:833 [ERR] FECS context switch init error
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b nvgpu_finalize_poweron:1010 [ERR] Failed initialization for: g->ops.gr.gr_init_support
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b nvgpu_gpu_set_deterministic_ch_railgate:1858 [WRN] cannot busy to restore deterministic ch
[Wed Jan 24 17:22:45 2024] CPU:0, Error: cbb-noc@2300000, irq=14
[Wed Jan 24 17:22:45 2024] **************************************
[Wed Jan 24 17:22:45 2024] CPU:0, Error:cbb-noc
[Wed Jan 24 17:22:45 2024] Error Logger : 1
[Wed Jan 24 17:22:45 2024] ErrLog0 : 0x80030608
[Wed Jan 24 17:22:45 2024] Transaction Type : WR - Write, Incrementing
[Wed Jan 24 17:22:45 2024] Error Code : TMO
[Wed Jan 24 17:22:45 2024] Error Source : Target NIU
[Wed Jan 24 17:22:45 2024] Error Description : Target time-out error
[Wed Jan 24 17:22:45 2024] Packet header Lock : 0
[Wed Jan 24 17:22:45 2024] Packet header Len1 : 3
[Wed Jan 24 17:22:45 2024] NOC protocol version : version >= 2.7
Also we occasionally see the following kernel message (time varies):
hrtimer: interrupt took 75648 ns
I would like to formulate a minimal test to replicate the issue, but it sort of seems like its related to a race condition and for whatever reason our particular setup has just maximized the probability of the race condition occurring. I have been unable to replicate it with just a single node running the machine learning app and a rabbitmq instance to feed the app, I even attempted to do this while using stress (https://linux.die.net/man/1/stress) with various flags to no avail.