Kernel Crashing Following Upgrade to Jetpack 5.1/5.1.1

I had previously created another thread here: Intermittent Kernel Crash Following Upgrade to Jetpack 5.1.1

I am creating a new thread, per Advantech's recommendation, now that we have confirmed the issue occurs on our devkit boards as well as on the MIC-730ai devices.

The issue has been replicated on AGX Xavier devkits as well as the MIC-730ai's with both JetPack 5.1 and 5.1.1. I have been unable to replicate it with a minimal setup.

We are running a k3s cluster on 5 devices, which includes 5 instances of our Python-based machine learning application (1 on each device), 2 Java applications, 2 Go applications, and RabbitMQ, MinIO, and Redis clusters (all in k3s containers). Our pipeline will run fine for hours and then encounter a kernel panic. The panic does not appear to be associated with either low memory or high CPU usage.

One odd thing I noticed is that “nvgpu_finalize_poweron” is erroring out… but it is unclear why this would even be getting called hours after the device started using the GPU.

With the upgrade to JetPack 5.1.x we were also forced to upgrade from TensorFlow 1.x to 2.x.

I have not yet been able to collect the full kernel log from the devkits (the output is truncated by the kernel panic while tailing dmesg), but I have done so on the MIC-730ai's. Here is one example:

[ 8376.546639] nvgpu: 17000000.gv11b     nvgpu_timeout_expired_msg_cpu:94   [ERR]  Timeout detected @ nvgpu_pmu_wait_fw_ack_status+0xbc/0x130 [nvgpu] 
[ 8376.547054] nvgpu: 17000000.gv11b           nvgpu_pmu_wait_fw_ready:167  [ERR]  PMU is not ready yet
[ 8376.547283] nvgpu: 17000000.gv11b               lsfm_int_wpr_region:65   [ERR]  PMU not ready to process requests
[ 8376.547533] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:107  [ERR]  LSF init WPR region failed
[ 8376.547799] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:128  [ERR]  LSF Load failed
[ 8376.548010] nvgpu: 17000000.gv11b nvgpu_gr_falcon_load_secure_ctxsw_ucode:727  [ERR]  Unable to boot GPCCS
[ 8376.548254] nvgpu: 17000000.gv11b        nvgpu_gr_falcon_init_ctxsw:159  [ERR]  fail
[ 8376.548424] nvgpu: 17000000.gv11b           nvgpu_report_err_to_sdl:66   [ERR]  Failed to report an error: hw_unit_id = 0x2, err_id=0x6, ss_err_id = 0x262
[ 8376.548706] nvgpu: 17000000.gv11b      gr_init_ctxsw_falcon_support:833  [ERR]  FECS context switch init error
[ 8376.549333] nvgpu: 17000000.gv11b            nvgpu_finalize_poweron:1010 [ERR]  Failed initialization for: g->ops.gr.gr_init_support
[ 8376.588380] nvgpu: 17000000.gv11b nvgpu_gpu_set_deterministic_ch_railgate:1858 [WRN]  cannot busy to restore deterministic ch
[ 8376.753465] CPU:0, Error: cbb-noc@2300000, irq=14
[ 8376.753592] **************************************
[ 8376.753717] CPU:0, Error:cbb-noc
[ 8376.753791] 	Error Logger		: 1
[ 8376.753868] 	ErrLog0			: 0x80030608
[ 8376.753967] 	  Transaction Type	: WR  - Write, Incrementing
[ 8376.754077] 	  Error Code		: TMO
[ 8376.754145] 	  Error Source		: Target NIU
[ 8376.754227] 	  Error Description	: Target time-out error
[ 8376.754334] 	  Packet header Lock	: 0
[ 8376.754417] 	  Packet header Len1	: 3
[ 8376.754496] 	  NOC protocol version	: version >= 2.7
[ 8376.754600] 	ErrLog1			: 0x340013
[ 8376.754670] 	ErrLog2			: 0x0
[ 8376.754728] 	  RouteId		: 0x340013
[ 8376.754797] 	  InitFlow		: ccroc_p2ps/I/ccroc_p2ps
[ 8376.754892] 	  Targflow		: gpu_p2pm/T/gpu_p2pm
[ 8376.754992] 	  TargSubRange		: 0
[ 8376.755060] 	  SeqId			: 0
[ 8376.755118] 	ErrLog3			: 0x810090
[ 8376.755186] 	ErrLog4			: 0x0
[ 8376.755405] 	  Address accessed	: 0x17810090
[ 8376.755708] 	ErrLog5			: 0x489f850
[ 8376.755940] 	  Non-Modify		: 0x1
[ 8376.756168] 	  AXI ID		: 0x9
[ 8376.757685] 	  Master ID		: CCPLEX
[ 8376.760666] 	  Security Group(GRPSEC): 0x7e
[ 8376.765123] 	  Cache			: 0x0 -- Device Non-Bufferable
[ 8376.770201] 	  Protection		: 0x2 -- Unprivileged, Non-Secure, Data Access
[ 8376.776784] 	  FALCONSEC		: 0x0
[ 8376.779916] 	  Virtual Queuing Channel(VQC): 0x0
[ 8376.784555] 	**************************************
[ 8376.789859] kernel BUG at drivers/soc/tegra/cbb/tegra194-cbb.c:2057!
[ 8376.795937] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 8376.801524] Modules linked in: veth(E) xt_recent(E) xt_statistic(E) xt_nat(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) xt_mark(E) xt_comment(E) ip6table_filter(E) ip6table_mangle(E) iptable_mangle(E) ip6table_nat(E) ip6_tables(E) ip_vs_rr(E) ip_vs(E) fuse(E) xt_conntrack(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) xt_addrtype(E) iptable_filter(E) lzo_rle(E) lzo_compress(E) zram(E) br_netfilter(E) overlay(E) ramoops(E) reed_solomon(E) loop(E) snd_soc_tegra186_asrc(E) snd_soc_tegra186_dspk(E) snd_soc_tegra210_ope(E) snd_soc_tegra186_arad(E) snd_soc_tegra210_iqc(E) snd_soc_tegra210_mvc(E) snd_soc_tegra210_admaif(E) snd_soc_tegra210_dmic(E) snd_soc_tegra_pcm(E) snd_soc_tegra210_afc(E) snd_soc_tegra210_adx(E) snd_soc_tegra210_amx(E) snd_soc_tegra210_mixer(E) snd_soc_tegra210_i2s(E) snd_soc_tegra210_sfc(E) aes_ce_blk(E) crypto_simd(E) cryptd(E) aes_ce_cipher(E) ghash_ce(E) sha2_ce(E)
[ 8376.801765]  sha256_arm64(E) binfmt_misc(E) sha1_ce(E) snd_soc_spdif_tx(E) max77620_thermal(E) snd_soc_tegra210_adsp(E) nct1008(E) tegra_bpmp_thermal(E) snd_soc_tegra_utils(E) bluedroid_pm(E) snd_soc_simple_card_utils(E) snd_soc_tegra210_ahub(E) nvadsp(E) tegra210_adma(E) userspace_alert(E) snd_hda_codec_hdmi(E) snd_hda_tegra(E) snd_hda_codec(E) snd_hda_core(E) spi_tegra114(E) ina3221(E) pwm_fan(E) nvgpu(E) nvmap(E) igb(OE) ip_tables(E) x_tables(E) ahci(E) libahci(E) libata(E) [last unloaded: mtd]
[ 8376.932447] CPU: 0 PID: 129633 Comm: systemd-udevd Tainted: G           OE     5.10.104-tegra #2
[ 8376.941348] Hardware name: Unknown Jetson-AGX/Jetson-AGX, BIOS 3.1-32827747 03/19/2023
[ 8376.949235] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[ 8376.955542] pc : tegra194_cbb_err_isr+0x19c/0x1b0
[ 8376.959992] lr : tegra194_cbb_err_isr+0x11c/0x1b0
[ 8376.964985] sp : ffff800010003b40
[ 8376.968385] x29: ffff800010003b40 x28: 0000000000000001 
[ 8376.973639] x27: 0000000000000080 x26: ffffbf0033cc93d8 
[ 8376.979150] x25: ffffbf003462de38 x24: 0000000000000001 
[ 8376.984681] x23: ffffbf0033fb7000 x22: ffffbf003444ef30 
[ 8376.990434] x21: 000000000000000e x20: 0000000000000006 
[ 8376.995961] x19: ffffbf003444ef20 x18: 0000000000000010 
[ 8377.001459] x17: 393934353336323d x16: ffffbf00325d3210 
[ 8377.006972] x15: ffff0b5e43fe3f70 x14: ffffffffffffffff 
[ 8377.012396] x13: ffff800090003737 x12: ffff80001000373f 
[ 8377.017476] x11: 0000000000000038 x10: 0101010101010101 
[ 8377.023258] x9 : ffff800010003a50 x8 : 2a2a2a2a2a2a2a2a 
[ 8377.028502] x7 : 2a2a2a2a2a2a2a09 x6 : c0000000ffffefff 
[ 8377.034014] x5 : ffff0b65bfd1c958 x4 : ffffbf00342d7968 
[ 8377.039438] x3 : 0000000000000001 x2 : ffffbf003276e170 
[ 8377.045033] x1 : ffff0b5e43fe3a00 x0 : 0000000000010101 
[ 8377.050371] Call trace:
[ 8377.052827]  tegra194_cbb_err_isr+0x19c/0x1b0
[ 8377.056874]  __handle_irq_event_percpu+0x68/0x2a0
[ 8377.061579]  handle_irq_event_percpu+0x40/0xa0
[ 8377.066146]  handle_irq_event+0x50/0xf0
[ 8377.069886]  handle_fasteoi_irq+0xc0/0x170
[ 8377.073670]  generic_handle_irq+0x40/0x60
[ 8377.077939]  __handle_domain_irq+0x70/0xd0
[ 8377.082138]  efi_header_end+0xb0/0xf0
[ 8377.085636]  el1_irq+0xd0/0x180
[ 8377.088611]  __do_softirq+0xb4/0x3e8
[ 8377.092289]  irq_exit+0xc0/0xe0
[ 8377.095021]  __handle_domain_irq+0x74/0xd0
[ 8377.099029]  efi_header_end+0xb0/0xf0
[ 8377.102960]  el1_irq+0xd0/0x180
[ 8377.105681]  ptep_clear_flush+0xcc/0xd0
[ 8377.109618]  wp_page_copy+0x1d4/0xbe0
[ 8377.113133]  do_wp_page+0xa0/0x530
[ 8377.116611]  handle_mm_fault+0xa30/0x1000
[ 8377.120652]  do_page_fault+0x118/0x400
[ 8377.124487]  do_mem_abort+0x54/0xb0
[ 8377.128167]  el0_da+0x38/0x50
[ 8377.131055]  el0_sync_handler+0x80/0xb0
[ 8377.135163]  el0_sync+0x16c/0x180
[ 8377.138322] Code: a9446bf9 a94573fb a8c67bfd d65f03c0 (d4210000) 
[ 8377.144618] ---[ end trace 057ee2343e7709db ]---
[ 8377.159765] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
[ 8377.160000] SMP: stopping secondary CPUs
[ 8377.160479] Kernel Offset: 0x3f00225b0000 from 0xffff800010000000
[ 8377.166929] PHYS_OFFSET: 0xfffff4a2c0000000
[ 8377.170889] CPU features: 0x8240002,03802a30
[ 8377.175501] Memory Limit: none
[ 8377.187070] Rebooting in 10 seconds..

And another, more typical example (the one above is the less common variant, as it shows systemd-udevd as the running process instead of python):

[35758.793055] nvgpu: 17000000.gv11b     nvgpu_timeout_expired_msg_cpu:94   [ERR]  Timeout detected @ nvgpu_pmu_wait_fw_ack_status+0xbc/0x130 [nvgpu] 
[35758.793447] nvgpu: 17000000.gv11b           nvgpu_pmu_wait_fw_ready:167  [ERR]  PMU is not ready yet
[35758.802595] nvgpu: 17000000.gv11b               lsfm_int_wpr_region:65   [ERR]  PMU not ready to process requests
[35758.803030] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:107  [ERR]  LSF init WPR region failed
[35758.805463] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:128  [ERR]  LSF Load failed
[35758.805752] nvgpu: 17000000.gv11b nvgpu_gr_falcon_load_secure_ctxsw_ucode:727  [ERR]  Unable to boot GPCCS
[35758.806136] nvgpu: 17000000.gv11b        nvgpu_gr_falcon_init_ctxsw:159  [ERR]  fail
[35758.806389] nvgpu: 17000000.gv11b           nvgpu_report_err_to_sdl:66   [ERR]  Failed to report an error: hw_unit_id = 0x2, err_id=0x6, ss_err_id = 0x262
[35758.806789] nvgpu: 17000000.gv11b      gr_init_ctxsw_falcon_support:833  [ERR]  FECS context switch init error
[35758.807089] nvgpu: 17000000.gv11b            nvgpu_finalize_poweron:1010 [ERR]  Failed initialization for: g->ops.gr.gr_init_support
[35758.844682] nvgpu: 17000000.gv11b nvgpu_gpu_set_deterministic_ch_railgate:1858 [WRN]  cannot busy to restore deterministic ch
[35759.009931] CPU:0, Error: cbb-noc@2300000, irq=14
[35759.010095] **************************************
[35759.010233] CPU:0, Error:cbb-noc
[35759.010309] 	Error Logger		: 1
[35759.010385] 	ErrLog0			: 0x80030608
[35759.010460] 	  Transaction Type	: WR  - Write, Incrementing
[35759.010569] 	  Error Code		: TMO
[35759.010640] 	  Error Source		: Target NIU
[35759.010723] 	  Error Description	: Target time-out error
[35759.010842] 	  Packet header Lock	: 0
[35759.010915] 	  Packet header Len1	: 3
[35759.011015] 	  NOC protocol version	: version >= 2.7
[35759.011111] 	ErrLog1			: 0x340011
[35759.011179] 	ErrLog2			: 0x0
[35759.011237] 	  RouteId		: 0x340011
[35759.011306] 	  InitFlow		: ccroc_p2ps/I/ccroc_p2ps
[35759.011399] 	  Targflow		: gpu_p2pm/T/gpu_p2pm
[35759.011484] 	  TargSubRange		: 0
[35759.011576] 	  SeqId			: 0
[35759.011638] 	ErrLog3			: 0x810090
[35759.011706] 	ErrLog4			: 0x0
[35759.011862] 	  Address accessed	: 0x17810090
[35759.012168] 	ErrLog5			: 0x409f850
[35759.012409] 	  Non-Modify		: 0x1
[35759.012648] 	  AXI ID		: 0x8
[35759.014150] 	  Master ID		: CCPLEX
[35759.017136] 	  Security Group(GRPSEC): 0x7e
[35759.021597] 	  Cache			: 0x0 -- Device Non-Bufferable
[35759.026668] 	  Protection		: 0x2 -- Unprivileged, Non-Secure, Data Access
[35759.033496] 	  FALCONSEC		: 0x0
[35759.036642] 	  Virtual Queuing Channel(VQC): 0x0
[35759.041281] 	**************************************
[35759.046082] kernel BUG at drivers/soc/tegra/cbb/tegra194-cbb.c:2057!
[35759.052658] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[35759.057734] Modules linked in: veth(E) xt_recent(E) xt_statistic(E) xt_nat(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) ip6table_filter(E) ip6table_mangle(E) xt_comment(E) xt_mark(E) iptable_mangle(E) ip6table_nat(E) ip6_tables(E) ip_vs_rr(E) ip_vs(E) xt_conntrack(E) fuse(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) xt_addrtype(E) iptable_filter(E) lzo_rle(E) lzo_compress(E) zram(E) br_netfilter(E) overlay(E) ramoops(E) reed_solomon(E) loop(E) snd_soc_tegra186_dspk(E) snd_soc_tegra210_iqc(E) snd_soc_tegra186_asrc(E) snd_soc_tegra210_ope(E) snd_soc_tegra186_arad(E) snd_soc_tegra210_afc(E) snd_soc_tegra210_mvc(E) snd_soc_tegra210_dmic(E) snd_soc_tegra210_adx(E) snd_soc_tegra210_amx(E) snd_soc_tegra210_i2s(E) snd_soc_tegra210_admaif(E) snd_soc_tegra210_mixer(E) snd_soc_tegra_pcm(E) snd_soc_tegra210_sfc(E) aes_ce_blk(E) crypto_simd(E) cryptd(E) aes_ce_cipher(E) ghash_ce(E) sha2_ce(E)
[35759.058022]  sha256_arm64(E) binfmt_misc(E) sha1_ce(E) snd_soc_spdif_tx(E) snd_soc_tegra210_adsp(E) snd_soc_tegra210_ahub(E) max77620_thermal(E) nct1008(E) snd_soc_tegra_utils(E) tegra_bpmp_thermal(E) snd_soc_simple_card_utils(E) bluedroid_pm(E) nvadsp(E) userspace_alert(E) tegra210_adma(E) snd_hda_codec_hdmi(E) snd_hda_tegra(E) snd_hda_codec(E) snd_hda_core(E) spi_tegra114(E) ina3221(E) pwm_fan(E) nvgpu(E) nvmap(E) igb(OE) ip_tables(E) x_tables(E) ahci(E) libahci(E) libata(E) [last unloaded: mtd]
[35759.188659] CPU: 0 PID: 4985 Comm: python3 Tainted: G           OE     5.10.104-tegra #2
[35759.197025] Hardware name: Unknown Jetson-AGX/Jetson-AGX, BIOS 3.1-32827747 03/19/2023
[35759.204911] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[35759.211212] pc : tegra194_cbb_err_isr+0x19c/0x1b0
[35759.215948] lr : tegra194_cbb_err_isr+0x11c/0x1b0
[35759.220649] sp : ffff800010003b40
[35759.224062] x29: ffff800010003b40 x28: 0000000000000001 
[35759.229574] x27: 0000000000000080 x26: ffffcf0f42b193d8 
[35759.235086] x25: ffffcf0f4347de38 x24: 0000000000000001 
[35759.240344] x23: ffffcf0f42e07000 x22: ffffcf0f4329ef30 
[35759.246111] x21: 000000000000000e x20: 0000000000000006 
[35759.251623] x19: ffffcf0f4329ef20 x18: 0000000000000010 
[35759.257136] x17: 0000000000000000 x16: ffffcf0f41423210 
[35759.262670] x15: ffff2b64ef3f13f0 x14: ffffffffffffffff 
[35759.268162] x13: ffff800090003737 x12: ffff800010003740 
[35759.273498] x11: 0000000000000038 x10: 0101010101010101 
[35759.279018] x9 : ffff800010003a50 x8 : 2a2a2a2a2a2a2a2a 
[35759.284523] x7 : 2a2a2a2a2a2a2a09 x6 : c0000000ffffefff 
[35759.289778] x5 : ffff2b6bbfd20958 x4 : ffffcf0f43127968 
[35759.295462] x3 : 0000000000000001 x2 : ffffcf0f415be170 
[35759.300814] x1 : ffff2b64ef3f0e80 x0 : 0000000000010100 
[35759.305879] Call trace:
[35759.308350]  tegra194_cbb_err_isr+0x19c/0x1b0
[35759.312884]  __handle_irq_event_percpu+0x68/0x2a0
[35759.317429]  handle_irq_event_percpu+0x40/0xa0
[35759.321891]  handle_irq_event+0x50/0xf0
[35759.325653]  handle_fasteoi_irq+0xc0/0x170
[35759.329436]  generic_handle_irq+0x40/0x60
[35759.333702]  __handle_domain_irq+0x70/0xd0
[35759.337667]  efi_header_end+0xb0/0xf0
[35759.341403]  el1_irq+0xd0/0x180
[35759.344119]  __do_softirq+0xb4/0x3e8
[35759.347816]  irq_exit+0xc0/0xe0
[35759.351027]  __handle_domain_irq+0x74/0xd0
[35759.354793]  efi_header_end+0xb0/0xf0
[35759.358741]  el0_irq_naked+0x4c/0x54
[35759.362231] Code: a9446bf9 a94573fb a8c67bfd d65f03c0 (d4210000) 
[35759.368363] ---[ end trace 4307a82fce3b412d ]---
[35759.383501] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
[35759.383708] SMP: stopping secondary CPUs
[35759.384205] Kernel Offset: 0x4f0f31400000 from 0xffff800010000000
[35759.390487] PHYS_OFFSET: 0xffffd49cc0000000
[35759.394347] CPU features: 0x8240002,03802a30
[35759.398798] Memory Limit: none
[35759.410248] Rebooting in 10 seconds..
Shutdown state requested 1
Rebooting system ...

And here is what I managed to capture on an AGX Xavier devkit using dmesg over ssh:

[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b     nvgpu_timeout_expired_msg_cpu:94   [ERR]  Timeout detected @ nvgpu_pmu_wait_fw_ack_status+0xbc/0x130 [nvgpu]
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b           nvgpu_pmu_wait_fw_ready:167  [ERR]  PMU is not ready yet
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b               lsfm_int_wpr_region:65   [ERR]  PMU not ready to process requests
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:107  [ERR]  LSF init WPR region failed
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:128  [ERR]  LSF Load failed
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b nvgpu_gr_falcon_load_secure_ctxsw_ucode:727  [ERR]  Unable to boot GPCCS
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b        nvgpu_gr_falcon_init_ctxsw:159  [ERR]  fail
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b           nvgpu_report_err_to_sdl:66   [ERR]  Failed to report an error: hw_unit_id = 0x2, err_id=0x6, ss_err_id = 0x262
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b      gr_init_ctxsw_falcon_support:833  [ERR]  FECS context switch init error
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b            nvgpu_finalize_poweron:1010 [ERR]  Failed initialization for: g->ops.gr.gr_init_support
[Wed Jan 24 17:22:45 2024] nvgpu: 17000000.gv11b nvgpu_gpu_set_deterministic_ch_railgate:1858 [WRN]  cannot busy to restore deterministic ch
[Wed Jan 24 17:22:45 2024] CPU:0, Error: cbb-noc@2300000, irq=14
[Wed Jan 24 17:22:45 2024] **************************************
[Wed Jan 24 17:22:45 2024] CPU:0, Error:cbb-noc
[Wed Jan 24 17:22:45 2024]      Error Logger            : 1
[Wed Jan 24 17:22:45 2024]      ErrLog0                 : 0x80030608
[Wed Jan 24 17:22:45 2024]        Transaction Type      : WR  - Write, Incrementing
[Wed Jan 24 17:22:45 2024]        Error Code            : TMO
[Wed Jan 24 17:22:45 2024]        Error Source          : Target NIU
[Wed Jan 24 17:22:45 2024]        Error Description     : Target time-out error
[Wed Jan 24 17:22:45 2024]        Packet header Lock    : 0
[Wed Jan 24 17:22:45 2024]        Packet header Len1    : 3
[Wed Jan 24 17:22:45 2024]        NOC protocol version  : version >= 2.7

Also, we occasionally see the following kernel message (the reported time varies):

hrtimer: interrupt took 75648 ns

I would like to formulate a minimal test to replicate the issue, but it seems to be related to a race condition, and for whatever reason our particular setup has maximized the probability of the race occurring. I have been unable to replicate it with just a single node running the machine learning app and a RabbitMQ instance to feed the app; I even attempted this while running stress (https://linux.die.net/man/1/stress) with various flags, to no avail.
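
For reference, the stress invocations looked roughly like this (just a sketch; the exact values varied between runs and are illustrative, not the definitive set we ran):

# Combined CPU, memory, and I/O load for an hour (values illustrative)
stress --cpu 8 --vm 4 --vm-bytes 512M --io 4 --timeout 3600s

# CPU-only and memory-only variants were also tried
stress --cpu 8 --timeout 3600s
stress --vm 8 --vm-bytes 256M --timeout 3600s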

Hi,

Please test directly with JetPack 5.1.2. We do not check issues on older release versions.

We have upgraded our AGX Xavier devkit cluster to JetPack 5.1.2 and have confirmed the issue is still occurring. Here is what was captured via dmesg over ssh:

[Mon Jan 29 16:31:18 2024] nvgpu: 17000000.gv11b     nvgpu_timeout_expired_msg_cpu:94   [ERR]  Timeout detected @ nvgpu_pmu_wait_fw_ack_status+0xbc/0x130 [nvgpu]
[Mon Jan 29 16:31:18 2024] nvgpu: 17000000.gv11b           nvgpu_pmu_wait_fw_ready:167  [ERR]  PMU is not ready yet
[Mon Jan 29 16:31:18 2024] nvgpu: 17000000.gv11b               lsfm_int_wpr_region:65   [ERR]  PMU not ready to process requests
[Mon Jan 29 16:31:18 2024] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:107  [ERR]  LSF init WPR region failed
[Mon Jan 29 16:31:18 2024] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:128  [ERR]  LSF Load failed
[Mon Jan 29 16:31:18 2024] nvgpu: 17000000.gv11b nvgpu_gr_falcon_load_secure_ctxsw_ucode:727  [ERR]  Unable to boot GPCCS
[Mon Jan 29 16:31:18 2024] nvgpu: 17000000.gv11b        nvgpu_gr_falcon_init_ctxsw:159  [ERR]  fail
[Mon Jan 29 16:31:18 2024] nvgpu: 17000000.gv11b           nvgpu_report_err_to_sdl:66   [ERR]  Failed to report an error: hw_unit_id = 0x2, err_id=0x6, ss_err_id = 0x62
[Mon Jan 29 16:31:18 2024] nvgpu: 17000000.gv11b      gr_init_ctxsw_falcon_support:833  [ERR]  FECS context switch init error
[Mon Jan 29 16:31:18 2024] nvgpu: 17000000.gv11b            nvgpu_finalize_poweron:1010 [ERR]  Failed initialization for: g->ops.gr.gr_init_support
[Mon Jan 29 16:31:18 2024] nvgpu: 17000000.gv11b nvgpu_gpu_set_deterministic_ch_railgate:1861 [WRN]  cannot busy to restore deterministic ch
[Mon Jan 29 16:31:18 2024] CPU:0, Error: cbb-noc@2300000, irq=14
[Mon Jan 29 16:31:18 2024] **************************************
[Mon Jan 29 16:31:18 2024] CPU:0, Error:cbb-noc
[Mon Jan 29 16:31:18 2024]      Error Logger            : 1
[Mon Jan 29 16:31:18 2024]      ErrLog0                 : 0x80030608

Could you share the full UART log rather than just a few lines?

node05_crash_uart_log_02062024.log (167.4 KB)

Sorry for the delay. I have attached a UART log that contains a normal shutdown and reboot sequence as well as a crash. It took a while for the node we were debugging to crash rather than one of the other nodes (each time another node crashes, the test has to be reset manually, so this took some time).

Also, another node that I was monitoring with dmesg crashed in an unusual way. Occasionally we see the timeout messages without the node actually crashing at that time. This log captured that scenario, but then 3 days later there was a kernel NULL pointer dereference resulting in a panic. I have not seen this before.
node02-crash-02052024_dmesg.txt (146.8 KB)

I am wondering about something I see right before the nvgpu errors show up in the logs:

[Thu Feb  1 02:39:34 2024] ramoops: using 0x200000@0xc8200000, ecc: 0                                         
[Thu Feb  1 02:39:34 2024] IRQ229: set affinity failed(-22).                                                  
[Thu Feb  1 02:39:34 2024] IRQ289: set affinity failed(-22).                                                  
[Thu Feb  1 02:39:34 2024] IRQ292: set affinity failed(-22).                                                  
[Thu Feb  1 02:39:34 2024] CPU4: shutdown              
[Thu Feb  1 02:39:34 2024] psci: CPU4 killed (polled 0 ms)                                                    
[Thu Feb  1 02:39:34 2024] IRQ229: set affinity failed(-22).                                                  
[Thu Feb  1 02:39:34 2024] IRQ289: set affinity failed(-22).                                                  
[Thu Feb  1 02:39:34 2024] IRQ292: set affinity failed(-22).                                                  
[Thu Feb  1 02:39:34 2024] CPU5: shutdown              
[Thu Feb  1 02:39:34 2024] psci: CPU5 killed (polled 0 ms)                                                    
[Thu Feb  1 02:39:34 2024] IRQ229: set affinity failed(-22).                                                  
[Thu Feb  1 02:39:34 2024] IRQ289: set affinity failed(-22).                                                  
[Thu Feb  1 02:39:34 2024] IRQ292: set affinity failed(-22).                                                  
[Thu Feb  1 02:39:34 2024] CPU6: shutdown              
[Thu Feb  1 02:39:34 2024] psci: CPU6 killed (polled 0 ms)                                                    
[Thu Feb  1 02:39:34 2024] IRQ229: set affinity failed(-22).                                                  
[Thu Feb  1 02:39:34 2024] CPU7: shutdown              
[Thu Feb  1 02:39:34 2024] psci: CPU7 killed (polled 0 ms)    

Have you in any way tried to assign or change CPU affinity? Is it possible the software you are using expects to be able to assign affinity (the software would need to run as root/sudo)?

Also, are you in max power mode, or perhaps in a lower power mode?

Yes, those messages correspond to manually switching the power mode to 15W_Desktop, to match how the devices run in production. We previously replicated the issue in 30W_ALL mode to rule the power mode out as the underlying cause; that was on either JetPack 5.1 or 5.1.1.
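
For context, the mode switch is done with nvpmodel (a sketch; the mode IDs below are the ones on our AGX Xavier devkits and may differ on other boards):

# Query the current power mode
sudo nvpmodel -q

# Switch to 15W_Desktop (mode 7 on our devkits) to match production
sudo nvpmodel -m 7

# Switch to 30W_ALL (mode 3) as used in the earlier test
sudo nvpmodel -m 3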

What about setting affinity? If a core is down because the current power mode does not enable that core, then setting affinity to that core cannot succeed. Some IRQs cannot be assigned to cores other than the first core, either. It is important to know whether you are setting affinities yourself, or whether this is the result of the default configuration.

We are not explicitly setting affinity. This is just what happens when we change the power mode. I was concerned about this initially, but supposedly it can be ignored.

Also, for our AGX devkits on JetPack 5.1.2, the IRQs that fail to set affinity are 229, 290 and 294, which show up like this in /proc/interrupts:

           CPU0       CPU1       CPU2       CPU3       
229:         25          0          0          0   PCI-MSI 524288 Edge      nvme0q0
290:          0          0          0          0      gpio 149 Edge      bluetooth hostwake
294:      21559          0          0          0   PCI-MSI 403177472 Edge      iwlwifi

Edit: the actual IRQ numbers apparently vary depending on the device, but the mapping seems to be the same.
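
For what it's worth, this is roughly how I'm checking which device each failing IRQ maps to on a given node (a sketch; the IRQ numbers are the ones from that node's affinity warnings):

# Pull the lines for the IRQs named in the "set affinity failed" messages
grep -E '^ *(229|290|294):' /proc/interrupts

# The owning device can also be read from sysfs, if the kernel exposes /sys/kernel/irq
cat /sys/kernel/irq/229/actions    # e.g. nvme0q0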

I don’t know if the scheduler is correctly moving the IRQ to the correct core, but it might not matter (someone from NVIDIA would have to answer that). Certainly, though, if PCI for the NVMe stops working, it would cause a failure if the NVMe is in use at the time: a drive which is in use cannot unmount, and it would cause a crash if the PCI bus just turned off. If this were tested without NVMe, then most likely nobody would notice the PCI IRQ changing or going away. I am thinking NVIDIA would have to answer whether that PCI IRQ, in combination with the NVMe, is causing the failure (I'm assuming the NVMe is in use when the IRQ changes or fails).

Yeah, I imagine it is trying to set the affinity of some IRQs to target cores 0-3 after half of the cores are shut off, and presumably some of the IRQs that must stay on core 0 exclusively are included in that logic, which just causes the kernel to say "you can't do that". I should also point out that those affinity messages occur long before the actual kernel panic, on the order of hours or days.

When I do nvpmodel -m 3 I see:

I/TC: Secondary CPU 4 initializing
I/TC: Secondary CPU 4 switching to normal world boot
I/TC: Secondary CPU 5 initializing
I/TC: Secondary CPU 5 switching to normal world boot
I/TC: Secondary CPU 6 initializing
I/TC: Secondary CPU 6 switching to normal world boot
I/TC: Secondary CPU 7 initializing
I/TC: Secondary CPU 7 switching to normal world boot

and then when switching back to mode 7:

[221515.909020] migrate_one_irq: 2 callbacks suppressed
[221515.909036] IRQ229: set affinity failed(-22).
[221515.909370] IRQ289: set affinity failed(-22).
[221515.909494] IRQ294: set affinity failed(-22).
[221515.910165] CPU4: shutdown
[221515.969400] IRQ229: set affinity failed(-22).
[221515.969624] IRQ289: set affinity failed(-22).
[221515.969719] IRQ294: set affinity failed(-22).
[221515.970302] CPU5: shutdown
[221516.108941] IRQ229: set affinity failed(-22).
[221516.109175] IRQ289: set affinity failed(-22).
[221516.109286] IRQ294: set affinity failed(-22).
[221516.110032] CPU6: shutdown
[221516.166044] IRQ229: set affinity failed(-22).
[221516.166867] CPU7: shutdown

When I check smp_affinity_list for all of the interrupts above, in either mode, it is always 0-7.

Also, this is the same machine that was previously showing 229, 290 and 294 in those messages, so I guess the list I gave above was wrong (are IRQ assignments supposed to vary between boots?). IRQs 229, 289 and 294 are currently:

229:         25          0          0          0   PCI-MSI 524288 Edge      nvme0q0
289:          0          0          0          0   PCI-MSI 134742016 Edge      ahci[0001:01:00.0]
294:      22103          0          0          0   PCI-MSI 403177472 Edge      iwlwifi

Edit: the effective affinity is always 00 in either mode, by the way.
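
For completeness, these are the procfs checks I'm running (a sketch):

# Requested vs. effective affinity for each IRQ that logs "set affinity failed(-22)"
for irq in 229 289 294; do
  printf 'IRQ %s: requested=%s effective=%s\n' "$irq" \
    "$(cat /proc/irq/$irq/smp_affinity_list)" \
    "$(cat /proc/irq/$irq/effective_affinity)"
done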

The core which is actually used is managed by the scheduler. If scheduler policy changes, then the core can change. If the scheduler knows a core is down, but it was told the IRQ should go to that core, then it would be the scheduler which reassigns to the new core. Actual IRQ could then differ from any manually set affinity.

The scheduler will attempt to take advantage of cache, and so there is a pressure applied to keeping threads using a given set of memory on the same core, but this can easily change where the scheduler actually puts a thread if there is some conflict or sufficient pressure to move to a different core.

I don’t know if hardware IRQs might enumerate in a different order or not. For example, it seems plausible that a change in the device tree, even if it is just the order of nodes appearing, might change the numeric name of a given hardware IRQ. I don’t know. It is also possible in the case of kernel modules that the load order of modules might result in different hardware IRQs. I also don’t know if that is valid, but it seems possible. The actual wiring would be set to a given pin on the module (or internally), and this wouldn’t change, but the numeric name for that hardware IRQ isn’t necessarily set in stone. I couldn’t tell you what this particular case actually enforces.

The part which is obvious though, and which I have little doubt about, is that if the NVMe is in use, and something invalidates the IRQ (this is beyond rescheduling), then the kernel would likely panic. It is quite possible that the IRQ affinity has nothing to do with your issue, but if for any reason the invalid affinity of the NVMe (PCI) fails rather than simply moving to a different core, there would be a kernel panic.

SMP affinity configuration is a suggestion to the scheduler. Where the actual IRQ routes is the question…was there an issue which destroyed access to the rootfs while it was in use such that something was lost? I find it suspicious.

If you solve affinity errors, then test and still see the problem, it changes and narrows the scope of the problem.

I honestly wouldn’t even know where to begin trying to address the affinity issue.

I looked at the kernel code, and based on that I am pretty sure nvme0q0 handles admin commands… so I installed nvme-cli and ran sudo nvme list a number of times; I can see the interrupt count incrementing on core 0 with no issues or kernel errors, despite the affinity warning. I tried this in both power modes, and it only ever causes interrupts to increment on core 0. I don't know if that means anything, but I am somewhat out of my element here.
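
For reference, the check was roughly the following (a sketch; nvme-cli installed from the Ubuntu repositories):

# In one terminal: watch the admin-queue interrupt counter per core
watch -n1 'grep nvme0q0 /proc/interrupts'

# In another: repeatedly issue admin commands (Identify) against the drive
sudo apt-get install -y nvme-cli
for i in $(seq 1 20); do sudo nvme list; done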

Something NVIDIA might be able to answer is whether the default software in JetPack 5 ever adjusts affinity such that it might attempt to put the NVMe on some core other than zero. I don't know the cause, but a panic occurring while there is a failed IRQ for the NVMe is quite high on the "suspicious" list. A system which can handle the IRQ on any core wouldn't have an issue, but I suspect NVIDIA did not set up the NVMe to run on other CPU cores (the affinity messages, in this case, seem quite unusual).

@WayneWWW, is the information I have supplied sufficient to investigate this issue?

Do you have another module that you can use to cross-check this issue?

I don’t see other users reporting a similar error. It could be a hardware problem on a single module, so it would be better to do a cross-check.

This issue has occurred on every Advantech MIC-730ai device we have tested, as well as on all 5 AGX Xavier devkits we have tested.

Are you sure Advantech tells you to flash their board with the stock NVIDIA BSP? In most cases they provide their own BSP.

Your log indicates you flashed the stock NVIDIA JetPack BSP directly onto a custom board.

Advantech provided us with their own BSP to flash, which we are using. But the devkits are using the stock BSP provided through the NVIDIA SDK Manager, and they experience the same issue.

The logs I attached on Feb 6 were taken from devkits running 5.1.2, flashed using the SDK Manager. Advantech does not yet provide a 5.1.2 version, so the issue has only been replicated on the MIC-730ai devices with 5.1 and 5.1.1.