Intermittent Kernel Crash Following Upgrade to Jetpack 5.1.1

Hello, we are experiencing an issue that seems to be related to an upgrade from Jetpack 4.4 to Jetpack 5.1.1 on an AdvanTech MIC-730AI device. At first, all we saw was a corrupted syslog and a crash every few hours. I do not have physical access to these devices, and no one is available to open a device up and connect to the internal COM ports for debugging, so instead I have been following the dmesg logs, which yielded the following errors prior to a crash:
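For reference, this is roughly how we are persisting the kernel log across crashes, since syslog itself keeps getting corrupted. It is a minimal sketch: the output path is our own arbitrary choice, and it simply follows the kernel ring buffer with `dmesg --follow` (util-linux, present on L4T's Ubuntu userland):

```python
#!/usr/bin/env python3
# Minimal log-capture sketch: follow the kernel ring buffer and append
# every line to a file that survives the crash. The path and setup are
# illustrative, not specific to JetPack or the MIC-730AI.
import subprocess

LOG_PATH = "/var/log/dmesg-capture.log"  # hypothetical persistent location

# dmesg --follow blocks and streams new kernel messages as they arrive.
proc = subprocess.Popen(
    ["dmesg", "--follow", "--time-format", "iso"],
    stdout=subprocess.PIPE,
    text=True,
)

# Line-buffered append so each message hits disk promptly, which matters
# when the next kernel message may be the last one before a hang.
with open(LOG_PATH, "a", buffering=1) as f:
    for line in proc.stdout:
        f.write(line)
```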

[ 8231.360760] nvgpu: 17000000.gv11b     nvgpu_timeout_expired_msg_cpu:94   [ERR]  Timeout detected @ nvgpu_pmu_wait_fw_ack_status+0xbc/0x130 [nvgpu]
[ 8231.361143] nvgpu: 17000000.gv11b           nvgpu_pmu_wait_fw_ready:167  [ERR]  PMU is not ready yet
[ 8231.361327] nvgpu: 17000000.gv11b               lsfm_int_wpr_region:65   [ERR]  PMU not ready to process requests
[ 8231.361535] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:107  [ERR]  LSF init WPR region failed
[ 8231.361744] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:128  [ERR]  LSF Load failed
[ 8231.361929] nvgpu: 17000000.gv11b nvgpu_gr_falcon_load_secure_ctxsw_ucode:727  [ERR]  Unable to boot GPCCS
[ 8231.362135] nvgpu: 17000000.gv11b        nvgpu_gr_falcon_init_ctxsw:159  [ERR]  fail
[ 8231.362308] nvgpu: 17000000.gv11b           nvgpu_report_err_to_sdl:66   [ERR]  Failed to report an error: hw_unit_id = 0x2, err_id=0x6, ss_err_id = 0x262
[ 8231.362738] nvgpu: 17000000.gv11b      gr_init_ctxsw_falcon_support:833  [ERR]  FECS context switch init error
[ 8231.363481] nvgpu: 17000000.gv11b            nvgpu_finalize_poweron:1010 [ERR]  Failed initialization for: g->ops.gr.gr_init_support
[ 8231.416564] nvgpu: 17000000.gv11b nvgpu_gpu_set_deterministic_ch_railgate:1858 [WRN]  cannot busy to restore deterministic ch
[ 8231.581606] CPU:0, Error: cbb-noc@2300000, irq=14
[ 8231.581735] **************************************
[ 8231.581836] CPU:0, Error:cbb-noc
[ 8231.581929]  Error Logger            : 1
[ 8231.582024]  ErrLog0                 : 0x80030608
[ 8231.582103]    Transaction Type      : WR  - Write, Incrementing
[ 8231.582213]    Error Code            : TMO
[ 8231.582285]    Error Source          : Target NIU
[ 8231.582367]    Error Description     : Target time-out error
[ 8231.582492]    Packet header Lock    : 0
[ 8231.582569]    Packet header Len1    : 3
[ 8231.582637]    NOC protocol version  : version >= 2.7
[ 8231.582730]  ErrLog1                 : 0x340013
[ 8231.582796]  ErrLog2                 : 0x0
[ 8231.582851]    RouteId               : 0x340013
[ 8231.582920]    InitFlow              : ccroc_p2ps/I/ccroc_p2ps
[ 8231.583020]    Targflow              : gpu_p2pm/T/gpu_p2pm
[ 8231.583105]    TargSubRange          : 0
[ 8231.583194]    SeqId                 : 0
[ 8231.583256]  ErrLog3                 : 0x810090
[ 8231.583322]  ErrLog4                 : 0x0
[ 8231.583524]    Address accessed      : 0x17810090
[ 8231.583838]  ErrLog5                 : 0x489f850
[ 8231.584088]    Non-Modify            : 0x1
[ 8231.584325]    AXI ID                : 0x9
[ 8231.585569]    Master ID             : CCPLEX
[ 8231.588811]    Security Group(GRPSEC): 0x7e

In addition to running TensorFlow-TensorRT models, we are also running a Kubernetes (k3s) cluster and various infrastructure. The GPU processes data through our models for up to several hours before the crash.
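To correlate the crash with workload runtime, we are also watching for the nvgpu errors shown above. A rough sketch of that watcher is below; the matched strings come straight from the log, while the reporting side is illustrative only:

```python
#!/usr/bin/env python3
# Sketch of a watcher that records how long the box has been up before
# the first nvgpu failure appears in the kernel log. The patterns mirror
# the [ERR] messages pasted above.
import re
import subprocess
import time

NVGPU_ERRORS = re.compile(
    r"PMU is not ready yet"
    r"|LSF init WPR region failed"
    r"|Unable to boot GPCCS"
    r"|FECS context switch init error"
)

start = time.monotonic()
proc = subprocess.Popen(["dmesg", "--follow"], stdout=subprocess.PIPE, text=True)

for line in proc.stdout:
    if NVGPU_ERRORS.search(line):
        hours = (time.monotonic() - start) / 3600
        print(f"first nvgpu error after {hours:.2f} h: {line.strip()}")
        break
```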

We suggest contacting AdvanTech for support with this kernel crash issue first, since you are using their carrier board and BSP.
