Hello, we are experiencing an issue that seems to be related to an upgrade from JetPack 4.4 to JetPack 5.1.1 on an Advantech MIC-730AI device. At first all we saw was a corrupted syslog and a crash every few hours. I do not have physical access to these devices, and no one is available to open a unit up and connect to the internal COM ports for debugging, so instead I have been following the dmesg logs, which yielded the following errors prior to a crash:
[ 8231.360760] nvgpu: 17000000.gv11b nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ nvgpu_pmu_wait_fw_ack_status+0xbc/0x130 [nvgpu]
[ 8231.361143] nvgpu: 17000000.gv11b nvgpu_pmu_wait_fw_ready:167 [ERR] PMU is not ready yet
[ 8231.361327] nvgpu: 17000000.gv11b lsfm_int_wpr_region:65 [ERR] PMU not ready to process requests
[ 8231.361535] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:107 [ERR] LSF init WPR region failed
[ 8231.361744] nvgpu: 17000000.gv11b nvgpu_pmu_lsfm_bootstrap_ls_falcon:128 [ERR] LSF Load failed
[ 8231.361929] nvgpu: 17000000.gv11b nvgpu_gr_falcon_load_secure_ctxsw_ucode:727 [ERR] Unable to boot GPCCS
[ 8231.362135] nvgpu: 17000000.gv11b nvgpu_gr_falcon_init_ctxsw:159 [ERR] fail
[ 8231.362308] nvgpu: 17000000.gv11b nvgpu_report_err_to_sdl:66 [ERR] Failed to report an error: hw_unit_id = 0x2, err_id=0x6, ss_err_id = 0x262
[ 8231.362738] nvgpu: 17000000.gv11b gr_init_ctxsw_falcon_support:833 [ERR] FECS context switch init error
[ 8231.363481] nvgpu: 17000000.gv11b nvgpu_finalize_poweron:1010 [ERR] Failed initialization for: g->ops.gr.gr_init_support
[ 8231.416564] nvgpu: 17000000.gv11b nvgpu_gpu_set_deterministic_ch_railgate:1858 [WRN] cannot busy to restore deterministic ch
[ 8231.581606] CPU:0, Error: cbb-noc@2300000, irq=14
[ 8231.581735] **************************************
[ 8231.581836] CPU:0, Error:cbb-noc
[ 8231.581929] Error Logger : 1
[ 8231.582024] ErrLog0 : 0x80030608
[ 8231.582103] Transaction Type : WR - Write, Incrementing
[ 8231.582213] Error Code : TMO
[ 8231.582285] Error Source : Target NIU
[ 8231.582367] Error Description : Target time-out error
[ 8231.582492] Packet header Lock : 0
[ 8231.582569] Packet header Len1 : 3
[ 8231.582637] NOC protocol version : version >= 2.7
[ 8231.582730] ErrLog1 : 0x340013
[ 8231.582796] ErrLog2 : 0x0
[ 8231.582851] RouteId : 0x340013
[ 8231.582920] InitFlow : ccroc_p2ps/I/ccroc_p2ps
[ 8231.583020] Targflow : gpu_p2pm/T/gpu_p2pm
[ 8231.583105] TargSubRange : 0
[ 8231.583194] SeqId : 0
[ 8231.583256] ErrLog3 : 0x810090
[ 8231.583322] ErrLog4 : 0x0
[ 8231.583524] Address accessed : 0x17810090
[ 8231.583838] ErrLog5 : 0x489f850
[ 8231.584088] Non-Modify : 0x1
[ 8231.584325] AXI ID : 0x9
[ 8231.585569] Master ID : CCPLEX
[ 8231.588811] Security Group(GRPSEC): 0x7e
In addition to running TensorFlow TensorRT models, we are also running a Kubernetes (k3s) cluster and various other infrastructure. The GPU processes data through our models for up to several hours before the crash occurs.
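Because the syslog gets corrupted around the crash, one workaround we are considering is forcing persistent journald storage so kernel messages survive the reboot. A sketch, assuming systemd-journald is in use on this L4T image (the drop-in filename is our own choice):

```
# /etc/systemd/journald.conf.d/persistent.conf
[Journal]
Storage=persistent
# Flush to disk frequently so the last messages before a hang are not lost
SyncIntervalSec=5s
```

After `systemctl restart systemd-journald`, `journalctl -k -b -1` should then show the kernel log from the previous (crashed) boot without needing serial console access.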