Jetson ORIN NANO 8GB kernel error

Hello, we have an NVIDIA Orin Nano 8GB box running a DL inference server. It provides image inference through a REST API, but recently it has been behaving abnormally: the inference process quickly disappears. We are not sure whether it is being killed or exiting on its own.

There are many errors in the kernel log (/var/log/kern.log):

nvidia@nvidia:/logs/bak$ tail /var/log/kern.log
Apr 23 15:39:56 nvidia kernel: [1233149.724310] NVRM nvAssertFailedNoLog: Assertion failed: NV_FALSE @ gpu_mgr.c:296
Apr 23 15:39:57 nvidia kernel: [1233150.425566] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x2080013f result 0x56:
Apr 23 15:39:57 nvidia kernel: [1233150.426438] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x2080017e result 0x56:
Apr 23 15:39:57 nvidia kernel: [1233150.429764] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x2080014a result 0x56:
Apr 23 15:39:57 nvidia kernel: [1233150.468826] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x731341 result 0xffff:
Apr 23 15:39:57 nvidia kernel: [1233150.469224] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x730190 result 0x56:
Apr 23 15:39:57 nvidia kernel: [1233150.604976] NVRM gpumgrGetSomeGpu: Failed to retrieve pGpu - Too early call!.
Apr 23 15:39:57 nvidia kernel: [1233150.604983] NVRM nvAssertFailedNoLog: Assertion failed: NV_FALSE @ gpu_mgr.c:296
Apr 23 15:39:57 nvidia kernel: [1233150.928171] cpufreq: cpu0,cur:1501000,set:960000,set ndiv:75
Apr 23 15:40:04 nvidia kernel: [1233157.023131] cpufreq: cpu0,cur:1145000,set:729600,set ndiv:57
nvidia@nvidia:/logs/bak$ tail /var/log/kern.log
Apr 23 15:40:36 nvidia kernel: [1233189.551729] cpufreq: cpu0,cur:1191000,set:960000,set ndiv:75
Apr 23 15:40:36 nvidia kernel: [1233189.553505] cpufreq: cpu0,cur:1135000,set:1510400,set ndiv:118
Apr 23 15:40:37 nvidia kernel: [1233190.570364] cpufreq: cpu4,cur:1286000,set:1510400,set ndiv:118
Apr 23 15:40:38 nvidia kernel: [1233191.585591] cpufreq: cpu0,cur:1503000,set:729600,set ndiv:57
Apr 23 15:40:41 nvidia kernel: [1233194.632140] cpufreq: cpu0,cur:1323000,set:1510400,set ndiv:118
Apr 23 15:40:45 nvidia kernel: [1233198.695936] cpufreq: cpu0,cur:1264000,set:1510400,set ndiv:118
Apr 23 15:40:47 nvidia kernel: [1233200.730169] cpufreq: cpu0,cur:958000,set:1510400,set ndiv:118
Apr 23 15:40:48 nvidia kernel: [1233201.648874] cpufreq: cpu0,cur:1158000,set:729600,set ndiv:57
Apr 23 15:40:49 nvidia kernel: [1233202.665527] cpufreq: cpu0,cur:1179000,set:1510400,set ndiv:118
Apr 23 15:40:49 nvidia kernel: [1233202.760847] cpufreq: cpu0,cur:1238000,set:729600,set ndiv:57
nvidia@nvidia:/logs/bak$ tail /var/log/kern.log
Apr 23 15:40:52 nvidia kernel: [1233205.085634] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x2080017e result 0x56:
Apr 23 15:40:52 nvidia kernel: [1233205.089077] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x2080014a result 0x56:
Apr 23 15:40:52 nvidia kernel: [1233205.127339] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x731341 result 0xffff:
Apr 23 15:40:52 nvidia kernel: [1233205.127802] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x730190 result 0x56:
Apr 23 15:40:52 nvidia kernel: [1233205.259640] NVRM gpumgrGetSomeGpu: Failed to retrieve pGpu - Too early call!.
Apr 23 15:40:52 nvidia kernel: [1233205.259647] NVRM nvAssertFailedNoLog: Assertion failed: NV_FALSE @ gpu_mgr.c:296
Apr 23 15:40:53 nvidia kernel: [1233206.822477] cpufreq: cpu0,cur:1113000,set:1510400,set ndiv:118
Apr 23 15:40:53 nvidia kernel: [1233206.824891] cpufreq: cpu4,cur:1196000,set:1510400,set ndiv:118
Apr 23 15:40:56 nvidia kernel: [1233209.870925] cpufreq: cpu0,cur:1333000,set:1190400,set ndiv:93
Apr 23 15:40:56 nvidia kernel: [1233209.872137] cpufreq: cpu0,cur:1373000,set:1510400,set ndiv:118

P.S. jetson_release information:

root@nvidia:/home/nvidia# jetson_release -v
Software part of jetson-stats 4.2.1 - (c) 2023, Raffaello Bonghi
Model: NVIDIA Orin Nano Developer Kit - Jetpack 5.1.1 [L4T 35.3.1]
NV Power Mode[0]: 15W
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
 - 699-level Part Number: 699-13767-0003-300 L.3
 - P-Number: p3767-0003
 - Module: NVIDIA Jetson Orin Nano (8GB ram)
 - SoC: tegra23x
 - CUDA Arch BIN: 8.7
 - Codename: P3768
Platform:
 - Machine: aarch64
 - System: Linux
 - Distribution: Ubuntu 20.04 Focal Fossa
 - Release: 5.10.104-tegra
 - Python: 3.8.10
jtop:
 - Version: 4.2.1
 - Service: Inactive
Libraries:
 - CUDA: 11.4.315
 - cuDNN: 8.6.0.166
 - TensorRT: 5.1.1
 - VPI: 2.2.7
 - OpenCV: 4.5.4 - with CUDA: NO
root@nvidia:/home/nvidia#

Hi,

A process being killed is usually caused by the system running out of memory.
Please verify this by running the tegrastats tool while the inference server is running.

$ sudo tegrastats
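
If it helps, here is a rough sketch for reading the result. It is not an official NVIDIA script. The helper names ram_usage and oom_events are made up for illustration, and it assumes each tegrastats line contains a field like "RAM 2448/7620MB" and that the kernel logs the usual "Out of memory" / "oom-killer" messages when the OOM killer fires. The --interval and --logfile options should be available, but please check tegrastats --help on your release.

```shell
#!/bin/sh
# Hypothetical helper: read tegrastats lines on stdin and print RAM usage.
# Assumes a field of the form "RAM <used>/<total>MB" on each line.
ram_usage() {
    sed -n 's/.*RAM \([0-9][0-9]*\)\/\([0-9][0-9]*\)MB.*/\1 \2/p' |
    while read -r used total; do
        # Integer percentage of RAM in use
        echo "used=${used}MB total=${total}MB pct=$((used * 100 / total))"
    done
}

# Hypothetical helper: keep only kernel log lines that look like OOM-killer
# activity. The exact message text can vary between kernel versions.
oom_events() {
    grep -i -E 'out of memory|oom-killer|oom_reaper'
}

# Usage on the device (not executed here):
#   sudo tegrastats --interval 1000 --logfile tegrastats.log   # in one terminal
#   ram_usage < tegrastats.log                                  # after the process dies
#   sudo dmesg | oom_events
```

If the percentage climbs close to 100 just before the process disappears, or oom_events prints a "Killed process" line naming your inference server, the out-of-memory explanation is very likely.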

Thanks.