Drive Orin System hang

user88818 · February 21, 2023, 4:18am

Software Version: DRIVE OS 6.0.5
Target Operating System: Linux
Hardware Platform: DRIVE AGX Orin Developer Kit (not sure its number)
SDK Manager Version: 1.9.1.10844
Host Machine Version: native Ubuntu Linux 20.04 Host installed with SDK Manager

After a simple CUDA test program runs for a few minutes, the whole system hangs and ssh stops responding. The console prints the error messages below.
I searched the SDK source code for these keywords, and the messages do not seem to come from the kernel. Do they come from firmware?
How can we debug this? How can we get the firmware source code and build it? Thanks.

Time taken from Error Reporting to SEH: 388 microseconds
Time taken from Error Reporting to SEH: 353 microseconds
Time taken from Error Reporting to SEH: 264 microseconds
Time taken from Error Reporting to SEH: 140 microseconds
DemoApp: ErrCode-0x30000004 ReptrId-0x8060 ErrAttr-0x0
EPS TimeStamp: 0xd038ed1c
DemoApp: ErrCode-0x30000008 ReptrId-0x8060 ErrAttr-0x0
EPS TimeStamp: 0xd038fef2
DemoApp: ErrCode-0x30000004 ReptrId-0x8060 ErrAttr-0x0
EPS TimeStamp: 0xd0390f5e
DemoApp: ErrCode-0x89abcdef ReptrId-0x8013 ErrAttr-0x0
EPS TimeStamp: 0xd039242e
process qm error
 MPHY0 periodic read back Error. 0x0, 0x0, 0x1010100
 MPHY0 periodic read back Error. 0x1, 0xdad800, 0x400
 MPHY0 periodic read back Error. 0x2, 0x0, 0x1010100
 MPHY0 periodic read back Error. 0x3, 0x0, 0x400
 MPHY1 periodic read back Error. 0x0, 0x0, 0x1010100
 MPHY1 periodic read back Error. 0x1, 0xdad800, 0x400
 MPHY1 periodic read back Error. 0x2, 0x0, 0x1010100
 MPHY1 periodic read back Error. 0x3, 0x0, 0x400
 MPHY0 periodic read back Error. 0x1, 0xdad800, 0x300
 MPHY0 periodic read back Error. 0x3, 0x0, 0x0
 MPHY1 periodic read back Error. 0x1, 0xdad800, 0x300
 MPHY1 periodic read back Error. 0x3, 0x0, 0x0
iodic read back Error. 0x1, 0xdad800, 0x300_foundation/3rdparty/arm/arm-trusted-firmware/../arm-trusted-firmware-private/plat/nvidia/tegra/soc/t234/plat_sip_calls.c <128>
 MPHY0 periodic read back Error. 0x3, 0x0, 0x0
 MPHY1 periodic read back Error. 0x1, 0xdad800, 0x300
 MPHY1 periodic read back Error. 0x3, 0x0, 0x0

SivaRamaKrishnaNV · February 21, 2023, 6:06am

Dear @user88818,
ssh lost response. console print following error msg

Does a hard reboot of the target fix the issue? How frequent is this error? Do you see these logs on the Tegra A console? Did running a CUDA sample trigger this issue? If you cannot access the Tegra console after reboot, try running aurixreset on the Aurix console.

user88818 · February 21, 2023, 6:49pm

Hi, thanks for your response.

Yes, a hard reboot (power off → power on) fixes the issue.
It is easy to reproduce.
Those logs are from /dev/ttyACM0, the console I use to log in to Ubuntu.
We wrote a simple CUDA program; we can provide it if you need it.
Before the reboot, the Ubuntu system hangs, but I can still access /dev/ttyACM1 and use the poweroff → poweron commands there to reboot the Ubuntu system.
Sometimes I get the following error from the Ubuntu kernel (/dev/ttyACM0):

[ 1285.702785] task:utempter        state:D stack:    0 pid: 3626 ppid:  1925 flags:0x00000000
[ 1285.702787] Call trace:
[ 1285.702787]  __switch_to+0xc8/0x120
[ 1285.702790]  __schedule+0x344/0x900
[ 1285.702793]  schedule+0x64/0x120
[ 1285.702795]  wait_transaction_locked+0x88/0xe0
[ 1285.702796]  add_transaction_credits+0x58/0x310
[ 1285.702798]  start_this_handle+0xfc/0x4d0
[ 1285.702799]  jbd2__journal_start+0x128/0x280
[ 1285.702800]  __ext4_journal_start_sb+0x1ac/0x1d0
[ 1285.702803]  ext4_dirty_inode+0x50/0xa0
[ 1285.702805]  __mark_inode_dirty+0x1fc/0x500
[ 1285.702807]  generic_update_time+0x74/0x100
[ 1285.702808]  inode_update_time+0x58/0x70
[ 1285.702810]  file_update_time+0xe0/0x120
[ 1285.702811]  file_modified+0x38/0x50
[ 1285.702813]  ext4_buffered_write_iter+0x6c/0x180
[ 1285.702814]  ext4_file_write_iter+0x64/0x7a0
[ 1285.702816]  new_sync_write+0xfc/0x1a0
[ 1285.702818]  vfs_write+0x2a4/0x3d0
[ 1285.702821]  ksys_write+0x7c/0x110
[ 1285.702822]  __arm64_sys_write+0x28/0x40
[ 1285.702824]  el0_svc_common.constprop.0+0x84/0x1d0
[ 1285.702826]  do_el0_svc+0x38/0xb0
[ 1285.702829]  el0_svc+0x1c/0x30
[ 1285.702831]  el0_sync_handler+0xa8/0xb0
[ 1285.702834]  el0_sync+0x16c/0x180
[ 1285.702835] rcu: ====For debug only: End Printing Blocked Tasks====<print_other_cpu_stall>

user88818 · February 21, 2023, 7:04pm

Some more logs from /dev/ttyACM0. Not sure if they are related.

fatal error no data cpu:0 err:-4
fatal error no data cpu:1 err:-4
fatal error no data cpu:2 err:-4
fatal error no data cpu:3 err:-4
fatal error no data cpu:4 err:-4
fatal error no data cpu:5 err:-4
fatal error no data cpu:6 err:-4
fatal error no data cpu:7 err:-4
fatal error no data cpu:8 err:-4
fatal error no data cpu:9 err:-4
fatal error no data cpu:10 err:-4
fatal error no data cpu:11 err:-4
process non fatal
non fatal error cpu:1 1
non fatal error cpu:1 2
non fatal error cpu:1 3

Time taken from Error Reporting to SEH: 389 microseconds
Time taken from Error Reporting to SEH: 316 microseconds

VickNV · February 21, 2023, 7:29pm

Can you attempt to replicate the issue by running any CUDA sample application, as @SivaRamaKrishnaNV suggested?

user88818 · March 6, 2023, 5:25pm

Hi, we were able to reproduce the issue with a simplified TensorRT program. I’ve attached the source code here.

Sample description

This sample is based on the one from the topic “DLA and GPU cores at the same time” (post #3 by eyalhir74).
The program starts 30 threads (std::thread), and each thread repeatedly runs the model (ResNet FP16). After about 10~50 minutes, the program and the system get stuck: some commands, such as head, top, ssh and iostat, cannot be executed and produce no output.
gpu.engine is built from the ONNX model at https://github.com/onnx/models/raw/main/vision/classification/resnet/model/resnet50-v2-7.onnx
We convert ResNet to an FP16 engine with

/usr/src/tensorrt/bin/trtexec --onnx=resnet50-v2-7.onnx --saveEngine=gpu.engine --fp16

Deps version

CUDA and TensorRT from Drive Orin Linux 6.0.5
Cuda: 11.4.291-1
TensorRT: 8.4.12-1

Build the binary with CMake

mkdir build && cd build
cmake ..
make -j
cp ../gpu.engine .

Run the binary and wait for 10~50 minutes. This reproduces the issue.

cd build
./main

Would you please take a look and give some advice? Thanks.
tensorrt_stuck_sample.tar (27.3 MB)

user88818 · March 7, 2023, 5:41am

Hi, I have also attached the compiled binary file here. Thanks.
main (218.0 KB)

user88818 · March 8, 2023, 4:37pm

Hi @VickNV @SivaRamaKrishnaNV
Is there any feedback? Is TensorRT supported by default on the DRIVE Orin system? Thanks.

VickNV · March 8, 2023, 10:34pm

I am unable to run the program as shown below. Would you be able to assist in checking it?

nvidia@tegra-ubuntu:~$ ./main
Load engine from :gpu.engine
main: /home/nvidia/liangchang/workspace/sampletest2/stdlog_sample/main.cpp:144: Task::Task(const char*, int, int, int): Assertion `engine’ failed.
Aborted

user88818 · March 8, 2023, 11:44pm

Hi @VickNV
Thanks for your reply. Sorry, I didn’t explain clearly before.
You need to put the file named “gpu.engine” into the same directory; this file was included in the previous tar.
I attach it here again. cd to the extracted directory, then run main directly.
tensorrt_stuck_sample.tgz (25.1 MB)

/tmp/tensorrt_stuck_sample$ ls
CMakeLists.txt  gpu.engine  main  main.cpp  run.sh  tools.h

VickNV · March 8, 2023, 11:51pm

I still see the problem. Please help check it. Thanks.

nvidia@tegra-ubuntu:~/tensorrt_stuck_sample$ ls -l
total 28160
-rw-rw-r--. 1 nvidia nvidia      613 Mar  6 11:48 CMakeLists.txt
-rw-rw-r--. 1 nvidia nvidia 28588849 Feb 23 12:42 gpu.engine
-rwxr-xr-x. 1 nvidia nvidia   223248 Mar  8 23:32 main
-rw-rw-r--. 1 nvidia nvidia     7329 Mar  6 11:42 main.cpp
-rw-rw-r--. 1 nvidia nvidia       59 Mar  6 11:37 run.sh
-rw-rw-r--. 1 nvidia nvidia     2692 Mar  6 10:43 tools.h
nvidia@tegra-ubuntu:~/tensorrt_stuck_sample$ ./main
Load engine from :gpu.engine
main: /home/nvidia/liangchang/workspace/sampletest2/stdlog_sample/main.cpp:144: Task::Task(const char*, int, int, int): Assertion `engine’ failed.
Aborted

user88818 · March 9, 2023, 12:00am

Hmm, I can run it without problems, but let me check and I will get back to you later.

nvidia@tegra-ubuntu:~/tensorrt_stuck_sample$ ls -l
total 28160
-rw-rw-r--. 1 nvidia nvidia      613 Mar  6 11:48 CMakeLists.txt
-rw-rw-r--. 1 nvidia nvidia 28588849 Feb 23 12:42 gpu.engine
-rwxr-xr-x. 1 nvidia nvidia   223248 Mar  8 23:32 main
-rw-rw-r--. 1 nvidia nvidia     7329 Mar  6 11:42 main.cpp
-rw-rw-r--. 1 nvidia nvidia       59 Mar  6 11:37 run.sh
-rw-rw-r--. 1 nvidia nvidia     2692 Mar  6 10:43 tools.h
nvidia@tegra-ubuntu:~/tensorrt_stuck_sample$ ./main
Load engine from :gpu.engine
Wednesday Wed Mar  8 23:57:58 2023 init log
Load engine from :gpu.engine
Wednesday Wed Mar  8 23:58:00 2023 init log
Load engine from :gpu.engine
Wednesday Wed Mar  8 23:58:02 2023 init log
Load engine from :gpu.engine
Wednesday Wed Mar  8 23:58:03 2023 init log
Load engine from :gpu.engine
Wednesday Wed Mar  8 23:58:05 2023 init log
....
Thursday Thu Mar  9 00:03:16 2023 enqueue 1
FPS:    7.99961 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.
33292   8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292
Thursday Thu Mar  9 00:03:16 2023 enqueue 1

user88818 · March 9, 2023, 12:13am

Hi @VickNV
Can you tell us your libnvinfer version, using the following commands? Very likely this is caused by gpu.engine not matching your libnvinfer version.
We plan to provide another mechanism that generates gpu.engine dynamically to avoid this issue. Thanks.

nvidia@tegra-ubuntu:~/tensorrt_stuck_sample$ ldd main | grep libnvinfer
        libnvinfer.so.8 => /lib/aarch64-linux-gnu/libnvinfer.so.8 (0x0000ffff7b784000)
nvidia@tegra-ubuntu:~/tensorrt_stuck_sample$ ls -l $(ldd main | grep libnvinfer | awk '{print $3}')
lrwxrwxrwx. 1 root root 20 Sep 21 22:02 /lib/aarch64-linux-gnu/libnvinfer.so.8 -> libnvinfer.so.8.4.12
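
The idea of regenerating gpu.engine on the target could be sketched roughly as below. This is only an illustration under assumptions: the stamp file gpu.engine.trtver and the function names are made up for the example, and only the trtexec flags already shown in this thread are used. The engine is rebuilt when it is missing or when the library behind libnvinfer.so.8 differs from the one recorded at the last build.

```bash
#!/usr/bin/env bash
# Sketch only. The ".trtver" stamp file is a hypothetical convention,
# not something TensorRT itself provides.

# Print the real file name behind the libnvinfer symlink,
# e.g. libnvinfer.so.8.4.12 (default path taken from the ldd output above).
installed_trt_version() {
    local lib="${1:-/lib/aarch64-linux-gnu/libnvinfer.so.8}"
    basename "$(readlink -f "$lib")"
}

# Succeed (exit status 0) when the engine has to be rebuilt.
needs_rebuild() {
    local engine="$1" stamp="$2" current="$3"
    [ -f "$engine" ] || return 0          # no engine yet
    [ -f "$stamp" ] || return 0           # unknown build version
    [ "$(cat "$stamp")" != "$current" ]   # built with another libnvinfer
}

# Rebuild the engine with trtexec if needed and record the version used.
ensure_engine() {
    local onnx="$1" engine="$2" stamp="$2.trtver" current
    current="$(installed_trt_version)"
    if needs_rebuild "$engine" "$stamp" "$current"; then
        /usr/src/tensorrt/bin/trtexec --onnx="$onnx" --saveEngine="$engine" --fp16 &&
            printf '%s\n' "$current" > "$stamp"
    fi
}

# Usage: ensure_engine resnet50-v2-7.onnx gpu.engine && ./main
```

Comparing the resolved library file name is a coarse check, but it is enough to catch the case where the engine was serialized on a different DRIVE OS release.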

const_ride · March 9, 2023, 2:28am

I’m a collaborator of @user88818. Isn’t there only one version of TensorRT for DRIVE OS 6.0.5 Linux?
CUDA: 11.4.291-1
TensorRT: 8.4.12-1
The model engine was converted in the above environment.

Or you can build gpu.engine yourself with

wget https://github.com/onnx/models/raw/main/vision/classification/resnet/model/resnet50-v2-7.onnx
/usr/src/tensorrt/bin/trtexec --onnx=resnet50-v2-7.onnx --saveEngine=gpu.engine --fp16

thanks.

user88818 · March 10, 2023, 5:41pm

Hi @VickNV @SivaRamaKrishnaNV
Have you had a chance to follow up on this? Thanks.

VickNV · March 10, 2023, 8:37pm

After flashing to 6.0.5, I experienced a system hang when running your program. However, I did not see any unusual output. Have you encountered any messages related to this program, as you mentioned earlier in this thread?

user88818 · March 10, 2023, 9:11pm

Hi @VickNV
The error output comes from the consoles, /dev/ttyACM0 and /dev/ttyACM1; these are the error messages I attached at the very beginning of this thread. Those errors are not related to the source code. For example,
from /dev/ttyACM0:

fatal error no data cpu:0 err:-4
fatal error no data cpu:1 err:-4
fatal error no data cpu:2 err:-4
fatal error no data cpu:3 err:-4
fatal error no data cpu:4 err:-4

sudo minicom -w -D /dev/ttyACM1
sudo minicom -w -D /dev/ttyACM0

VickNV · March 10, 2023, 10:24pm

Is the console log you provided complete? Can you please share the full console log? I have been running your program for the past two hours for the second time and haven’t experienced any system hang.

user88818 · March 11, 2023, 12:41am

Hi @VickNV
I just tried twice.
The 1st test lasted about 90 minutes and did not reproduce the issue.
The 2nd test lasted about 15 minutes and reproduced the system hang. After it reproduced, even ttyACM0 (the Ubuntu console) stopped responding. I attached two logs here.
ttyACM1.log (5.5 KB)
ttyACM0.log (6.3 KB)

In my experience, if during a reproduction attempt you see the program print FPS as all 0, the issue will not reproduce in that run anymore. Just kill the program and relaunch it.
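
The kill-and-relaunch step could be automated along these lines. This is a rough sketch, assuming the program prints lines of the form "FPS: v1 v2 ..." as in the earlier output and that the binary is named main; the function names are made up for the example.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, and the FPS line format is
# assumed from the sample output posted earlier in this thread.

# Succeed when the line is "FPS:" followed only by zero values.
fps_all_zero() {
    printf '%s\n' "$1" | awk '
        $1 != "FPS:" || NF < 2 { exit 1 }
        { for (i = 2; i <= NF; i++) if ($i !~ /^0+(\.0+)?$/) exit 1 }
    '
}

# Run ./main in a loop and restart it whenever an all-zero FPS line shows up.
watch_and_relaunch() {
    while true; do
        ./main 2>&1 | while IFS= read -r line; do
            printf '%s\n' "$line"
            if fps_all_zero "$line"; then
                pkill -x main   # assumes no other process is named "main"
                break
            fi
        done
        sleep 1
    done
}

# Usage (from the directory that contains main and gpu.engine):
#   watch_and_relaunch
```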

VickNV · March 11, 2023, 1:20am

Could you provide more logs specifically around the time when the issue occurred? The messages you provided seem to appear even before the issue occurs.
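
One way to make the timing visible is to prefix every console line with a host-side timestamp while capturing it. The sketch below is only an illustration: the function name is made up, and it assumes the serial port is readable and already configured (for example by an earlier minicom session).

```bash
#!/usr/bin/env bash
# Sketch only: stamp_lines is a hypothetical helper name.

# Prefix every line read from stdin with the current wall-clock time.
stamp_lines() {
    local line
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Usage (device names as used in this thread):
#   sudo cat /dev/ttyACM0 | stamp_lines | tee ttyACM0.log
#   sudo cat /dev/ttyACM1 | stamp_lines | tee ttyACM1.log
```

With both consoles captured this way, the last lines before the hang can be lined up against the time the program stopped printing.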
