Drive Orin System hang

Please provide the following info (tick the boxes after creating this topic):
Software Version
[*] DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other

Target Operating System
[*] Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
[*] DRIVE AGX Orin Developer Kit (not sure its number)
other

SDK Manager Version
[*] 1.9.1.10844
other

Host Machine Version
[*] native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other

When we run a simple CUDA test program, the whole system hangs after a few minutes and SSH stops responding. The console prints the error messages below.
I searched the SDK source code for keywords from these messages, and they do not seem to come from the kernel. Do they come from firmware?
How can we debug this? How can we get the firmware source code and build it? Thanks.

Time taken from Error Reporting to SEH: 388 microseconds
Time taken from Error Reporting to SEH: 353 microseconds
Time taken from Error Reporting to SEH: 264 microseconds
Time taken from Error Reporting to SEH: 140 microseconds
DemoApp: ErrCode-0x30000004 ReptrId-0x8060 ErrAttr-0x0
EPS TimeStamp: 0xd038ed1c
DemoApp: ErrCode-0x30000008 ReptrId-0x8060 ErrAttr-0x0
EPS TimeStamp: 0xd038fef2
DemoApp: ErrCode-0x30000004 ReptrId-0x8060 ErrAttr-0x0
EPS TimeStamp: 0xd0390f5e
DemoApp: ErrCode-0x89abcdef ReptrId-0x8013 ErrAttr-0x0
EPS TimeStamp: 0xd039242e
process qm error
 MPHY0 periodic read back Error. 0x0, 0x0, 0x1010100
 MPHY0 periodic read back Error. 0x1, 0xdad800, 0x400
 MPHY0 periodic read back Error. 0x2, 0x0, 0x1010100
 MPHY0 periodic read back Error. 0x3, 0x0, 0x400
 MPHY1 periodic read back Error. 0x0, 0x0, 0x1010100
 MPHY1 periodic read back Error. 0x1, 0xdad800, 0x400
 MPHY1 periodic read back Error. 0x2, 0x0, 0x1010100
 MPHY1 periodic read back Error. 0x3, 0x0, 0x400
 MPHY0 periodic read back Error. 0x1, 0xdad800, 0x300
 MPHY0 periodic read back Error. 0x3, 0x0, 0x0
 MPHY1 periodic read back Error. 0x1, 0xdad800, 0x300
 MPHY1 periodic read back Error. 0x3, 0x0, 0x0
iodic read back Error. 0x1, 0xdad800, 0x300_foundation/3rdparty/arm/arm-trusted-firmware/../arm-trusted-firmware-private/plat/nvidia/tegra/soc/t234/plat_sip_calls.c <128>
 MPHY0 periodic read back Error. 0x3, 0x0, 0x0
 MPHY1 periodic read back Error. 0x1, 0xdad800, 0x300
 MPHY1 periodic read back Error. 0x3, 0x0, 0x0
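
For triage, the error reports in a capture like the one above can be tallied by code and reporter ID before they are shared. The following is a minimal sketch that relies only on the line format visible in this log; the function name `tally_errors` and the regular expression are illustrative assumptions, and the script does not decode what any code means.

```python
import re
from collections import Counter

# Matches lines such as:
#   DemoApp: ErrCode-0x30000004 ReptrId-0x8060 ErrAttr-0x0
# The pattern is an assumption derived from the console output in this thread.
ERR_LINE = re.compile(
    r"ErrCode-(0x[0-9a-fA-F]+)\s+ReptrId-(0x[0-9a-fA-F]+)\s+ErrAttr-(0x[0-9a-fA-F]+)"
)


def tally_errors(log_text):
    """Count (ErrCode, ReptrId) pairs found in a console capture."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERR_LINE.search(line)
        if match:
            err_code, reporter_id, _attr = match.groups()
            counts[(err_code.lower(), reporter_id.lower())] += 1
    return counts


# Excerpt copied from the console output above.
SAMPLE = """\
DemoApp: ErrCode-0x30000004 ReptrId-0x8060 ErrAttr-0x0
EPS TimeStamp: 0xd038ed1c
DemoApp: ErrCode-0x30000008 ReptrId-0x8060 ErrAttr-0x0
EPS TimeStamp: 0xd038fef2
DemoApp: ErrCode-0x30000004 ReptrId-0x8060 ErrAttr-0x0
EPS TimeStamp: 0xd0390f5e
DemoApp: ErrCode-0x89abcdef ReptrId-0x8013 ErrAttr-0x0
"""

if __name__ == "__main__":
    for (code, reporter), count in tally_errors(SAMPLE).most_common():
        print(f"ErrCode={code} ReptrId={reporter} count={count}")
```

A summary like this makes it easier to see which reporter IDs dominate a long capture, but the meaning of each code still has to come from NVIDIA.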

Dear @user88818,
ssh lost response. console print following error msg

Does a hard reboot of the target fix the issue? How frequently does this error occur? Do you see these logs on the Tegra A console? Did running a CUDA sample trigger this issue? If you cannot access the Tegra console after reboot, try running aurixreset on the Aurix console.

Hi, thanks for your response.

  1. Yes, a hard reboot (power off → power on) fixes the issue.
  2. It is easy to reproduce.
  3. Those logs are from /dev/ttyACM0, the console I use to log in to Ubuntu.
  4. We wrote a simple CUDA program. We can provide it if you need it.
  5. Before I reboot, the Ubuntu system is hung, but I can still access /dev/ttyACM1 and use the poweroff → poweron commands to reboot the Ubuntu system.
    Sometimes I get the following error from the Ubuntu kernel (/dev/ttyACM0):
[ 1285.702785] task:utempter        state:D stack:    0 pid: 3626 ppid:  1925 flags:0x00000000
[ 1285.702787] Call trace:
[ 1285.702787]  __switch_to+0xc8/0x120
[ 1285.702790]  __schedule+0x344/0x900
[ 1285.702793]  schedule+0x64/0x120
[ 1285.702795]  wait_transaction_locked+0x88/0xe0
[ 1285.702796]  add_transaction_credits+0x58/0x310
[ 1285.702798]  start_this_handle+0xfc/0x4d0
[ 1285.702799]  jbd2__journal_start+0x128/0x280
[ 1285.702800]  __ext4_journal_start_sb+0x1ac/0x1d0
[ 1285.702803]  ext4_dirty_inode+0x50/0xa0
[ 1285.702805]  __mark_inode_dirty+0x1fc/0x500
[ 1285.702807]  generic_update_time+0x74/0x100
[ 1285.702808]  inode_update_time+0x58/0x70
[ 1285.702810]  file_update_time+0xe0/0x120
[ 1285.702811]  file_modified+0x38/0x50
[ 1285.702813]  ext4_buffered_write_iter+0x6c/0x180
[ 1285.702814]  ext4_file_write_iter+0x64/0x7a0
[ 1285.702816]  new_sync_write+0xfc/0x1a0
[ 1285.702818]  vfs_write+0x2a4/0x3d0
[ 1285.702821]  ksys_write+0x7c/0x110
[ 1285.702822]  __arm64_sys_write+0x28/0x40
[ 1285.702824]  el0_svc_common.constprop.0+0x84/0x1d0
[ 1285.702826]  do_el0_svc+0x38/0xb0
[ 1285.702829]  el0_svc+0x1c/0x30
[ 1285.702831]  el0_sync_handler+0xa8/0xb0
[ 1285.702834]  el0_sync+0x16c/0x180
[ 1285.702835] rcu: ====For debug only: End Printing Blocked Tasks====<print_other_cpu_stall>
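
The trace above shows a task in state D (uninterruptible sleep) waiting in the ext4/jbd2 journal path. When a capture contains many such dumps, a short script can list the blocked tasks and their top frames. This is a rough sketch that assumes the line layout printed above; `blocked_tasks` and both regular expressions are illustrative, not part of any NVIDIA or kernel tool.

```python
import re

# Header line of a blocked-task dump, e.g.
#   [ 1285.702785] task:utempter   state:D stack:    0 pid: 3626 ppid:  1925 flags:0x00000000
TASK = re.compile(
    r"task:(\S+)\s+state:(\S)\s+stack:\s*\d+\s+pid:\s*(\d+)\s+ppid:\s*(\d+)"
)
# Call-trace frame, e.g.
#   [ 1285.702787]  __switch_to+0xc8/0x120
FRAME = re.compile(r"\]\s+([A-Za-z_][\w.]*)\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+\s*$")


def blocked_tasks(log_text):
    """Return one dict per dumped task, with the frame names that follow it."""
    tasks = []
    for line in log_text.splitlines():
        task = TASK.search(line)
        if task:
            name, state, pid, ppid = task.groups()
            tasks.append({"name": name, "state": state,
                          "pid": int(pid), "ppid": int(ppid), "frames": []})
            continue
        frame = FRAME.search(line)
        if frame and tasks:
            tasks[-1]["frames"].append(frame.group(1))
    return tasks


# Excerpt copied from the kernel log above.
SAMPLE = """\
[ 1285.702785] task:utempter        state:D stack:    0 pid: 3626 ppid:  1925 flags:0x00000000
[ 1285.702787] Call trace:
[ 1285.702787]  __switch_to+0xc8/0x120
[ 1285.702790]  __schedule+0x344/0x900
[ 1285.702793]  schedule+0x64/0x120
[ 1285.702795]  wait_transaction_locked+0x88/0xe0
"""

if __name__ == "__main__":
    for task in blocked_tasks(SAMPLE):
        print(task["name"], task["state"], task["pid"], " <- ".join(task["frames"][:4]))
```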

Here is some more log output from /dev/ttyACM0. I am not sure whether it is related.

fatal error no data cpu:0 err:-4
fatal error no data cpu:1 err:-4
fatal error no data cpu:2 err:-4
fatal error no data cpu:3 err:-4
fatal error no data cpu:4 err:-4
fatal error no data cpu:5 err:-4
fatal error no data cpu:6 err:-4
fatal error no data cpu:7 err:-4
fatal error no data cpu:8 err:-4
fatal error no data cpu:9 err:-4
fatal error no data cpu:10 err:-4
fatal error no data cpu:11 err:-4
process non fatal
non fatal error cpu:1 1
non fatal error cpu:1 2
non fatal error cpu:1 3

Time taken from Error Reporting to SEH: 389 microseconds
Time taken from Error Reporting to SEH: 316 microseconds

Can you attempt to replicate the issue by running any CUDA sample application, as @SivaRamaKrishnaNV suggested?

Hi, we were able to reproduce the issue with a simplified TensorRT program. I've attached the source code here.

Sample description

/usr/src/tensorrt/bin/trtexec --onnx=resnet50-v2-7.onnx --saveEngine=gpu.engine --fp16

Dependency versions

CUDA and TensorRT from DRIVE OS 6.0.5 Linux:
CUDA: 11.4.291-1
TensorRT: 8.4.12-1

  1. Build the binary with CMake:
mkdir build && cd build
cmake ..
make -j
cp ../gpu.engine .
  2. Run the binary and wait 10 to 50 minutes. The issue should reproduce.
cd build
./main

Would you please take a look and give us some advice? Thanks.
tensorrt_stuck_sample.tar (27.3 MB)

Hi, I have also attached the compiled binary file here. Thanks.
main (218.0 KB)

Hi @VickNV @SivaRamaKrishnaNV
Is there any feedback? Is TensorRT supported by default on the DRIVE Orin system? Thanks.

I am unable to run the program as shown below. Would you be able to assist in checking it?

nvidia@tegra-ubuntu:~$ ./main
Load engine from :gpu.engine
main: /home/nvidia/liangchang/workspace/sampletest2/stdlog_sample/main.cpp:144: Task::Task(const char*, int, int, int): Assertion `engine' failed.
Aborted

Hi @VickNV
Thanks for your reply, and sorry I didn't explain this clearly before.
You need to put the file named "gpu.engine" in the same directory as the binary. This file was included in the previous tar.
I have attached it again here. cd to the extracted directory, then run main directly.
tensorrt_stuck_sample.tgz (25.1 MB)

/tmp/tensorrt_stuck_sample$ ls
CMakeLists.txt  gpu.engine  main  main.cpp  run.sh  tools.h

I still see the problem. Please help check it. Thanks.

nvidia@tegra-ubuntu:~/tensorrt_stuck_sample$ ls -l
total 28160
-rw-rw-r--. 1 nvidia nvidia      613 Mar  6 11:48 CMakeLists.txt
-rw-rw-r--. 1 nvidia nvidia 28588849 Feb 23 12:42 gpu.engine
-rwxr-xr-x. 1 nvidia nvidia   223248 Mar  8 23:32 main
-rw-rw-r--. 1 nvidia nvidia     7329 Mar  6 11:42 main.cpp
-rw-rw-r--. 1 nvidia nvidia       59 Mar  6 11:37 run.sh
-rw-rw-r--. 1 nvidia nvidia     2692 Mar  6 10:43 tools.h
nvidia@tegra-ubuntu:~/tensorrt_stuck_sample$ ./main
Load engine from :gpu.engine
main: /home/nvidia/liangchang/workspace/sampletest2/stdlog_sample/main.cpp:144: Task::Task(const char*, int, int, int): Assertion `engine' failed.
Aborted

Hmm, I can run it without problems. Let me check, and I will get back to you later.

nvidia@tegra-ubuntu:~/tensorrt_stuck_sample$ ls -l
total 28160
-rw-rw-r--. 1 nvidia nvidia      613 Mar  6 11:48 CMakeLists.txt
-rw-rw-r--. 1 nvidia nvidia 28588849 Feb 23 12:42 gpu.engine
-rwxr-xr-x. 1 nvidia nvidia   223248 Mar  8 23:32 main
-rw-rw-r--. 1 nvidia nvidia     7329 Mar  6 11:42 main.cpp
-rw-rw-r--. 1 nvidia nvidia       59 Mar  6 11:37 run.sh
-rw-rw-r--. 1 nvidia nvidia     2692 Mar  6 10:43 tools.h
nvidia@tegra-ubuntu:~/tensorrt_stuck_sample$ ./main
Load engine from :gpu.engine
Wednesday Wed Mar  8 23:57:58 2023 init log
Load engine from :gpu.engine
Wednesday Wed Mar  8 23:58:00 2023 init log
Load engine from :gpu.engine
Wednesday Wed Mar  8 23:58:02 2023 init log
Load engine from :gpu.engine
Wednesday Wed Mar  8 23:58:03 2023 init log
Load engine from :gpu.engine
Wednesday Wed Mar  8 23:58:05 2023 init log
....
Thursday Thu Mar  9 00:03:16 2023 enqueue 1
FPS:    7.99961 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.
33292   8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292 8.33292
Thursday Thu Mar  9 00:03:16 2023 enqueue 1

Hi @VickNV
Can you tell us your libnvinfer version, using the following commands? Very likely this is caused by gpu.engine not matching your libnvinfer version.
We plan to provide another mechanism that generates gpu.engine dynamically to avoid this issue. Thanks.

nvidia@tegra-ubuntu:~/tensorrt_stuck_sample$ ldd main | grep libnvinfer
        libnvinfer.so.8 => /lib/aarch64-linux-gnu/libnvinfer.so.8 (0x0000ffff7b784000)
nvidia@tegra-ubuntu:~/tensorrt_stuck_sample$ ls -l $(ldd main | grep libnvinfer | awk '{print $3}')
lrwxrwxrwx. 1 root root 20 Sep 21 22:02 /lib/aarch64-linux-gnu/libnvinfer.so.8 -> libnvinfer.so.8.4.12
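
The same check can be scripted so that the version string is easy to paste into a reply. This is a small sketch of the two shell commands above in Python; the helper names `ldd_library_path` and `soname_version` are hypothetical, and it assumes the library file name ends in the full version, as the symlink target above does.

```python
import re


def ldd_library_path(ldd_output, library_prefix="libnvinfer.so"):
    """Return the path that ldd reports for the first matching library, or None."""
    for line in ldd_output.splitlines():
        fields = line.split()
        # Expected shape: <soname> => <path> (<load address>)
        if len(fields) >= 3 and fields[0].startswith(library_prefix) and fields[1] == "=>":
            return fields[2]
    return None


def soname_version(filename):
    """Extract '8.4.12' from a file name such as 'libnvinfer.so.8.4.12'."""
    match = re.search(r"\.so\.(\d+(?:\.\d+)*)$", filename)
    return match.group(1) if match else None


# Strings copied from the terminal output above.
LDD_SAMPLE = "        libnvinfer.so.8 => /lib/aarch64-linux-gnu/libnvinfer.so.8 (0x0000ffff7b784000)\n"
LINK_TARGET = "libnvinfer.so.8.4.12"

if __name__ == "__main__":
    print("library path:", ldd_library_path(LDD_SAMPLE))
    print("library version:", soname_version(LINK_TARGET))
```

On the target, the link target can be obtained with `os.path.realpath()` on the path returned by `ldd_library_path`. A serialized engine generally has to be loaded by the TensorRT version that built it, so comparing this version on both machines is a quick first check.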


I'm a collaborator of @user88818. Isn't there only one version of TensorRT for DRIVE OS 6.0.5 Linux?
CUDA: 11.4.291-1
TensorRT: 8.4.12-1
The model engine was converted in the above environment.

Alternatively, you can generate gpu.engine yourself:

wget https://github.com/onnx/models/raw/main/vision/classification/resnet/model/resnet50-v2-7.onnx
/usr/src/tensorrt/bin/trtexec --onnx=resnet50-v2-7.onnx --saveEngine=gpu.engine --fp16

Thanks.

Hi @VickNV @SivaRamaKrishnaNV
Have you had a chance to follow up on this? Thanks.

After flashing to 6.0.5, I experienced a system hang when running your program. However, I did not see any unusual output. Have you encountered any messages related to this program, as you mentioned earlier in this thread?

Hi @VickNV
The error output comes from the serial consoles, /dev/ttyACM0 and /dev/ttyACM1. These are the error messages I attached at the very beginning of this thread, and they do not come from our program's source code. For example, from /dev/ttyACM0:

fatal error no data cpu:0 err:-4
fatal error no data cpu:1 err:-4
fatal error no data cpu:2 err:-4
fatal error no data cpu:3 err:-4
fatal error no data cpu:4 err:-4

I open the two consoles with:
sudo minicom -w -D /dev/ttyACM1
sudo minicom -w -D /dev/ttyACM0

Is the console log you provided complete? Can you please share the full console log? I have been running your program for the past two hours for the second time and haven’t experienced any system hang.

Hi @VickNV
I just tried twice.
The first test ran for about 90 minutes and did not reproduce the issue.
The second test ran for about 15 minutes and reproduced the system hang. After it reproduced, even ttyACM0 (the Ubuntu console) stopped responding. I have attached the two logs here.
ttyACM1.log (5.5 KB)
ttyACM0.log (6.3 KB)

In my experience, if you see output in which all FPS values are 0 while trying to reproduce, that run will no longer reproduce the issue. Just kill the program and relaunch it.
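
Spotting the all-zero case by eye is tedious during a long run, so the check can be scripted against the program's output. This is a minimal sketch, assuming the `FPS:` line format shown earlier in this thread; the function names are hypothetical and the exact formatting of a zero value (`0` or `0.0`) is a guess that `float()` handles either way.

```python
def parse_fps_line(line):
    """Return the FPS values from an 'FPS:' line, or None for any other line."""
    if not line.startswith("FPS:"):
        return None
    values = []
    for token in line[len("FPS:"):].split():
        try:
            values.append(float(token))
        except ValueError:
            return None  # unexpected token; skip the line rather than guess
    return values


def all_fps_zero(line):
    """True only when the line is an FPS line and every value on it is zero."""
    values = parse_fps_line(line)
    return bool(values) and all(value == 0.0 for value in values)


if __name__ == "__main__":
    for sample in ("FPS:    7.99961 8.33292 8.33292",
                   "FPS:    0 0 0",
                   "Thursday Thu Mar  9 00:03:16 2023 enqueue 1"):
        print(all_fps_zero(sample), "<-", sample)
```

A wrapper that restarts `./main` when `all_fps_zero` returns true would automate the kill-and-relaunch step, at the cost of hiding how often the all-zero state occurs.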

Could you provide more logs specifically around the time when the issue occurred? The messages you provided seem to appear even before the issue occurs.
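
One way to show which messages appear around the time of the hang is to prefix every captured console line with the host's wall-clock time. The following is a rough sketch of such a filter; `stamp_lines` is a hypothetical helper, and how the serial output is fed into it depends on the capture tool in use.

```python
from datetime import datetime, timezone


def stamp_lines(lines, now=lambda: datetime.now(timezone.utc)):
    """Yield each input line prefixed with the host time at which it was read."""
    for line in lines:
        stamp = now().isoformat(timespec="milliseconds")
        yield f"[{stamp}] {line.rstrip()}"


if __name__ == "__main__":
    # Demonstration on two lines copied from the console output in this thread.
    demo = ["fatal error no data cpu:0 err:-4\n",
            "Time taken from Error Reporting to SEH: 389 microseconds\n"]
    for stamped in stamp_lines(demo):
        print(stamped)
```

For a live capture, the same generator can wrap `sys.stdin` and print each result with `flush=True`, so that the last lines before a hang are not lost in a buffer. The `now` parameter exists only so the clock can be replaced in tests.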