DRIVE Orin system hang

Those are all the logs from starting ./main until the issue reproduced. I can’t say which one is from around the time the system hung, as the hang happens randomly and those logs don’t have timestamps.
But we noticed:

  • every time the system hangs, we can find these suspicious logs on the console, e.g.
ASSERT: /dvs/git/dirty/git-master_foundation/3rdparty/arm/arm-trusted-firmware/../arm-trusted-firmware-private/plat/nvidia/tegra/soc/t234/plat_sip_calls.c
fatal error no data cpu:0 err:-4
...
fatal error no data cpu:11 err:-4
  • even the Ubuntu console on ttyACM1 hangs
  • we checked the last kernel log; there is no crash and no suspicious log either.
[  760.462281] sched: RT throttling activated
[  793.457722] hrtimer: interrupt took 2644328960 ns
 # I used the power button to reboot here.
[    0.000000] Booting Linux on physical CPU 0x0000010000 [0x410fd421]

With all the logs and information, we can’t find any clue here. We hope NVIDIA can help figure out some clue. Thanks.

It appears that the system is restricting the CPU time allocated to non-real-time processes in order to prioritize real-time processes. Please try modifying the /etc/systemd/journald.conf file by setting “Storage=none” to reduce non-real-time tasks, and see if it helps with the problem.
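For example, the change can be applied like this (just a sketch, assuming the stock systemd-journald setup; a full reboot also works):

$ sudo sed -i 's/^#\?Storage=.*/Storage=none/' /etc/systemd/journald.conf
$ sudo systemctl restart systemd-journald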

I changed the /etc/systemd/journald.conf file to set “Storage=none”. The test is running now. I will update here once it reproduces.

But those two logs were printed well before the system hang. I posted them to explain that when the system hangs, there are no logs:

[  760.462281] sched: RT throttling activated
[  793.457722] hrtimer: interrupt took 2644328960 ns
 # I used the power button to reboot here.
[    0.000000] Booting Linux on physical CPU 0x0000010000 [0x410fd421]

Hi @VickNV
I reproduced it again with “Storage=none” set in /etc/systemd/journald.conf.
But this time, I got logs from /var/log/kern.log after I power cycled.

Oct 21 15:55:28 tegra-ubuntu kernel: [  149.267236] hrtimer: interrupt took 1585077344 ns
Oct 21 15:55:35 tegra-ubuntu kernel: [  154.437440] sched: RT throttling activated
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966021] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966030] rcu:        10-....: (6 GPs behind) idle=782/1/0x4000000000000002 softirq=0/0 fqs=1896
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966035]     (detected by 1, t=5252 jiffies, g=10605, q=252)
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966038] Task dump for CPU 10:
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966040] task:kworker/u24:3   state:R  running task     stack:    0 pid:  629 ppid:     2 flags:0x0000002a
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966052] Workqueue:  0x0 (events_unbound)
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966057] Call trace:
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966058]  __switch_to+0xc8/0x120
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966065]  __boot_cpu_mode+0x0/0x8
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966070] rcu: ====For debug only: Start Printing Blocked Tasks====<print_other_cpu_stall>
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966230] task:jbd2/vblkdev0p1 state:D stack:    0 pid:  581 ppid:     2 flags:0x00000028
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966232] Call trace:
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966233]  __switch_to+0xc8/0x120
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966236]  __schedule+0x344/0x900
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966241]  schedule+0x64/0x120
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966244]  io_schedule+0x24/0xc0
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966247]  bit_wait_io+0x20/0x60
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966248]  __wait_on_bit+0x80/0xf0
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966250]  out_of_line_wait_on_bit+0xa4/0xd0
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966253]  __wait_on_buffer+0x40/0x50
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966257]  jbd2_journal_commit_transaction+0x1158/0x1e60
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966260]  kjournald2+0xc4/0x270
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966263]  kthread+0x16c/0x1a0
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966266]  ret_from_fork+0x10/0x24
Oct 21 16:09:43 tegra-ubuntu kernel: [ 1004.966480] task:tmux: server    state:D stack:    0 pid: 1872 ppid:     1 flags:0x00000008

Have you rebooted the system after modifying the /etc/systemd/journald.conf file? I haven’t been able to reproduce the issue yet.

Yes, I modified it, rebooted the system, and then reproduced it.
If you run ./main for more than 30 minutes and can’t reproduce it, you can just stop it and restart ./main.
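In case it helps, that restart cycle can be scripted roughly like this (just a sketch; timeout is from GNU coreutils, 1800 s = 30 minutes):

while true; do
    timeout 1800 ./main   # stop main after 30 minutes if it is still running
    sleep 5               # brief pause before restarting it
done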

Got it. Additionally, have you noticed any improvement in the issue after disabling journal data storage?

Until now, I haven’t noticed any improvement. Thanks.

Hi @VickNV
I reproduced it multiple times, with
Storage=none

and with sched_rt_runtime_us reduced from the default 950000 to 95000:

root@tegra-ubuntu:/home/nvidia# echo 95000 > /proc/sys/kernel/sched_rt_runtime_us
root@tegra-ubuntu:/home/nvidia# cat /proc/sys/kernel/sched_rt_runtime_us
95000
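Note that writing to /proc is not persistent across reboots; to keep the value after a reboot (assuming the standard sysctl mechanism), it can also be set like this:

$ sudo sysctl -w kernel.sched_rt_runtime_us=95000
$ echo "kernel.sched_rt_runtime_us = 95000" | sudo tee -a /etc/sysctl.conf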

I got “rcu_preempt detected stalls on CPUs/tasks:” again and attached the kernel log here.
ttyACM0.log (59.9 KB)

I’m not familiar with journaling, but based on the call traces in this log, I suspect that the journal storage may not have been disabled successfully.

With journal storage disabled, I was not able to reproduce this issue, even after running the program for 3 hours once and 30 minutes three times.

I suspect that the journal storage may not have been disabled successfully.
After the reboot, I double-checked the configuration:

nvidia@tegra-ubuntu:~$ cat /etc/systemd/journald.conf | grep Storage
Storage=none
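To further confirm that journald has really stopped writing to disk, these checks can be used (just a sketch, assuming the default journal location under /var/log/journal):

$ sudo systemctl restart systemd-journald
$ journalctl --disk-usage      # should not grow with Storage=none (old files may remain)
$ ls -l /var/log/journal       # no new files should appear here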

Is that because you saw some call stacks related to the file system? On a running system, systemd-journald is not the only process that needs to write data to disk; other processes, e.g. syslogd and sshd, also need to write.

BTW, this is another reproduction and a log segment from it. For reproducing, can you try multiple times? If a run goes longer than 30 minutes, you can just kill the program (main) and restart it again.

[ 3223.340168]  ret_from_fork+0x10/0x24
[ 3223.340171] rcu: ====For debug only: End Printing Blocked Tasks====<print_other_cpu_stall>
[ 3227.005397] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #100!!!
[ 3227.005403] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #100!!!
[ 3227.005560] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #100!!!
[ 3227.005562] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #100!!!
[ 3227.005683] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #100!!!
[ 3227.005685] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #100!!!
[ 3227.005750] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #100!!!
[ 3227.005751] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #100!!!
[ 3227.008830] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #180!!!
[ 3227.008831] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #180!!!
ASSERT: /dvs/git/dirty/git-master_foundation/3rdparty/arm/arm-trusted-firmware/../arm-trusted-firmware-private/plat/nvidia/tegra/soc/t234/plat_sip_calls.c <128>
   MPHY0 periodic read back Error. 0x1, 0xdad800, 0x300
[ 3630.044896] INFO: task jbd2/vblkdev0p1:583 blocked for more than 120 seconds.
[ 3630.044905]       Tainted: G           OE     5.10.120-rt70-tegra #2
[ 3630.044907] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3630.044909] task:jbd2/vblkdev0p1 state:D stack:    0 pid:  583 ppid:     2 flags:0x00000028

I am still trying to reproduce the issue. If I am able to terminate the program using Ctrl+C in the ssh console and run it again without any problems, that means the issue has not been reproduced, correct?

Additionally, I would like to know how this issue is currently affecting your development.

If I am able to terminate the program using Ctrl+C in the ssh console and run it again without any problems, that means the issue has not been reproduced, correct?

That’s correct. When the issue reproduces, the SSH session stops responding.

Additionally, I would like to know how this issue is currently affecting your development.
This program is a super-simplified version of our application; gpu.engine is built from the commands below. The program just repeatedly generates some TensorRT workload.
In the real situation, our system hangs frequently, and we can’t do any meaningful testing or evaluation of our application.

wget https://github.com/onnx/models/raw/main/vision/classification/resnet/model/resnet50-v2-7.onnx
/usr/src/tensorrt/bin/trtexec --onnx=resnet50-v2-7.onnx --saveEngine=gpu.engine --fp16
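As a rough stand-in for what ./main does (not the actual program, just an approximation that uses separate processes instead of threads), several concurrent inference loops can be run against the generated engine:

for i in 1 2 3 4; do
    # each background loop keeps re-running inference on gpu.engine
    ( while true; do /usr/src/tensorrt/bin/trtexec --loadEngine=gpu.engine --iterations=1000 > /dev/null; done ) &
done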

If I prepare a setup and allow you to remotely access my DRIVE Orin so that you can reproduce it remotely, would that help? Thanks.

I still need a way to reproduce the issue so that our team can investigate it. Could you please confirm the detailed steps to reproduce the issue? As far as I understand, the steps are as follows:

  1. Flash DRIVE OS 6.0.5.
  2. Repeatedly run the “main” program for 30 minutes.
  3. If you are unable to terminate the program and run it again on the SSH console, it means the issue has been reproduced.

Additionally, could you please share the output of the following command?

$ cat /etc/nvidia/version-ubuntu-rootfs.txt

Yes, your reproduction steps are correct.
Here are the results.

nvidia@tegra-ubuntu:~$  cat /etc/nvidia/version-ubuntu-rootfs.txt
6.0.5.0-31732390

Can you still reproduce the issue when running inference repeatedly with a single thread?

Also, I wanted to let you know that I have not been able to reproduce the issue since yesterday afternoon, through today. However, I did notice that the “cat /etc/nvidia/version-ubuntu-rootfs.txt” command got stuck, although I was still able to run your program again.

Can you still reproduce the issue when running inference repeatedly with a single thread?

We tried that before; we can’t reproduce it with a single thread.

However, I did notice that the “cat /etc/nvidia/version-ubuntu-rootfs.txt” command got stuck,

If you connect a console to /dev/ttyACM0, you will see some error logs printed when the command gets stuck. I also met a similar issue before: the SSH connection was OK, but some commands got stuck.
Compared to the whole system getting stuck, this one is not urgent. Do you suspect they are related? Thanks.

Hi @VickNV
I forgot to mention that with the same production code running on Jetson Orin, we never met this kind of system hang issue. We have used Jetson Orin for a long time.
And I noticed that:

  • DRIVE Orin: the Linux kernel runs on a hypervisor.
  • Jetson Orin: the Linux kernel runs on bare-metal hardware.

I’m not sure if you can provide a way for us to do a quick test, running the Linux kernel on DRIVE Orin without the hypervisor.
Or maybe the hypervisor team can provide some clues?

The latest release of DRIVE OS, 6.0.6, is now available. Could you please confirm if the issue still persists on this version?

Hi @VickNV
I tried to upgrade with SDK Manager and got this flash error.


I tried to upgrade with the command line and got this error. I already power cycled my PC and DRIVE Orin; the problem still exists. I already upgraded my SDK Manager to the latest version.
This is the first time I’ve met this issue; I’ve flashed the system multiple times before.

NvShell>tegrarecovery x1 on
Info: Executing cmd: tegrarecovery, argc: 2, args: x1 on
Command Executed
NvShell>tegrareset x1
Info: Executing cmd: tegrareset, argc: 1, args: x1
NvShell>ERROR: MCU_PLTFPWRMGR: Tegra Reset Request failed - Preconditons not met