JetPack 4.6.3 preempt-rt patched kernel: reboot loop

Hi,

Sorry for the late reply.

Can you share the log from a boot that hangs, so we can see where it stops?

What results do you get from these two test programs?

Detailed stack trace of the hang:

[  247.466875] INFO: task nvpmodel:4534 blocked for more than 120 seconds.
[  247.473483]       Tainted: G        W       4.9.337-airstack0.5.0-tegra #1
[  247.480351] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.488173] nvpmodel        D    0  4534   4527 0x00000108
[  247.488180] Call trace:
[  247.488184] [<00000000bd2ea222>] __switch_to+0xb0/0xd8
[  247.488187] [<000000001f49c4ac>] __schedule+0x288/0x720
[  247.488191] [<00000000091997ec>] schedule+0x40/0xd8
[  247.488197] [<00000000660ce9d2>] blk_mq_freeze_queue_wait+0x6c/0xb8
[  247.488201] [<0000000021b9e1f2>] blk_mq_queue_reinit_work+0x78/0x128
[  247.488204] [<00000000c79f1e82>] blk_mq_queue_reinit_dead+0x24/0x30
[  247.488208] [<00000000f2b9f360>] cpuhp_invoke_callback+0x120/0x990
[  247.488212] [<00000000eac6a39f>] cpuhp_down_callbacks+0x60/0xb0
[  247.488217] [<0000000046c20d34>] _cpu_down+0x298/0x3b0
[  247.488220] [<0000000055e6140d>] do_cpu_down+0x88/0x2d0
[  247.488224] [<00000000bf146c97>] cpu_down+0x24/0x30
[  247.488229] [<000000006a5dfcf2>] cpu_subsys_offline+0x20/0x30
[  247.488236] [<000000003b968a8a>] device_offline+0x84/0xd8
[  247.488239] [<00000000037b72fd>] online_store+0x4c/0xa0
[  247.488243] [<00000000d878b6de>] dev_attr_store+0x44/0x60
[  247.488247] [<00000000768ad4d4>] sysfs_kf_write+0x54/0x78
[  247.488250] [<00000000d15977e9>] kernfs_fop_write+0xc0/0x1d8
[  247.488254] [<00000000136dba23>] __vfs_write+0x48/0x118
[  247.488257] [<0000000082a80314>] vfs_write+0xac/0x1b0
[  247.488261] [<00000000b3924b09>] SyS_write+0x5c/0xc8
[  247.488264] [<000000000f9182a9>] __sys_trace_return+0x0/0x4
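
Read from the bottom up, the trace shows nvpmodel writing to a CPU's sysfs `online` attribute (`online_store`, `device_offline`, `cpu_down`) and then blocking in `blk_mq_freeze_queue_wait` while the hotplug teardown callback runs. Below is a minimal Python sketch for pulling that symbol chain out of a pasted hung-task trace so two hangs can be compared quickly. The helper name `call_chain` and the regular expression are assumptions made for illustration, not part of any NVIDIA or kernel tool.

```python
import re

# Matches kernel call-trace frames such as
# "[  247.488197] [<00000000660ce9d2>] blk_mq_freeze_queue_wait+0x6c/0xb8"
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def call_chain(log_text):
    """Return symbol names, innermost frame first (hypothetical helper)."""
    return FRAME_RE.findall(log_text)


# A few frames copied from the trace above.
sample = """\
[  247.488191] [<00000000091997ec>] schedule+0x40/0xd8
[  247.488197] [<00000000660ce9d2>] blk_mq_freeze_queue_wait+0x6c/0xb8
[  247.488217] [<0000000046c20d34>] _cpu_down+0x298/0x3b0
[  247.488239] [<00000000037b72fd>] online_store+0x4c/0xa0
"""

chain = call_chain(sample)
print(" <- ".join(chain))
```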

rt-tests issues here:

sudo rt-migrate-test -c -p 60
|--------------------------------                                      |
Iter:      0       1       2       3       4  
   0:    20063      74      60      51      51  
 len:    40063   20074   20060   20051   20051  
 loops: 307619  305819  305853  305903  306213  

   1:    20026      48      29      23      22  
 len:    40026   20048   20029   20023   20022  
 loops: 309298  303135  302968  303058  303221  

   2:    20030      66      31      29      24  
 len:    40030   20066   20031   20029   20024  
 loops: 307487  305640  305451  305440  305810  

   3:    20025      47      24      29      21  
 len:    40025   20047   20024   20029   20021  
 loops: 307629  305775  305791  305696  305856  

   4:    20027      33      27      26      21  
 len:    40027   20033   20028   20026   20021  
 loops: 307591  305281  305263  305423  305331  

   5:    20026      31      27      22      20  
 len:    40026   20031   20028   20022   20020  
 loops: 307736  305616  305725  305865  305893  

   6:    20025      31      26      22      20  
 len:    40025   20032   20026   20022   20020  
 loops: 307783  305632  305728  305850  305897  

   7:    20022      29      24      20      18  
 len:    40022   20029   20024   20020   20018  
 loops: 307583  305495  305460  305407  305516  

   8:    20025      32      27      22      20  
 len:    40025   20032   20027   20022   20020  
 loops: 307690  304798  304774  304842  304821  

   9:       30      58      25      24      19  
 len:    20030   20058   20025   20025   20019  
 loops:     45  305660  305301  305428  305361  

  10:    20026      33      27      25      21  
 len:    40026   20033   20027   20025   20021  
 loops: 307709  305673  305726  305846  305855  

  11:       48      68      41      43      36  
 len:    20048   20068   20041   20043   20036  
 loops:     70  305823  305547  305776  305739  

  12:       29      43      24      21      18  
 len:    20029   20043   20025   20021   20018  
 loops:     50  306162  306145  306249  306428  

  13:    20024      31      26      23      19  
 len:    40024   20031   20026   20023   20019  
 loops: 308037  305564  305635  305788  305807  

  14:    20031      37      28      27      25  
 len:    40031   20037   20028   20027   20025  
 loops: 307945  305265  305350  305368  305524  

  15:    20023      31      27      22      19  
 len:    40023   20031   20027   20022   20019  
 loops: 307769  305013  305167  305239  305352  

  16:       32      47      27      24      21  
 len:    20032   20047   20027   20024   20021  
 loops:     46  305516  305432  305587  305596  

  17:    20022      44      24      21      18  
 len:    40022   20044   20024   20021   20018  
 loops: 307355  305872  305663  305744  305721  

  18:    20026      31      25      22      20  
 len:    40026   20031   20026   20022   20020  
 loops: 307634  305755  305772  305858  305801  

  19:    20025      31      25      24      20  
 len:    40025   20032   20025   20024   20020  
 loops: 307923  305559  305475  305746  305837  

  20:       32      52      24      22      21  
 len:    20032   20052   20024   20022   20021  
 loops:     48  305701  305489  305570  305571  

  21:    20024      30      25      22      20  
 len:    40024   20031   20025   20022   20020  
 loops: 307851  305714  305665  305686  305906  

  22:    20026      30      25      21      20  
 len:    40026   20030   20025   20021   20020  
 loops: 307738  305816  305794  305917  305820  

  23:       43   20025      27      24      21  
 len:    20043   40025   20027   20024   20021  
 loops: 305728  307954  305594  305419  305628  

Parent pid: 7538
 Task 0 (prio 60) (pid 7539):
   Max: 20063 us
   Min: 29 us
   Tot: 360710 us
   Avg: 15029 us

 Task 1 (prio 61) (pid 7540):
   Max: 20025 us
   Min: 29 us
   Tot: 20982 us
   Avg: 874 us

 Task 2 (prio 62) (pid 7541):
   Max: 60 us
   Min: 24 us
   Tot: 675 us
   Avg: 28 us

 Task 3 (prio 63) (pid 7542):
   Max: 51 us
   Min: 20 us
   Tot: 609 us
   Avg: 25 us

 Task 4 (prio 64) (pid 7543):
   Max: 51 us
   Min: 18 us
   Tot: 535 us
   Avg: 22 us

 Failed!

and (note that the maximum latency shows up about once every twenty minutes of runtime, which tells me latencies aren’t guaranteed):

sudo sigwaittest -t 4 -f
#0: ID7624, P0, CPU3, I1000; #1: ID7625, P0, CPU0, Cycles 2481672
#2: ID7626, P0, CPU5, I1500; #3: ID7627, P0, CPU4, Cycles 1706388
#4: ID7628, P0, CPU5, I2000; #5: ID7629, P0, CPU4, Cycles 1300657
#6: ID7630, P0, CPU3, I2500; #7: ID7631, P0, CPU4, Cycles 1051301
#1 -> #0, Min    4, Cur   10, Avg   16, Max 5413
#3 -> #2, Min    4, Cur   19, Avg   17, Max  360
#5 -> #4, Min    4, Cur   17, Avg   18, Max 3104
#7 -> #6, Min    5, Cur   18, Avg   18, Max  301
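
Since the large maximum only shows up occasionally, it can help to log the sigwaittest summary lines and flag the thread pairs whose maximum exceeds a limit. The following is a rough Python sketch, assuming the summary format shown above and assuming the values are microseconds. The 1000 limit and the helper names `parse_pairs` and `outliers` are made up for illustration.

```python
import re

# Matches sigwaittest summary lines such as
# "#1 -> #0, Min    4, Cur   10, Avg   16, Max 5413"
PAIR_RE = re.compile(
    r"#(\d+)\s*->\s*#(\d+),\s*Min\s+(\d+),\s*Cur\s+(\d+),"
    r"\s*Avg\s+(\d+),\s*Max\s+(\d+)"
)


def parse_pairs(text):
    """Return one dict per thread pair found in the text (hypothetical helper)."""
    rows = []
    for m in PAIR_RE.finditer(text):
        src, dst, lo, cur, avg, hi = map(int, m.groups())
        rows.append({"pair": (src, dst), "min": lo, "cur": cur, "avg": avg, "max": hi})
    return rows


def outliers(rows, limit):
    """Return the pairs whose maximum is above the limit (same unit as the log)."""
    return [r["pair"] for r in rows if r["max"] > limit]


# Summary lines copied from the run above.
sample = """\
#1 -> #0, Min    4, Cur   10, Avg   16, Max 5413
#3 -> #2, Min    4, Cur   19, Avg   17, Max  360
#5 -> #4, Min    4, Cur   17, Avg   18, Max 3104
#7 -> #6, Min    5, Cur   18, Avg   18, Max  301
"""

rows = parse_pairs(sample)
print(outliers(rows, 1000))
```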

Possibly related issue?: R32.7.1 / 4.9.253-rt168 : INFO: possible circular locking dependency detected (nvpmodel: all_q_mutex + &hp->lock) - #8 by SiarheiLiakh

Hi @DaveYYY, do you have an update on the investigation on your side?

Thanks

Hi @DaveYYY @suhash

Are you investigating this issue at NVIDIA?

Hi,

Sorry for coming back to this months later. We’ve been flooded with JetPack 6 issues recently.

As you said earlier that you cannot stay at 4.6.2 because of the new Hynix DRAM chips, I’m assuming that you don’t need new features other than the PCN change; therefore, can you please try the pre-built RT kernel on our APT server?

I just checked that both nvidia-l4t-rt-kernel and nvidia-l4t-rt-kernel-headers are still at 4.9.253-rt168-tegra, which is the same version that 4.6.1 uses.

That is, flash the device with the default kernel from 4.6.4, and then install the RT kernel as instructed in:
https://docs.nvidia.com/jetson/archives/l4t-archived/l4t-3274/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide/kernel_custom.html#

It boots up successfully on both an AGX Xavier and a TX2 that were originally flashed with 4.6.4.

Issue is still present and totally reproducible on a TX2 dev kit. Dev kit was flashed with 4.6.4 using your GUI tools, then we applied the RT kernel as shown in this post.

After seven successful boot sequences, the eighth hangs, as I’ve been seeing since this past September.

The issue is fully reproducible using NVIDIA hardware, the pre-built kernel, and NVIDIA flashing tools.

Please advise.

@DaveYYY, please check this patch, which fixes a mistake in printk: [PATCH] printk::syslog_print_all out label logbuf_unlock_irq by lfdmn · Pull Request #43 · OE4T/linux-tegra-4.9 · GitHub

It seems to solve the locks with dmesg (using the syslog interface), smartctl and gdbserver that we have been having. I’ll run more thorough testing in the next few days.

It does not help with nvpmodel and hot plugging the cpu cores. We use MAXN so it’s not a problem for us, but it will be for anyone needing to do power management.

Hi all,

We can also reproduce this issue (boot hangs with the Denver cores disabled) on our side, and we will look into it.
However, we don’t really have the bandwidth right now to fix an issue that is only present on JetPack 4, and the final release of JetPack 4 (4.6.5) is scheduled for release soon, so a fix may not make it into that release even if we have one. So please use MAXN, MAXP CORE ALL, or a custom power config that enables all the cores as a workaround for now.
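
To confirm that the selected power mode really leaves every core online, the kernel's `present` and `online` CPU lists in sysfs can be compared after boot. This is a small Python sketch, assuming the standard `/sys/devices/system/cpu/present` and `/sys/devices/system/cpu/online` files. The helper names `parse_cpu_list` and `offline_cpus` are hypothetical.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0,3-5" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or trailing comma
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def offline_cpus(base="/sys/devices/system/cpu"):
    """Return CPUs that are present but not online (hypothetical helper)."""
    present = parse_cpu_list(Path(base, "present").read_text())
    online = parse_cpu_list(Path(base, "online").read_text())
    return [cpu for cpu in present if cpu not in online]
```

Calling `offline_cpus()` on the device should return an empty list when the workaround is in effect. A non-empty list means some cores were taken offline, which is the condition associated with the hang in this thread.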

@DaveYYY any chance you could merge this patch in 4.6.5 or fix the printk patching script?

@DaveYYY The TX2i is not EOL until April 2028.

Are you saying that you are going to release a version of JetPack 5 that will work for the TX2i, or that NVIDIA is no longer supporting JetPack for the TX2i through EOL?

I think it’s enough if we put this on the eLinux page:
https://elinux.org/Jetson/L4T/r32.7.x_patches

That’s the case.

@DaveYYY, we created a product a while back that uses the TX2i and set the EOL of our product commensurate with the TX2i EOL. Just so I am clear for future product designs, how does NVIDIA view software support for products that are still in the active life-cycle? Do you have something similar to what Ubuntu publishes so this mistake isn’t made again?

https://ubuntu.com/about/release-cycle

Post-flashing patches are great for hobbyists or researchers that use the dev-kit, but if companies are integrating NVIDIA Jetsons into embedded products, these patches are not sufficient. Our customers don’t have access to the console to apply these patches.

An EOL date of 2028 only means that you can still buy the product until 2028; it doesn’t mean we will keep releasing software updates until then.