JetPack 4.6.3: dmesg hangs the kernel and the device reboots

Hey,

@suhash - for the NVidia staff:

I recently updated our images from JetPack 4.5.x to 4.6.3.

When the preempt-rt patches are applied to the kernel (./rt-patch.sh apply-patches), calling dmesg causes the kernel to lock up, and the device reboots after a few seconds.

Using journalctl -k works fine.
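
Until there is a fix, the journalctl fallback can be scripted. This is a minimal sketch, not something shipped with JetPack: it assumes that a kernel built with the preempt-rt patches reports "PREEMPT RT" (or "PREEMPT_RT") in the output of uname -v, and kernel_log_cmd is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: pick a kernel-log command from the kernel version string.
# Assumption: preempt-rt kernels report "PREEMPT RT" (or "PREEMPT_RT") in `uname -v`.
# kernel_log_cmd is a hypothetical helper name, not part of JetPack.
kernel_log_cmd() {
    version_string="$1"
    case "$version_string" in
        *"PREEMPT RT"*|*PREEMPT_RT*)
            # dmesg is reported to hang on these builds, so read the journal instead.
            echo "journalctl -k --no-pager"
            ;;
        *)
            echo "dmesg"
            ;;
    esac
}

# Usage on the device:
#   $(kernel_log_cmd "$(uname -v)")
```

The helper only prints the command, so it can be checked on a host PC before anything is run on the device.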

I see this with TX2 and Xavier AGX devkits.

@madisox hinted at some changes in kernel/printk 32.7.2 → 32.7.3 (preempt-rt: dmesg causes device hang and reboot · Issue #1165 · OE4T/meta-tegra · GitHub). Could there be a patch missing for preempt-rt?

Hi,
Please check if the steps below are correct:

  1. You have Xavier in Jetpack 4.5(or 4.5.1) with RT kernel image
  2. You build OTA images on Jetpack 4.6.3 with RT kernel image
  3. After upgrading the Xavier through OTA update, the system cannot boot up

Would like to make sure we understand the use-case. Please help check and confirm it.

The problem isn’t with OTA. I actually simply re-flashed the device for the test.

The problem is that typing the dmesg command from the serial console or over SSH hangs the kernel (the command never returns or prints anything) and the device reboots. This happens only when the preempt-rt patches have been applied to the kernel sources.

Hi,
Please try Jetpack 4.6 (r32.6.1) instead of 4.6.3, to see if it works in the previous release.

Hi,

I have a build running to verify this with 4.6. But I cannot downgrade, I need 4.6.3 to get support for the Jetson TX2 modules with new Hynix memory. New modules are already on their way.

Are you able to reproduce the issue on your side with 4.6.3?

I can confirm dmesg works with Jetpack 4.6(r32.6.1).

@DaneLLL this turns out to be more severe than originally thought. Simply stepping through code with remote GDB makes the device hang and reboot. Recompiling the kernel without the preempt-rt patches makes the problem disappear, but I need the patches for our application.

I tried with both TX2 and Xavier AGX devkits.

Hi,
We are investigating the issue. Will update if there are further findings.

@DaneLLL thanks for looking into this.

Alternatively, which sw component should be changed to bring support for the new Hynix memory? cboot only? If so, would it be feasible to apply the patches only on top of JetPack 4.5.1 cboot sources?

Hi,
We are not able to support new modules in previous release(s). Will try to check this on latest release and update.

Hi @DaneLLL,

Are you able to reproduce the issue on your side?

Hi @DaneLLL, do you have any update to share on this issue?

Hi,
It is under investigation. There is no fix for this yet.

Thanks for the update

I don’t know if this will help or not, but here is a wild possibility to get more information, and you’d have to do this over serial console so that it logs to the host PC:

sudo -s
strace dmesg -DDD -I 4 -f 1>/dev/null

(be sure to enable serial console log before starting)

Hi,

we currently have a temporary solution:
Please comment out the following line in Linux_for_Tegra/source/public/kernel/kernel-4.9/kernel/printk/printk.c

error = syslog_print_all(buf, len, clear);

It’s line 1492 on my side, inside the function do_syslog(). Make sure to locate it correctly if the line number is different. After the change, follow the developer guide to apply the rt patches and build the kernel.
https://docs.nvidia.com/jetson/archives/l4t-archived/l4t-3273/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide/kernel_custom.html
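
Since the line number may differ between source trees, the statement can be found by its text instead. The following is a minimal sketch under that assumption. find_syslog_call and comment_out_syslog_call are hypothetical helper names, and the script assumes the statement appears in printk.c exactly as quoted above.

```shell
#!/bin/sh
# Sketch only: locate and comment out the statement quoted above by its text,
# so the result does not depend on the line number.
# Both function names are hypothetical helpers, not part of the L4T tooling.

# Print "<line number>:<line>" for each occurrence of the statement.
find_syslog_call() {
    grep -n -F 'error = syslog_print_all(buf, len, clear);' "$1"
}

# Wrap the statement in a C comment in place, keeping a backup in "<file>.orig".
comment_out_syslog_call() {
    sed -i.orig \
        's|^\([[:space:]]*\)\(error = syslog_print_all(buf, len, clear);\)|\1/* \2 */|' \
        "$1"
}

# Usage, from the directory that contains Linux_for_Tegra:
#   f=Linux_for_Tegra/source/public/kernel/kernel-4.9/kernel/printk/printk.c
#   find_syslog_call "$f"          # confirm it is the call inside do_syslog()
#   comment_out_syslog_call "$f"
```

Running find_syslog_call first is deliberate: it shows where the statement occurs, so it can be confirmed to sit inside do_syslog() before the file is modified.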

The device should boot up correctly without hanging up, and display should also be working. But note that there are still two limitations (hence temporary):

  1. Use either headless mode or a 1080p monitor, as there are still some compatibility issues with 4K monitors. Not sure if 2K works; you may try it yourself.
  2. Do not change the resolution when connected to a real monitor; the Ubuntu Settings app will crash when you try to access the display section.

Hi @DaveYYY
I have tried the suggested solution. It does not solve any of the issues.

I understand that silencing dmesg, which basically disables its functionality, prevents the device from locking and rebooting. But since dmesg no longer returns anything, it becomes completely useless.

The kernel is locking somewhere else which I believe isn’t related to dmesg.

Try connecting a SATA drive to the TX2 devkit and run the smart-log --test command followed by smart-log. A lock will occur and you will lose access to the disk.

Try running gdbserver and step through some code. The kernel randomly locks up and the device will reboot.

Did you try strace from a serial console?

Sorry for the late response. Have you managed to get the issue resolved, or do you still need support? Thanks

Hi @kayccc

Yes, we still need help with this one. The kernel locks up in several places with nothing reported to the console.

The easiest way to reproduce it is just to call dmesg. But gdbserver, smartctl and other utilities can lock it up too.

Are you able to reproduce the issue on your side with the instructions I previously gave?

Thanks