System with the RT kernel restarts randomly

Hi, NVIDIA team.

We are using JetPack 5.1.2 with the RT patch applied:

./kernel-5.10/scripts/rt-patch.sh apply-patches

I noticed that the system restarts randomly, and the debug log shows no abnormalities around the reboot:

orin-master login: nvidia
Password: 
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.10.120-rt70-tegra aarch64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.

Expanded Security Maintenance for Applications is not enabled.

18 updates can be applied immediately.
To see these additional updates run: apt list --upgradable

60 additional security updates can be applied with ESM Apps.
Learn more about enabling ESM Apps service at https://ubuntu.com/esm


The list of available updates is more than a week old.
To check for new updates run: sudo apt update
Last login: Wed Oct 25 11:54:13 CST 2023 from 10.27.87.242 on pts/0
nvidia@orin-master:~$ 
nvidia@orin-master:~$ 
nvidia@orin-master:~$ 
nvidia@orin-master:~$ start
-bash: start: command not found
nvidia@orin-master:~$ ^@ÿâ
[0000.062] I> MB1 (version: 1.2.0.0-t234-54845784-562369e5)
[0000.067] I> t234-A01-0-Silicon (0x12347) Prod
[0000.071] I> Boot-mode : Coldboot
[0000.075] I> Entry timestamp: 0x00000000
[0000.078] I> last_boot_error: 0x0
[0000.082] I> BR-BCT: preprod_dev_sign: 0
[0000.085] I> rst_source: 0x2, rst_level: 0x1
[0000.089] I> Task: SE error check
[0000.093] I> Task: Bootchain select WAR set
[0000.097] I> Task: Enable SLCG
[0000.099] I> Task: CRC check
[0000.102] I> Skip FUSE records CRC check as records_integrity fuse is not burned
[0000.109] I> Task: Initialize MB2 params
[0000.114] I> MB2-params @ 0x40060000
[0000.117] I> Task: Crypto init
[0000.120] I> Task: Perform MB1 KAT tests
[0000.124] I> Task: NVRNG health check
[0000.127] I> NVRNG: Health check success
[0000.131] I> Task: MSS Bandwidth limiter settings for iGPU clients
[0000.137] I> Task: Enabling and initialization of Bandwidth limiter
[0000.143] I> No request to configure MBWT settings for any PC!
[0000.149] I> Task: Secure debug controls
[0000.153] I> Task: strap war set
[0000.156] I> Task: Initialize SOC Therm
[0000.160] I> Task: Program NV master stream id
[0000.164] I> Task: Verify boot mode
[0000.170] I> Task: Alias fuses
[0000.173]  W> FUSE_ALIAS: Fuse alias on production fused part is not supported.

The Orin power supply did not change either.

BTW, there is no problem when the RT patch is not applied.

Do you have any information that can help me?

Thanks!
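
The MB1 banner in the capture above prints `rst_source` and `rst_level`, which appear to record the cause of the most recent reset. Their encoding is not explained in this thread, so the sketch below only extracts the raw values so they can be compared across reboots. It assumes the serial console is being captured to a text file; the name `serial.log` and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: the debug serial console is logged to a plain-text file.

# Print one "rst_source: ..., rst_level: ..." line per MB1 boot banner.
extract_reset_info() {
    # grep -o prints only the matching part, so the timestamps and any
    # trailing carriage-return residue (^M) are dropped.
    grep -Eo 'rst_source: 0x[0-9a-fA-F]+, rst_level: 0x[0-9a-fA-F]+' "$1"
}

# Tally how often each source/level pair occurs across all captured reboots.
summarize_resets() {
    extract_reset_info "$1" | sort | uniq -c
}
```

Running `summarize_resets serial.log` on a capture that spans several restarts shows whether every unexpected reboot reports the same pair of values, which may be useful detail to include in a report.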

Hi,

Can this be observed on DevKits, or only on custom boards?
How often do you see this happen? Would it happen only when running specific apps or not?

Not tested on DevKits. The issue has only occurred on our on-site equipment.
A unit has been running in-house for 8 days and no issues have been found.

This happens randomly.
Sometimes there are no issues in a day, and sometimes it restarts within 3 hours.

I don’t think so. Our apps are always running.

Do you have any information that can help me?

Thanks!

I’d suggest checking whether this can also be observed on a DevKit.
If it cannot be consistently reproduced, then there’s little we can do.
Or see if you can observe something after reboot:

This is the log printed by the debug serial port during the reboot. It shows no system abnormalities at the time of the restart.

No issues can be seen from the dmesg logs.

Do you have any methods you can provide to help me troubleshoot this problem?

Thanks!
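
Note that dmesg only shows the kernel ring buffer of the current boot, so messages from just before an unexpected reset are lost unless they were persisted somewhere. The sketch below checks two standard Linux places where they might survive: a persistent systemd journal and pstore. The function name is hypothetical, and whether pstore/ramoops is enabled in a given Jetson kernel is an assumption to verify on the target.

```shell
#!/usr/bin/env bash
# Report which sources of previous-boot kernel messages exist.
# The defaults are the standard systemd-journald and pstore locations;
# the arguments exist only so the check can be pointed at other paths.
check_prev_boot_sources() {
    local journal_dir="${1:-/var/log/journal}"
    local pstore_dir="${2:-/sys/fs/pstore}"

    # With journald's default Storage=auto, logs are kept across reboots
    # only when this directory exists.
    if [ -d "$journal_dir" ]; then
        echo "journal: persistent"
    else
        echo "journal: not persistent"
    fi

    # pstore exposes records saved by the previous kernel, if any.
    if [ -n "$(ls -A "$pstore_dir" 2>/dev/null)" ]; then
        echo "pstore: records present"
    else
        echo "pstore: no records"
    fi
}
```

If the journal is persistent, `journalctl -k -b -1` prints the kernel messages of the previous boot. On an abrupt reset the final messages may never have reached storage, so an empty result does not prove that nothing was logged.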

Then that’s all I can suggest for now.
Or do you have multiple units of the same custom carrier board for testing?

Yes. Whenever RT is used, the problem occurs.

I mean try different combinations of modules and carrier boards to see if it only happens on specific devices.
Or do something like this to enable more debug logging in the kernel:

If it cannot be consistently reproduced, and no abnormal log can be observed either, then there is really nothing we can do.
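
One general way to get more kernel output on the serial console (not necessarily what the reply above pointed to) is to raise the console log level. The sketch below reads the current level from the standard printk sysctl; the function names are hypothetical, and the optional path argument exists only so the functions can be run against a copy of the file.

```shell
#!/usr/bin/env bash
# Print the current console log level: the first field of the printk sysctl.
console_loglevel() {
    awk '{ print $1 }' "${1:-/proc/sys/kernel/printk}"
}

# Succeed if messages of every priority reach the console.
# The console prints messages whose level is below console_loglevel,
# so a value of 8 or more includes KERN_DEBUG (level 7).
console_shows_debug() {
    [ "$(console_loglevel "$1")" -ge 8 ]
}
```

At runtime the level can be raised with `sudo dmesg -n 8`. To keep it across reboots, the kernel parameter `ignore_loglevel` (or `loglevel=8`) can be added to the kernel command line; on JetPack this is typically the APPEND line in /boot/extlinux/extlinux.conf, which is an assumption to check against your own boot configuration.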

Thanks, I will give it a try.

Sorry, perhaps I did not make that quite clear.
The problem is not limited to a specific machine.
On site, all 30 devices experience the issue and it can be reproduced consistently, but no abnormal logs can be observed.

Thank you for your help.
