Jetson AGX Xavier module may hang during load testing

Hi, we are using 7 Jetson AGX Xavier modules on our PCB, with the modules connected to each other via PCIe and Ethernet.
We are running JetPack_4.2-L4T_R32.1 with some modifications to the PCIe driver and the Ethernet driver; all other components are untouched.
Since we cannot migrate to the latest JetPack in the coming few months, we cannot check whether this issue is gone in the new JetPack SDK.
So we want to know how to debug this issue. Are there any internal debug methods to check the ARMv8 core state?
Please give us some advice and guidance, thanks.

The previous thread as a reference:

By the way, when Linux hangs, we can still interact with the BPMP using jetson-demux, so we can confirm that the BPMP is alive while the ARMv8 cores are handling no interrupts at all.

BPMP console:
] threadstats
thread stats:
total idle time: 2905547000
total busy time: 289470514
reschedules: 5340327
context_switches: 3334774
preempts: 2005619
yields: 10
interrupts: 3730089
timer interrupts: 64268
timers: 64267
]
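
One way to back up the "BPMP is alive" observation is to capture `threadstats` twice, a few seconds apart, and check that the counters advance. The following is a minimal Python sketch, assuming the console output has been saved as text in the format shown above. The function names `parse_threadstats`, `counters_advanced` and `busy_fraction` are illustrative and not part of any NVIDIA tool.

```python
import re


def parse_threadstats(text):
    """Parse 'name: value' lines from a saved BPMP threadstats dump into a dict."""
    stats = {}
    for line in text.splitlines():
        # Only lines of the form "<name>: <integer>" are counters;
        # the "thread stats:" header and the "]" prompt are skipped.
        match = re.match(r"^\s*([A-Za-z_ ]+):\s*(\d+)\s*$", line)
        if match:
            stats[match.group(1).strip()] = int(match.group(2))
    return stats


def counters_advanced(before, after, keys=("interrupts", "context_switches")):
    """Return True if every selected counter grew between two snapshots."""
    return all(after.get(key, 0) > before.get(key, 0) for key in keys)


def busy_fraction(stats):
    """Fraction of accounted time the BPMP spent busy (unitless ratio)."""
    busy = stats["total busy time"]
    idle = stats["total idle time"]
    return busy / (busy + idle)
```

If `counters_advanced` returns True for two dumps taken while Linux is hung, the BPMP is still scheduling threads and taking interrupts, which matches what is described above.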

Hi ztxfr,

Can you share the current UART log so that we can see whether anything looks suspicious?
Note that, as of now, the debug UART prints logs from the CCPLEX and from the R5 processors such as the BPMP in one terminal.
Make sure that these options are enabled in your kernel config:

CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_LOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU=y
CONFIG_HARDLOCKUP_DETECTOR=y

and run the test after setting the log level to max:
dmesg -n 7
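
The config check above can be automated. Below is a minimal Python sketch that reports which of the four options are not set to `y` in a kernel config text. The helper names `missing_options` and `read_running_config` are hypothetical, and reading `/proc/config.gz` assumes the running kernel was built with `CONFIG_IKCONFIG_PROC`; otherwise pass the contents of the `.config` file used for the build.

```python
import gzip

# The options listed above.
REQUIRED = (
    "CONFIG_SOFTLOCKUP_DETECTOR",
    "CONFIG_LOCKUP_DETECTOR",
    "CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU",
    "CONFIG_HARDLOCKUP_DETECTOR",
)


def missing_options(config_text, required=REQUIRED):
    """Return the required options that are not set to 'y' in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return [option for option in required if option not in enabled]


def read_running_config(path="/proc/config.gz"):
    """Read the running kernel's config (assumes CONFIG_IKCONFIG_PROC is set)."""
    with gzip.open(path, "rt") as handle:
        return handle.read()
```

On the target, `missing_options(read_running_config())` returning an empty list means all four detectors are built in.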

thanks

The kernel configs you listed above are already turned on by default, and the log level is set to max too.
As noted in the previous thread, when the kernel hangs the ARMv8 cores handle no interrupts at all: we added a heartbeat LED in the dts that is driven from a kernel timer routine, and the LED stops blinking when the ARMv8 hangs. So we can confirm that the ARMv8 is not even servicing interrupts.
The old UART log can be found in the previous thread. We are working on reproducing this issue and will share the newest UART log as soon as possible.

Have you reproduced the issue? Can any log be provided?

Hi kayccc:
This issue has not reproduced these days, and we have added some hardware components to monitor the power supply.
Once the issue reproduces, we will share the latest log.
Thanks.

Hi,
Please understand that it is difficult to know what is going on without any log. Please enable the UART log, reproduce the issue, and attach the log for reference.
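
Once a UART capture is available, a quick scan for lockup-related messages can help narrow down where to look. This is a rough Python sketch; the search patterns are assumptions based on common Linux kernel message wording, which varies between kernel versions, and `scan_log` is an illustrative name.

```python
import re

# Assumed substrings of common kernel lockup/stall/panic messages.
PATTERNS = {
    "soft lockup": re.compile(r"soft lockup", re.IGNORECASE),
    "hard lockup": re.compile(r"hard lockup", re.IGNORECASE),
    "rcu stall": re.compile(r"rcu.*stall", re.IGNORECASE),
    "kernel panic": re.compile(r"kernel panic", re.IGNORECASE),
}


def scan_log(lines):
    """Return (line_number, label, text) for each line matching a pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.rstrip()))
    return hits
```

For example, with the capture saved to a file (the name `uart.log` is hypothetical), `scan_log(open("uart.log", errors="replace"))` lists the matching lines. An empty result does not rule out a hang, since a fully stalled CPU may print nothing.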

Also, if the modules are not permanently attached to the PCB, you can try plugging them into the default board and re-flashing through SDK Manager to make sure the modules are good.