Hi,
I am using a Jetson Orin NX with JetPack 6.2.1 (L4T R36.4.4) and I want to enable the watchdog in UEFI to handle some exception cases. However, the watchdog does not seem to work well in UEFI.
I created a simple test as below:
I add a timer in UEFI, wait for a while, and then simulate a crash (see below)
EFI_EVENT CrashEvent;
EFI_STATUS CrashStatus = gBS->CreateEvent (
  EVT_TIMER | EVT_NOTIFY_SIGNAL,
  TPL_CALLBACK,
  DelayedCrashCallback,
  NULL,
  &CrashEvent
  );
if (!EFI_ERROR (CrashStatus)) {
  gBS->SetTimer (CrashEvent, TimerRelative, 10 * 1000 * 1000 * 10); // 10 s = 10^8 * 100 ns
  DEBUG ((DEBUG_ERROR, "[DXE] Crash timer armed: will crash after 10s\r\n"));
  {
    CHAR8 Msg[128];
    UINTN Len = AsciiSPrint (Msg, sizeof(Msg),
                  "[DXE] Crash timer armed: will crash after 10s\n\r");
    if (Len > 0) {
      SerialPortWrite ((UINT8 *)Msg, Len);
    }
  }
}
STATIC
VOID
EFIAPI
DelayedCrashCallback (
  IN EFI_EVENT Event,
  IN VOID *Context
  )
{
  DEBUG ((DEBUG_ERROR, "Simulating crash before kernel load...\n"));
  {
    CHAR8 Msg[128];
    UINTN Len;
    Len = AsciiSPrint (Msg, sizeof(Msg),
            "Simulating crash before kernel load...\n\r");
    if (Len > 0) {
      SerialPortWrite ((UINT8 *)Msg, Len);
    }
  }
  CauseNullWriteCrash();
}
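For reference, SetTimer() takes its trigger time in 100 ns units, so a 10 s delay corresponds to 10^8 ticks. Below is a minimal standalone sketch of that arithmetic in plain C. The helper name SecondsTo100ns and the constant TICKS_100NS_PER_SECOND are hypothetical and only for illustration; inside EDK2, the EFI_TIMER_PERIOD_SECONDS() macro from UefiLib.h performs the same conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 100 ns ticks in one second (UEFI timer events count in 100 ns units). */
#define TICKS_100NS_PER_SECOND 10000000ULL

/* Hypothetical helper: convert whole seconds to 100 ns ticks. */
static uint64_t SecondsTo100ns(uint64_t Seconds)
{
  /* Guard against overflow of the 64-bit product. */
  assert(Seconds <= UINT64_MAX / TICKS_100NS_PER_SECOND);
  return Seconds * TICKS_100NS_PER_SECOND;
}
```

With this helper, SecondsTo100ns(10) gives the same value as the literal 10 * 1000 * 1000 * 10 passed to SetTimer() in the test.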
When I simulate the crash with a while loop (see below)
STATIC
VOID
CauseNullWriteCrash(VOID)
{
while (1) { }
}
The system reboots within one minute.
However, when I simulate the crash by writing to an invalid address (see below)
STATIC
VOID
CauseNullWriteCrash(VOID)
{
volatile UINT32 *ptr = NULL;
*ptr = 0xDEAD;
}
The system never reboots. Does the Jetson watchdog have a bug, or what am I missing?
I am using UEFI source version 36.4.4: NVIDIA/edk2-nvidia: NVIDIA EDK2 platform support
Thanks
*** Please note that this reply is generated by LLM automatically ***
I’ll provide a comprehensive answer to your question.
Firstly, I’d like to clarify that the issue you’re facing is related to the watchdog functionality in UEFI on the Jetson Orin NX platform. You’ve tried to simulate a crash situation using two different methods: a while loop and writing to an unknown address. The watchdog seems to work as expected with the while loop, but not with the write to unknown address method.
To understand the potential cause of this issue, let’s dive into the details of the watchdog functionality on the Jetson platform.
The Jetson platform uses a watchdog timer to detect and recover from system crashes or hangs. The watchdog timer is typically configured to trigger a system reset if it is not periodically reset by the system software.
In the context of UEFI, the watchdog timer is usually controlled by the UEFI firmware. The firmware exposes it to boot-time code, such as drivers, applications, and OS loaders, through the SetWatchdogTimer() boot service.
Now, let’s examine the two methods you used to simulate a crash situation:
- While loop: In this case, the system is stuck in an infinite loop, which prevents the watchdog timer from being reset. As a result, the watchdog timer expires, and the system reboots. This behavior is expected and indicates that the watchdog timer is functioning correctly.
- Write to unknown address: In this case, the system attempts to write to an unknown address, which causes a crash. However, the watchdog timer does not seem to trigger a system reset. This behavior is unexpected and may indicate an issue with the watchdog functionality.
There are a few possible explanations for this discrepancy:
- Watchdog configuration: It’s possible that the watchdog timer is not configured correctly or is not enabled for the specific crash scenario you’re simulating. You may need to review the UEFI configuration and ensure that the watchdog timer is properly set up.
- Crash handling: The UEFI firmware may have different crash handling mechanisms for different types of crashes. It’s possible that the write to unknown address crash is not handled correctly by the UEFI firmware, which prevents the watchdog timer from triggering a system reset.
- Hardware or firmware issue: There may be a hardware or firmware issue that prevents the watchdog timer from functioning correctly in certain scenarios.
To further troubleshoot this issue, you may want to try the following:
- Review the UEFI configuration and ensure that the watchdog timer is properly set up.
- Use a different method to simulate a crash, such as executing an undefined instruction. The null pointer write is already covered by your test, and integer divide-by-zero does not raise an exception on AArch64.
- Check the system logs to see if there are any error messages or clues that can help you understand why the watchdog timer is not triggering a system reset.
- Consider updating the UEFI firmware to the latest version to ensure that you have the latest bug fixes and features.
In summary, the issue you’re facing may be related to the watchdog configuration, crash handling, or a hardware or firmware issue. Further troubleshooting is needed to determine the root cause of the problem.
Hi duc.tran1,
Are you using the devkit or custom board for Orin NX?
Could you try to customize the DefaultExceptionHandler() to trigger the reset?
Hi KevinFFF,
I’m using a custom board based on Orin NX.
In the DefaultExceptionHandler() function, I can reset the board when the CPU hangs by calling ResetSystem() like this:
MicroSecondDelay (2000000);
// Reset the system immediately after exception
if (gRT != NULL) {
DEBUG ((DEBUG_ERROR, "Triggering system reset due to UEFI exception...\n"));
gRT->ResetSystem (EfiResetCold, EFI_SUCCESS, 0, NULL);
}
CpuDeadLoop ();
However, what I actually want is for the board to reset using the hardware watchdog (Watchdog DXE) when the CPU hangs, instead of calling ResetSystem() directly.
Is there any way in the UEFI DXE phase to have the hardware watchdog perform the reset in this case?
Could you share the full device tree for further check?
Hi,
I’m not very clear on the UEFI setup. Could you clarify the exact path to the device tree file for me to check?
UEFI-DTB should be the same as the kernel-DTB by default.
TBCDTB_FILE="${DTB_FILE}";
Please simply run the following command after boot up and share extracted_proc.dts for further check.
$ sudo dtc -I fs -O dts -o extracted_proc.dts /proc/device-tree
Hi.
Below is the result when running the command
Result.zip (48.2 KB)
But that’s the Linux DTB, and Watchdog DXE runs in UEFI before Linux is loaded. So how are they related?
Could you try enabling the following node to check if it would work?
watchdog@2190000 {
nvidia,timer-index = <0x07>;
nvidia,extend-watchdog-suspend;
nvidia,enable-on-init;
interrupts = <0x00 0x07 0x04 0x00 0x08 0x04>;
timeout-sec = <0x78>;
compatible = "nvidia,tegra-wdt-t234";
nvidia,disable-debug-reset;
- status = "disabled";
+ status = "okay";
reg = <0x00 0x2190000 0x00 0x10000 0x00 0x2090000 0x00 0x10000 0x00 0x2080000 0x00 0x10000>;
nvidia,watchdog-index = <0x00>;
};
As I mentioned before, UEFI-DTB and Kernel-DTB are similar by default. UEFI would also check the device tree for those functions.
You can also try to enable ARM_WATCHDOG config in UEFI.
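For reference, the reg property of the watchdog node above can be decoded with the small plain C sketch below. It assumes the parent bus uses #address-cells = <2> and #size-cells = <2>, which is not shown in the snippet, so treat that as an assumption. The type and function names (REG_REGION, CellsToU64, DecodeReg) are hypothetical. Separately, timeout-sec = <0x78> is 120 seconds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One decoded <address size> pair from a device-tree reg property. */
typedef struct {
  uint64_t Base;
  uint64_t Size;
} REG_REGION;

/* Combine two 32-bit device-tree cells (high, low) into one 64-bit value. */
static uint64_t CellsToU64(uint32_t High, uint32_t Low)
{
  return ((uint64_t)High << 32) | Low;
}

/* Decode cells laid out as <addr_hi addr_lo size_hi size_lo> groups.
   Assumes #address-cells = 2 and #size-cells = 2.
   Returns the number of regions written to Out. */
static size_t DecodeReg(const uint32_t *Cells, size_t CellCount,
                        REG_REGION *Out, size_t MaxOut)
{
  size_t Count = 0;

  assert(Cells != NULL && Out != NULL);
  for (size_t i = 0; i + 4 <= CellCount && Count < MaxOut; i += 4) {
    Out[Count].Base = CellsToU64(Cells[i], Cells[i + 1]);
    Out[Count].Size = CellsToU64(Cells[i + 2], Cells[i + 3]);
    Count++;
  }
  return Count;
}

/* The reg cells copied from the watchdog@2190000 node. */
static const uint32_t WatchdogRegCells[12] = {
  0x00, 0x2190000, 0x00, 0x10000,
  0x00, 0x2090000, 0x00, 0x10000,
  0x00, 0x2080000, 0x00, 0x10000
};
```

Under that assumption, the property describes three 64 KiB regions based at 0x2190000, 0x2090000, and 0x2080000.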
Hi KevinFFF,
I modified my Linux DTB for Jetson and changed status = "okay";.
In UEFI (/nvidia-config/Jetson/.config) I enabled:
CONFIG_ARM_WATCHDOG=y
CONFIG_ARM_WATCHDOG_CONTROL_BASE=0x02190000
CONFIG_ARM_WATCHDOG_REFRESH_BASE=0x02090000
CONFIG_ARM_WATCHDOG_INTERRUPT=7
The build succeeded and the UEFI uses GenericWatchdogDxe.c. I added DEBUG prints in GenericWatchdogEntry() to print the values of: ControlBase, RefreshBase, Interrupt (see below)
ASSERT_EFI_ERROR (Status);
DEBUG((DEBUG_ERROR, "GenericWatchdogDxe: ControlBase=0x%llx RefreshBase=0x%llx Interrupt=%u\n",
(unsigned long long)PcdGet64(PcdGenericWatchdogControlBase),
(unsigned long long)PcdGet64(PcdGenericWatchdogRefreshBase),
(unsigned)PcdGet32(PcdGenericWatchdogEl2IntrNum)));
{
CHAR8 Msg[128];
UINTN Len;
Len = AsciiSPrint(Msg, sizeof(Msg),
"GenericWatchdogDxe: ControlBase=0x%llx RefreshBase=0x%llx Interrupt=%u\r\n",
(unsigned long long)PcdGet64(PcdGenericWatchdogControlBase),
(unsigned long long)PcdGet64(PcdGenericWatchdogRefreshBase),
(unsigned)PcdGet32(PcdGenericWatchdogEl2IntrNum));
if (Len > 0) {
SerialPortWrite((UINT8 *)Msg, Len);
}
}
/* Make sure the Watchdog Timer Architectural Protocol has not been installed
in the system yet.
This will avoid conflicts with the universal watchdog */
ASSERT_PROTOCOL_ALREADY_INSTALLED (NULL, &gEfiWatchdogTimerArchProtocolGuid);
At boot, the printed values match what I configured in .config.
However, when I simulate a CPU crash by writing to an invalid memory address, the watchdog does not trigger a system reboot.
Am I missing something? Are there any additional steps required to make the UEFI watchdog trigger a reboot when the CPU crashes?
Thanks
Could you share the full serial console log as file here with debug prints enabled for further check?
Hi,
I’ve attached the full serial console log with DEBUG prints enabled as requested.
Jetson_log.txt (16.3 KB)
Hi, KevinFFF
Just following up on the Jetson_log.txt I shared earlier. Are there any updates on this issue?
Thanks
There is no update from you for a period, assuming this is not an issue anymore.
Hence, we are closing this topic. If need further support, please open a new one.
Thanks
The current issue may be that GenericWatchdogDxe.c implements only a software watchdog, so it is not triggered once execution is in DefaultExceptionHandler().
A hardware-based watchdog may not be supported.
May I know your use case for this?