[R38.4] Jetson Thor random hang issue

Hi all

We’ve encountered an issue where our Jetson Thor system, paired with a custom carrier board, randomly freezes during the kernel loading phase, although the occurrence rate is low.

We haven’t found a reliable way to reproduce this problem yet; it mostly occurs shortly after the Linux kernel has finished loading.

We’re still trying to reproduce the issue using the Thor Devkit.

The problem is shown in several images below. While there are slight variations, most point to a problem with the memory controller.

What hardware design improvements or software modifications might be needed to resolve this?

We haven’t identified the true root cause yet.

Platform: Jetson Thor T5000 + Custom carrier Board
Jetpack Version: JP7.1
Images:





Best Regards
Jack Lan

This is a T5000, but which JetPack version? I see a lot of ixgbe messages in your pictures, and ixgbe is not part of JetPack 7.0 or 7.1.

How did you flash your board and what is in your Linux_for_Tegra/{board}.conf ?

Hi whitesscott

We used our self-developed carrier board paired with a Jetson Thor T5000, which has an E610 chip. We ported the mainline ixgbe driver to make it work.

Jetpack version: Jetpack 7.1

The flashing command is as follows:

sudo ./l4t_initrd_flash.sh jetson-agx-thor-devkit internal

Best Regards
Jack Lan

Hi jack_lan,

If you could post a text-based capture of what you show in the pictures, along with the log from flashing your board, that would be helpful for NVIDIA to review.

My apologies if you have already done this: if your custom board diverges from the Thor Developer Kit board, this guide may point to any modifications needed: Thor Adaptation and Bring Up guide

Hi jack_lan,

Please share the full serial console logs as a file instead of images of the logs.

It seems the current issue is specific to your custom carrier board.
I would suggest clarifying the difference between your custom carrier board and the devkit.
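
One way to keep a text log of every boot cycle, as requested above, is to pipe the serial console output through a small timestamping filter. The following is a minimal Python sketch, not an NVIDIA tool: the function name log_console and the UTC timestamp format are assumptions made for illustration, and it reads from any line stream rather than opening the serial port itself.

```python
import io
import time
from typing import Callable, TextIO


def log_console(src: TextIO, dst: TextIO,
                clock: Callable[[], float] = time.time) -> int:
    """Copy console lines from src to dst, prefixing each with a UTC host timestamp.

    Returns the number of lines written.
    """
    count = 0
    for line in src:
        text = line.rstrip("\r\n")  # serial consoles often end lines with CR LF
        stamp = time.strftime("%Y-%m-%d %H:%M:%SZ", time.gmtime(clock()))
        dst.write(f"[{stamp}] {text}\n")
        dst.flush()  # flush per line so a hang does not lose buffered output
        count += 1
    return count


# Self-check with an in-memory stream standing in for the serial console.
# The two input lines are the BPMP message quoted in this thread.
demo_in = io.StringIO(
    "FATAL ERROR [FILE=platform/drivers/emc/lib/t264/dvfs_training_sequence.c, "
    "ERR_UID=9999113]:\r\n"
    "DVFS sequence stalled calibration more than threshold!\n"
)
demo_out = io.StringIO()
demo_count = log_console(demo_in, demo_out, clock=lambda: 0.0)
print(demo_out.getvalue(), end="")
```

In a real cycle test, src would be sys.stdin fed by whatever serial console tool is in use, and dst a file opened in append mode, so that the last lines before a hang are already on disk.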

Hi whitesscott

The attached files contain the text transcribed from the images of the problem.

The flashing log may not be available until Monday, as the system is currently undergoing cycle testing.

boot_issue.txt (3.5 KB)
camera_issue.txt (4.6 KB)
kernel_boot_fail.txt (4.8 KB)
開機報錯.txt (4.4 KB)
開機過程error.txt (4.2 KB)

Best Regards
Jack Lan

Hi Kevin

The log from the time the problem occurred is no longer available. For now, we only have the information recorded in the screenshots; we can provide a full log if the problem recurs.

Regarding the hardware differences between our custom board and the devkit, could you please specify which differences you need us to describe, for example the power supply circuit design?

We cannot post the full schematics on the forum because they are confidential.

Best Regards
Jack Lan

Hi jack_lan,

Referencing boot_issue.txt

[ 40.167599] [drm:nv_drm_dev_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00020000] Failed to allocate NvKmsKapiDevice
[ 40.175638] [drm:nv_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00020000] Failed to load device

When you prepared your JetPack flashing environment, after extracting Jetson_Linux_R38.4.0_aarch64.tbz2 and Tegra_Linux_Sample-Root-Filesystem_R38.4.0_aarch64.tbz2 with tar xf, did you then run the following?

cd Linux_for_Tegra
sudo ./apply_binaries.sh --openrm

If not, that might be the problem.

Have you referred to Jetson Thor Adaptation and Bring-Up — NVIDIA Jetson Linux Developer Guide when you are developing the custom carrier board?

Please check if whitesscott’s suggestion helps your case.

Hi Kevin

sudo apply_binaries.sh --openrm

We confirmed that this action was executed; otherwise, even the desktop would not be displayed. We discovered this issue early in the development process.

Jetson Thor Adaptation and Bring-Up — NVIDIA Jetson Linux Developer Guide

Regarding the Developer Guide: I am not responsible for the carrier board’s hardware development, but our R&D team has consulted this document.

Alternatively, do the error messages we are currently seeing point to specific parts of the board design that we should examine more closely?

Best Regards
Jack Lan

The two nvidia-drm error lines can be ignored, as they are not the cause of the hang.

We would need the serial console log from when the issue happens for further debugging.
If the issue can be reproduced on the devkit, please also share the reproduction steps with us.

Hi jack_lan,

FATAL ERROR [FILE=platform/drivers/emc/lib/t264/dvfs_training_sequence.c, ERR_UID=9999113]:
DVFS sequence stalled calibration more than threshold!

There is a known issue in the UEFI GOP driver, which will be fixed in the next JetPack release.

Did you have an external display connected when the issue occurred?
If so, please disconnect the display and run the test again to confirm whether the behavior persists.
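
To check a batch of cycle-test logs for this BPMP signature, a small scanner can help. The sketch below is a hypothetical helper (the names FatalError and find_fatal_errors are not from any NVIDIA tool); the pattern is based only on the message format quoted in this thread and may not match other BPMP messages.

```python
import re
from typing import Iterable, List, NamedTuple


class FatalError(NamedTuple):
    line_no: int       # 1-based line number within the log
    source_file: str   # value of FILE= in the message
    err_uid: int       # value of ERR_UID= in the message


# Pattern derived from the single FATAL ERROR line quoted in this thread.
FATAL_RE = re.compile(
    r"FATAL ERROR \[FILE=(?P<file>[^,\]]+), ERR_UID=(?P<uid>\d+)\]"
)


def find_fatal_errors(lines: Iterable[str]) -> List[FatalError]:
    """Return every FATAL ERROR record found in the given log lines."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        match = FATAL_RE.search(line)
        if match:
            hits.append(
                FatalError(line_no, match.group("file"), int(match.group("uid")))
            )
    return hits


# Sample input built from lines quoted earlier in this thread.
sample = [
    "[ 40.175638] [drm:nv_register_drm_device [nvidia_drm]] *ERROR* "
    "[nvidia-drm] [GPU ID 0x00020000] Failed to load device",
    "FATAL ERROR [FILE=platform/drivers/emc/lib/t264/dvfs_training_sequence.c, "
    "ERR_UID=9999113]:",
    "DVFS sequence stalled calibration more than threshold!",
]
hits = find_fatal_errors(sample)
for hit in hits:
    print(hit.line_no, hit.source_file, hit.err_uid)
```

Running the same function over logs captured with and without a display attached would show whether the DVFS training error only appears in the display-connected runs.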

Hi KevinFFF

This is valuable information; we do indeed experience occasional hangs during the UEFI phase, and the HDMI cable is always connected to the monitor during testing.

This issue happened again during yesterday’s cycle test.

Below are images of the problem and a text file of the debug port messages captured when it occurred.

output.txt (12.1 KB)

Best Regards
Jack Lan

FATAL ERROR [FILE=platform/drivers/emc/lib/t264/dvfs_training_sequence.c, ERR_UID=9999113]: DVFS sequence stalled calibration more than threshold! 

I do see the same error reported from BPMP in the latest log you shared.

Please check whether the issue still occurs when the display is disconnected.

Hi KevinFFF

Cycle testing was performed without any display, and no hang-ups were observed this week.

Since JP7.2 is about to be released, I will continue to monitor subsequent Jetpack versions to see if the same issue occurs.

Thanks!
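
A week without hangs is encouraging, but with a low occurrence rate it helps to estimate how strong that evidence is. As a rough sketch, assuming each boot cycle is independent and has the same hang probability p, observing zero hangs in N cycles gives an upper bound on p at a chosen confidence level by solving (1 - p)^N = 1 - confidence. The function names and the cycle counts below are hypothetical; the thread does not state how many cycles were run.

```python
import math


def zero_failure_upper_bound(cycles: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-cycle hang probability after `cycles` clean runs.

    Assumes independent cycles with a constant hang probability.
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / cycles)


def cycles_needed(max_rate: float, confidence: float = 0.95) -> int:
    """Clean cycles required to bound the hang probability below max_rate."""
    if not 0.0 < max_rate < 1.0:
        raise ValueError("max_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_rate))


# Hypothetical numbers, for illustration only.
print(f"{zero_failure_upper_bound(1000):.5f}")  # bound after 1000 clean cycles
print(cycles_needed(0.01))                       # cycles to bound the rate below 1%
```

If the hang rate seen earlier with the display connected is known, comparing it against this bound gives a sense of whether the display-disconnected run was long enough to be conclusive.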

Yes, please re-verify this with the JetPack 7.2 release as we do not distribute the standalone GOP binary via the forum.