We’ve encountered an issue where our Jetson Thor system, paired with a custom carrier board, randomly freezes during the kernel loading phase, although the occurrence rate is low.
We haven’t found a stable way to reproduce this problem yet; it mostly occurs shortly after the Linux kernel loading process is complete.
We’re still trying to reproduce the issue using the Thor Devkit.
The problem is shown in several images below. While there are slight variations, most point to a problem with the memory controller.
What hardware design improvements or software modifications might be needed to resolve this?
If you could post a text based capture of what you show in the pictures, that could helpful for Nvidia to review. And the log from the flashing of your board.
My apologies if you already did this: if your custom board diverged from the Thor Developer kit board, this guide may point to any modifications needed. Thor Adaptation and Bring Up guide
Please share the full serial console logs as file instead of the image with logs.
It seems the current issue is specific to your custom carrier board.
I would suggest clarifying the difference between your custom carrier board and the devkit.
The log from the time the problem occurred is no longer available; currently, only the information recorded in the screenshots is available, or it may be provided again if the problem recurs.
Regarding the hardware differences between custom and Devkit systems, could you please specify the differences we need to provide, such as the power supply design circuitry?
We cannot post all the circuit diagrams on the forum; it’s not feasible due to trade confidentiality.
[ 40.167599] [drm:nv_drm_dev_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00020000] Failed to allocate NvKmsKapiDevice
[ 40.175638] [drm:nv_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00020000] Failed to load device
When you prepared your flashing Jetpack instance, after
tar xf Jetson_Linux_R38.4.0_aarch64.tbz2 and Tegra_Linux_Sample-Root-Filesystem_R38.4.0_aarch64.tbz2
did you then
cd Linux_for_Tegra
sudo apply_binaries.sh --openrm
We confirmed that this action was executed; otherwise, even the desktop would not be displayed. We discovered this issue early in the development process.
Regarding the Development Guide, although I’m not responsible for the Carrier Board’s hardware development, our R&D team has all consulted this document.
Or, could the error messages we’re currently receiving provide direction for a closer examination of specific parts of the wiring?
These 2 lines can be ignored as they are not the reason for the hang up issue.
We would need the serial console log when the issue happens for further debug.
If the issue can be reproduced on the devkit, please also share the reproduce steps with us.
FATAL ERROR [FILE=platform/drivers/emc/lib/t264/dvfs_training_sequence.c, ERR_UID=9999113]:
DVFS sequence stalled calibration more than threshold!
We have a known issue in GOP driver of UEFI which will be fixed in the next Jetpack release.
Did you have an external display connected when the issue occurred?
If so, please disconnect the display and run the test again to confirm if the behavior persists
This is valuable information; we do indeed experience occasional hangs during the UEFI phase, and the HDMI cable is always connected to the monitor during testing.
This issue happened again during yesterday’s cycle test.
Below are images of the problem and the debug port message.txt file related to the issue.