ORIN 64G - nvgpu error

On a custom board, an ORIN 64GB module hits an nvgpu error at boot, while an ORIN 32GB module running the same SW version boots normally.

[ 19.140284] nvgpu: 17000000.gpu nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ nvgpu_pmu_wait_fw_ack_status+0xcc/0x160 [nvgpu]
[ 19.140330] nvgpu: 17000000.gpu pmu_wait_message_cond:664 [ERR] PMU wait timeout expired.
[ 19.140349] nvgpu: 17000000.gpu nvgpu_pmu_lsfm_bootstrap_ls_falcon:128 [ERR] LSF Load failed
[ 19.140365] nvgpu: 17000000.gpu nvgpu_gr_falcon_load_secure_ctxsw_ucode:718 [ERR] Unable to recover GR falcon
[ 19.140377] nvgpu: 17000000.gpu nvgpu_gr_falcon_init_ctxsw:156 [ERR] fail
[ 19.140419] nvgpu: 17000000.gpu nvgpu_cic_mon_report_err_safety_services:97 [ERR] Error reporting is not supported in this platform
[ 19.140426] nvgpu: 17000000.gpu gr_init_ctxsw_falcon_support:857 [ERR] FECS context switch init error

ORIN_64G_gpu_err.log (106.9 KB)
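
When comparing the two attached logs, it can help to pull out only the nvgpu [ERR] lines so the failure chain is easy to see: a PMU timeout, then the LSF load failure, then the GR falcon and FECS init failures. Below is a minimal Python sketch. It assumes the log uses the dmesg layout shown above (`[ seconds] nvgpu: <device> <function>:<line> [ERR] <message>`). A serial-console capture with extra prefixes would need the pattern adjusted.

```python
import re

# Matches nvgpu kernel lines such as:
# [ 19.140349] nvgpu: 17000000.gpu nvgpu_pmu_lsfm_bootstrap_ls_falcon:128 [ERR] LSF Load failed
NVGPU_ERR = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+nvgpu:\s+\S+\s+"
    r"(?P<func>\w+):(?P<line>\d+)\s+\[ERR\]\s+(?P<msg>.*)$"
)


def nvgpu_errors(log_text):
    """Return (timestamp, function, message) for every nvgpu [ERR] line, in log order."""
    errors = []
    for raw in log_text.splitlines():
        m = NVGPU_ERR.match(raw.strip())
        if m:
            errors.append((float(m["ts"]), m["func"], m["msg"].strip()))
    return errors
```

For example, `nvgpu_errors(open("ORIN_64G_gpu_err.log", errors="replace").read())` should list the chain from the timeout down to `gr_init_ctxsw_falcon_support`. Running the same call on the 32GB log should return an empty list if that boot was clean.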

With the ORIN 32GB module, the same setup boots without errors:

ORIN32_OK1.txt (77.4 KB)

*** Please note that this reply was generated automatically by an LLM ***

Based on the provided context, it appears that you are experiencing a GPU error on a custom board using a Jetson AGX Orin with a 64GB drive, while the same software version works normally on an Orin 32GB drive.

To troubleshoot this issue, let’s break down the possible causes:

  1. Hardware differences: The most obvious difference between the two setups is the drive size. However, it’s unlikely that the drive size itself is causing the GPU error. Instead, it’s possible that the custom board or the 64GB drive is introducing some hardware-specific issue that’s not present on the 32GB drive.
  2. Software configuration: Although you mentioned that the same software version is used on both drives, it’s possible that there are some configuration differences or dependencies that are not immediately apparent. For example, the kernel device tree, boot scripts, or other configuration files might be different between the two setups.
  3. GPU-related issues: The GPU error log (ORIN_64G_gpu_err.log) might provide some clues about the nature of the issue. It’s possible that the GPU is experiencing some kind of resource constraint, thermal issue, or other problem that’s not present on the 32GB drive.

To further investigate, I would recommend:

  1. Comparing the boot logs: Compare the boot logs from both the 64GB and 32GB drives to see if there are any differences in the boot process, kernel messages, or other system logs.
  2. Verifying the kernel device tree: Check the kernel device tree configuration on both drives to ensure that it’s correct and consistent.
  3. Running hardware validation tests: Run hardware validation tests, such as NvTest (NVIDIA Tegra MODS), to verify that the hardware is functioning correctly on both drives.
  4. Checking the GPU configuration: Verify that the GPU configuration is correct and consistent on both drives. This might involve checking the GPU clock speeds, memory allocation, or other settings.
  5. Consulting the GPU error log: Take a closer look at the GPU error log (ORIN_64G_gpu_err.log) to see if it provides any specific error messages or clues about the nature of the issue.
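
For step 2, one quick check on a running Linux system is to read the root node's `model` and `compatible` properties. The kernel exposes these under /proc/device-tree as NUL-separated strings, and they show which device tree was actually booted. Below is a minimal Python sketch. This thread does not say which exact strings the Orin 64GB and 32GB DTBs carry, so treat the output as something to compare between the two boards and against the DTB you intended to boot.

```python
from pathlib import Path

# Linux exposes the booted device tree here (a link to /sys/firmware/devicetree/base).
DT_ROOT = Path("/proc/device-tree")


def split_dt_strings(raw):
    """Split a device-tree string-list property (NUL-separated, NUL-terminated) into a list."""
    return [s.decode("ascii", errors="replace") for s in raw.split(b"\0") if s]


def running_dt_identity(root=DT_ROOT):
    """Return (model, compatible list) of the device tree the kernel booted with."""
    model = split_dt_strings((root / "model").read_bytes())
    compatible = split_dt_strings((root / "compatible").read_bytes())
    return (model[0] if model else ""), compatible
```

Running `print(running_dt_identity())` on both boards is a cheap way to spot a DTB that belongs to a different module or carrier board.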

By following these steps, you should be able to gather more information about the issue and potentially identify the root cause of the problem.



Cause found: an extlinux.conf setting error.
The error was caused by selecting the wrong DTB file.
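
Because the root cause was the DTB selected in extlinux.conf, a quick sanity check is to see which FDT entry the DEFAULT label actually uses. Below is a minimal Python sketch. It assumes the usual L4T layout of /boot/extlinux/extlinux.conf: a DEFAULT line naming a LABEL block, with an optional FDT line inside that block. The DTB file names in the test are hypothetical. The correct DTB for each module should come from the board's BSP configuration.

```python
def default_fdt(conf_text):
    """Return the FDT path used by the DEFAULT label of an extlinux.conf, or None."""
    default = None
    labels = {}      # label name -> {KEY: value} for lines inside that LABEL block
    current = None   # dict of the LABEL block currently being read
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        key = parts[0].upper()
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "DEFAULT":
            default = value
        elif key == "LABEL":
            current = labels.setdefault(value, {})
        elif current is not None:
            # e.g. LINUX, FDT, INITRD, APPEND ("MENU LABEL ..." is stored under MENU)
            current[key] = value
    if default is None:
        return None
    return labels.get(default, {}).get("FDT")
```

For example, `default_fdt(open("/boot/extlinux/extlinux.conf").read())` returns the DTB path the boot entry points at. The function returns None when the default label has no FDT line.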
