GPU not working

We have an AGX Orin (32GB) module that, after running our business workload for a period of time, can no longer use the GPU properly. We are using JetPack 5.1.2. The relevant observations are as follows (a combined quick-check sketch follows the dmesg log):

  1. Calling the CUDA runtime from a small test program, test.cu:
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    /* On a healthy module this should report 1 (the integrated GPU). */
    cudaGetDeviceCount(&count);
    printf("count = %d\n", count);
    return 0;
}

nvcc test.cu -o test.out
./test.out
[screenshot: test program output]
  2. Checking the /dev/nvgpu device nodes:

[screenshot: /dev/nvgpu node listing]
  3. GPU-related failures in dmesg:

[4396822.611090] nvgpu: 17000000.ga10b gv100_pmu_lsfm_init_acr_wpr_region:53   [ERR]  Failed to execute RPC status=0xfffffff4
[4396822.622334] nvgpu: 17000000.ga10b nvgpu_pmu_lsfm_bootstrap_ls_falcon:107  [ERR]  LSF init WPR region failed
[4396822.622628] nvgpu: 17000000.ga10b nvgpu_pmu_lsfm_bootstrap_ls_falcon:128  [ERR]  LSF Load failed
[4396822.622882] nvgpu: 17000000.ga10b nvgpu_gr_falcon_load_secure_ctxsw_ucode:714  [ERR]  Unable to recover GR falcon
[4396822.623162] nvgpu: 17000000.ga10b        nvgpu_gr_falcon_init_ctxsw:159  [ERR]  fail
[4396822.623385] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:92   [ERR]  Error reporting is not supported in this platform
[4396822.623718] nvgpu: 17000000.ga10b      gr_init_ctxsw_falcon_support:833  [ERR]  FECS context switch init error
[4396822.624002] nvgpu: 17000000.ga10b            nvgpu_finalize_poweron:1010 [ERR]  Failed initialization for: g->ops.gr.gr_init_support
[4396822.650606] nvgpu: 17000000.ga10b                 gk20a_power_write:127  [ERR]  power_node_write failed at busy
[4396833.710180] nvgpu: 17000000.ga10b                nvgpu_pmu_cmd_post:591  [ERR]  FBQ cmd setup failed
[4396833.710484] nvgpu: 17000000.ga10b             nvgpu_pmu_rpc_execute:713  [ERR]  Failed to execute RPC status=0xfffffff4, func=0x0
[4396833.710817] nvgpu: 17000000.ga10b gv100_pmu_lsfm_init_acr_wpr_region:53   [ERR]  Failed to execute RPC status=0xfffffff4
[4396833.722072] nvgpu: 17000000.ga10b nvgpu_pmu_lsfm_bootstrap_ls_falcon:107  [ERR]  LSF init WPR region failed
[4396833.722377] nvgpu: 17000000.ga10b nvgpu_pmu_lsfm_bootstrap_ls_falcon:128  [ERR]  LSF Load failed
[4396833.722624] nvgpu: 17000000.ga10b nvgpu_gr_falcon_load_secure_ctxsw_ucode:714  [ERR]  Unable to recover GR falcon
[4396833.722910] nvgpu: 17000000.ga10b        nvgpu_gr_falcon_init_ctxsw:159  [ERR]  fail
[4396833.723122] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:92   [ERR]  Error reporting is not supported in this platform
[4396833.723459] nvgpu: 17000000.ga10b      gr_init_ctxsw_falcon_support:833  [ERR]  FECS context switch init error
[4396833.723726] nvgpu: 17000000.ga10b            nvgpu_finalize_poweron:1010 [ERR]  Failed initialization for: g->ops.gr.gr_init_support
[4396833.750693] nvgpu: 17000000.ga10b                 gk20a_power_write:127  [ERR]  power_node_write failed at busy
[4396835.626139] nvgpu: 17000000.ga10b                nvgpu_pmu_cmd_post:591  [ERR]  FBQ cmd setup failed
[4396835.626429] nvgpu: 17000000.ga10b             nvgpu_pmu_rpc_execute:713  [ERR]  Failed to execute RPC status=0xfffffff4, func=0x0
[4396835.626782] nvgpu: 17000000.ga10b gv100_pmu_lsfm_init_acr_wpr_region:53   [ERR]  Failed to execute RPC status=0xfffffff4
[4396835.638032] nvgpu: 17000000.ga10b nvgpu_pmu_lsfm_bootstrap_ls_falcon:107  [ERR]  LSF init WPR region failed
[4396835.638327] nvgpu: 17000000.ga10b nvgpu_pmu_lsfm_bootstrap_ls_falcon:128  [ERR]  LSF Load failed
[4396835.638578] nvgpu: 17000000.ga10b nvgpu_gr_falcon_load_secure_ctxsw_ucode:714  [ERR]  Unable to recover GR falcon
[4396835.638875] nvgpu: 17000000.ga10b        nvgpu_gr_falcon_init_ctxsw:159  [ERR]  fail
[4396835.639104] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:92   [ERR]  Error reporting is not supported in this platform
[4396835.639436] nvgpu: 17000000.ga10b      gr_init_ctxsw_falcon_support:833  [ERR]  FECS context switch init error
[4396835.639702] nvgpu: 17000000.ga10b            nvgpu_finalize_poweron:1010 [ERR]  Failed initialization for: g->ops.gr.gr_init_support
[4396835.665634] nvgpu: 17000000.ga10b                 gk20a_power_write:127  [ERR]  power_node_write failed at busy
[4396846.765863] nvgpu: 17000000.ga10b                nvgpu_pmu_cmd_post:591  [ERR]  FBQ cmd setup failed
[4396846.766156] nvgpu: 17000000.ga10b             nvgpu_pmu_rpc_execute:713  [ERR]  Failed to execute RPC status=0xfffffff4, func=0x0
[4396846.766537] nvgpu: 17000000.ga10b gv100_pmu_lsfm_init_acr_wpr_region:53   [ERR]  Failed to execute RPC status=0xfffffff4
[4396846.777774] nvgpu: 17000000.ga10b nvgpu_pmu_lsfm_bootstrap_ls_falcon:107  [ERR]  LSF init WPR region failed
[4396846.778082] nvgpu: 17000000.ga10b nvgpu_pmu_lsfm_bootstrap_ls_falcon:128  [ERR]  LSF Load failed
[4396846.778331] nvgpu: 17000000.ga10b nvgpu_gr_falcon_load_secure_ctxsw_ucode:714  [ERR]  Unable to recover GR falcon
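
For reference, a minimal sketch combining the checks from steps 1-3 above (paths as they appear in this thread), which could be used to confirm the state on other units:

# 1. Build and run the CUDA device-count test.
nvcc test.cu -o test.out && ./test.out

# 2. Check that the nvgpu device nodes are present.
ls -l /dev/nvgpu/igpu0/

# 3. Look for nvgpu errors in the kernel log.
sudo dmesg | grep -i nvgpu | tail -n 50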

Our devices are still in the faulty state; please help us analyze the cause. We will try to keep the faulty units powered on without rebooting, so that the failure state is not destroyed.

Hi,

Do you use a custom board or the AGX Orin devkit?
Please also try rebooting the system to see if the GPU comes back to work.

Thanks.

  1. We use customized boards.
  2. We have not restarted the module yet, because I am worried the phenomenon will disappear after a reboot. Someone on the forum also mentioned a case where no restart was done.

This phenomenon appears to be caused by a problem with the GPU driver. Do you have any ideas? Currently, three devices are reporting this error.

Hi,

Could you try if the same issue can be reproduced on the Orin devkit?
We will need to reproduce the same issue internally to gather more info first.

Thanks.

Kernel error output (call trace):
[17538882.801114] os_dump_stack+0x18/0x20 [nvidia]
[17538882.801152] tlsEntryGet+0x130/0x138 [nvidia]
[17538882.801187] gpumgrGetSomeGpu+0x7c/0x90 [nvidia]
[17538882.801222] threadPriorityStateFree+0x234/0x2a0 [nvidia]
[17538882.801256] RmShutdownAdapter+0x168/0x268 [nvidia]
[17538882.801290] rm_shutdown_adapter+0x50/0x70 [nvidia]
[17538882.801324] nv_shutdown_adapter+0xb4/0x4b0 [nvidia]
[17538882.801358] nv_shutdown_adapter+0x2d8/0x4b0 [nvidia]
[17538882.801393] nv_shutdown_adapter+0x3a0/0x4b0 [nvidia]
[17538882.801428] nvidia_dev_put+0xa94/0xc40 [nvidia]
[17538882.801462] nvidia_frontend_close+0x50/0x78 [nvidia]

GPU clock information of the problematic device:

[screenshot: GPU clock information]
We enable the maximum power mode by default, but the clocks are not locked and the GPU had not been used for a long time. Could the GPU have entered a low-power state because of the prolonged inactivity?
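
For reference, a minimal sketch of how the power mode and clock state could be checked on an affected unit, assuming the standard JetPack tools nvpmodel and jetson_clocks are available:

# Show the currently selected power mode (e.g. MAXN).
sudo nvpmodel -q

# Show whether the GPU/CPU/EMC clocks are locked to their maximum
# frequencies or still allowed to scale down when idle.
sudo jetson_clocks --show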

Before the error message “libnvrm_gpu.so: NvRmGpuLibOpen failed, error=6” was reported, the system had very little free memory, and both jtop and tegrastats would hang when run. Could these conditions be related to the error?
[screenshot: system memory usage]
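
A minimal sketch, using standard Linux tools, of checks that could confirm whether memory pressure and the OOM killer were involved around the time of the error:

# Current memory and swap usage.
free -h

# Look for OOM-killer activity in the kernel log.
sudo dmesg | grep -iE "out of memory|oom-kill|killed process"

# Same search over the persisted journal, if journald keeps kernel logs.
sudo journalctl -k | grep -iE "out of memory|oom"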

Hi,

Sorry for the late update.
Just want to double-confirm the behavior to collect more info:

  1. Does the CPU or system stall when the GPU locks down? If yes, is the system rebooted by the watchdog?
  2. Have you met this on Devkit?
  3. Are you able to share reproducible steps so we can try it internally?
    We need further logs to know more about the exact cause.

Thanks.

Hi:

  1. No hardware watchdog has been set up. We suspect the OOM killer was triggered by the depletion of system memory and killed GPU-related services. Because the system logs rotate every 7 days, no relevant logs were captured (see the journald sketch after this list).

  2. We are using customized AGX Orin (32GB) boards

  3. The issue has already occurred on 3 devices, and we have extended the system log rotation to 1 year. We are currently trying to reproduce it (it has been 1 month so far).
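
A minimal sketch of the journald change referred to above, assuming systemd-journald collects the kernel log; the size cap is illustrative:

# Make the journal persistent so OOM-killer / nvgpu errors survive rotation.
sudo mkdir -p /var/log/journal
# In /etc/systemd/journald.conf set (values illustrative):
#   Storage=persistent
#   SystemMaxUse=2G
sudo systemctl restart systemd-journald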

I wonder if the options available in this script could help with OOM?

https://github.com/dusty-nv/jetson-containers/blob/master/scripts/optimize-ram-usage.sh

Hi,

Do you mean you are trying to find a reproducible case to share with us?
Based on your description, the system doesn’t reboot; only the GPU locks up. Is that correct?

Thanks.

Yes, that is what we are trying to do.

The system did not restart; the lockup appears to be on the GPU side, but no valid logs were captured.

Hi,

In the meantime, could you share the output of the two commands below with us?

$ sudo tegrastats
$ sudo cat /sys/kernel/debug/gpu.0/status

Thanks.

tegrastats gets stuck with no output; the jtop command also freezes with no output.

"/sys/kernel/debug/gpu.0/"directory is not functioning properly, and many nodes are missing compared to the normal situation.

Because the current failure has not been reproduced again, the specific value of /sys/kernel/debug/gpu.0/status is unknown.

Normal module:

[screenshot: /sys/kernel/debug/gpu.0/ on the normal module]

cat /dev/nvgpu/igpu0/power returns: 2
Abnormal module:

cat /dev/nvgpu/igpu0/power returns: 0
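
A minimal sketch (paths taken from the observations above) of how the debugfs listing and the power node could be compared between a normal and an affected module:

# On the normal module:
sudo ls /sys/kernel/debug/gpu.0/ | sort > gpu0_nodes_normal.txt
cat /dev/nvgpu/igpu0/power            # reads 2 here

# On the abnormal module:
sudo ls /sys/kernel/debug/gpu.0/ | sort > gpu0_nodes_failed.txt
cat /dev/nvgpu/igpu0/power            # reads 0 here

# Show exactly which debugfs nodes are missing on the failed unit.
diff gpu0_nodes_normal.txt gpu0_nodes_failed.txt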

Hi,

It is not expected that the status node doesn’t exist.

Do you find a way (or sample) to reproduce this issue?
If yes, could you share with us so we can give it a try?

Thanks.

We are actually running thousands of units, and some of them probably still have this issue; it just has not been discovered yet. If a device is found in this state, are there any auxiliary operations we could run, for example saving specific logs?

We have been monitoring two devices (the ones that had problems before) for a month, but the issue has not reproduced on them.
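
One possible auxiliary operation, sketched below on the assumption that the /dev/nvgpu/igpu0/power reading of 0 seen on the failed module reliably indicates the failure: run as root, it polls the node and dumps diagnostics once the failure is detected. File locations, interval, and timeouts are illustrative only.

#!/bin/bash
# Hypothetical watcher (run as root): poll the nvgpu power node and
# capture logs as soon as the failed reading (0) is observed.
LOGDIR=/var/log/gpu-failure          # illustrative location
mkdir -p "$LOGDIR"

while true; do
    power=$(cat /dev/nvgpu/igpu0/power 2>/dev/null)
    if [ "$power" = "0" ] || [ -z "$power" ]; then
        ts=$(date +%Y%m%d-%H%M%S)
        dmesg > "$LOGDIR/dmesg-$ts.txt"
        free -h > "$LOGDIR/free-$ts.txt"
        # tegrastats may hang on a failed unit, so bound it with a timeout.
        timeout 15 tegrastats --interval 1000 > "$LOGDIR/tegrastats-$ts.txt" 2>&1
        # debugfs nodes may be partially missing; capture whatever remains.
        ls /sys/kernel/debug/gpu.0/ > "$LOGDIR/gpu0-nodes-$ts.txt" 2>&1
        cat /sys/kernel/debug/gpu.0/status > "$LOGDIR/gpu0-status-$ts.txt" 2>&1
        break
    fi
    sleep 60
done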

Hi,

Please try to collect the logs below on both a normal environment and the lockup environment:

$ sudo tegrastats
$ sudo cat /sys/kernel/debug/gpu.0/gr_status
$ sudo cat /sys/kernel/debug/gpu.0/status

Please also check whether any errors or related logs show up in dmesg when the issue happens:

$ sudo dmesg
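
Since tegrastats reportedly hangs on a failed unit, a minimal sketch (file names illustrative) that bounds each collection command with a timeout so the capture does not block:

ts=$(date +%Y%m%d-%H%M%S)
sudo timeout 15 tegrastats --interval 1000 > "tegrastats-$ts.txt" 2>&1
sudo timeout 5 cat /sys/kernel/debug/gpu.0/gr_status > "gr_status-$ts.txt" 2>&1
sudo timeout 5 cat /sys/kernel/debug/gpu.0/status > "status-$ts.txt" 2>&1
sudo dmesg > "dmesg-$ts.txt"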

Thanks.