TX2 nvgpu lockup

akmal.ali · November 17, 2020, 11:43am

Hi All.

I have a long-running CUDA application that I run on a TX2 module. The system is based on the R28.2 release and I have been experiencing GPU lockups. It can take several days before the application locks up.

The CUDA API calls report that the device is not synchronised (still running a CUDA kernel), and tegrastats reports the GPU stuck at 99% usage.

If I run a different CUDA application which resets the CUDA device, I then see errors in dmesg which indicate some sort of GPU lockup:

[212642.850906] nvgpu: 17000000.gp10b gk20a_fifo_handle_pbdma_intr_0:2689 [ERR] PBDMA intr PBENTRY invld Error
[212642.850917] nvgpu: 17000000.gp10b gk20a_fifo_handle_pbdma_intr_0:2705 [ERR] pbdma_intr_0(0):0x00040000 PBH: 2010000c SHADOW: 0000000f gp shadow0: 003a71a8 gp shadow1: 00036201 M0: 80410185 00000000 00000000 00000000
[212642.850925] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 507
[212642.850928] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 506
[212642.850931] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 505
[212642.850934] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 504
[212642.851284] nvgpu: 17000000.gp10b gk20a_fifo_handle_pbdma_intr_0:2689 [ERR] PBDMA intr PBENTRY invld Error
[212642.851293] nvgpu: 17000000.gp10b gk20a_fifo_handle_pbdma_intr_0:2705 [ERR] pbdma_intr_0(0):0x00040000 PBH: 20500000 SHADOW: 00300000 gp shadow0: 003a71a8 gp shadow1: 00036201 M0: 00000000 00000000 00000000 00000000
[212642.851298] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 507
[212642.851301] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 506
[212642.851304] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 505
[212642.851307] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 504

I then tried the 32.4.2 release, which had the same problem:

[495451.246959] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 507
[495451.246965] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 506
[495451.246968] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 505
[495451.246971] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 504
[495451.246976] nvgpu: 17000000.gp10b gk20a_fifo_handle_sched_error:2531 [ERR] fifo sched ctxsw timeout error: engine=0, tsg=4, ms=3100
[495451.247093] ---- mlocks ----
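
Since the two releases report different error notifier values (32 on R28.2, 8 on R32.4.2), it can help to summarise a captured dmesg log by notifier value and channel before comparing runs. The following is only a minimal sketch in Python: the regular expression is derived from the log lines shown above, `summarize_notifiers` and `SAMPLE` are hypothetical names, and no interpretation of the numeric values is attempted.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [495451.246959] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137
#   [ERR] error notifier set to 8 for ch 507
NOTIFIER_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvgpu:.*"
    r"error notifier set to (?P<code>\d+) for ch (?P<ch>\d+)"
)


def summarize_notifiers(log_text):
    """Return {notifier value: sorted channel ids} for every notifier line found."""
    channels = defaultdict(set)
    for line in log_text.splitlines():
        match = NOTIFIER_RE.search(line)
        if match:
            channels[int(match.group("code"))].add(int(match.group("ch")))
    return {code: sorted(chs) for code, chs in channels.items()}


# A few lines copied from the logs in this post.
SAMPLE = """\
[212642.850906] nvgpu: 17000000.gp10b gk20a_fifo_handle_pbdma_intr_0:2689 [ERR] PBDMA intr PBENTRY invld Error
[212642.850925] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 507
[212642.850928] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 506
[495451.246959] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 507
[495451.246965] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 506
"""

if __name__ == "__main__":
    print(summarize_notifiers(SAMPLE))
```

Feeding it the saved output of `dmesg` from each release gives a quick way to check whether the same channels are affected each time.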

Please find logs attached to this post:
GpuLockup_28_2.log (14.6 KB) GpuLockup_32_4_2.log (14.6 KB)

Is there a known issue which can cause this sort of lockup?
Are there any fixes / mitigations?

AastaLLL · November 18, 2020, 2:07am

Hi,

This is a known issue in rel-28.2.
Please upgrade your system to rel-32.

Thanks.

akmal.ali · November 18, 2020, 8:46am

I have tried rel-32.4.2 and had similar issues, as seen in GpuLockup_32_4_2.log. Given that it takes days to run into this issue, it is difficult to test and debug.

Does it make sense that I would run into the issue on the 32.4.2 release?
Could you confirm that the cause of the issue has been determined and fixed in R32.4.4?

akmal.ali · November 20, 2020, 5:17pm

@AastaLLL @WayneWWW

Hi. We are facing this GPU lockup issue on our current product, which is based on 28.2. I can update to the newer 32.4.4 release with some difficulty (there seem to be some issues that I need to debug). However, I do need some confidence in the release.
Could you give me more information on the issue?
Is the cause of the lockup known?
Has the issue been fixed?
Is it possible to backport a fix to R28.2?

As you can see, I have hit a similar bug in R32.4.2. Is this the same issue?
Is it related to this issue? If so, is the best option to go to the R32.3.1 release?

Thanks,

Akmal

WayneWWW · November 20, 2020, 5:48pm

No, I think they are different issues.

Please directly go to rel32.4.4.

akmal.ali · November 20, 2020, 6:50pm

I am in the process of creating a 32.4.4 based build.
Can you confirm that this issue has been fixed?

Thanks,

Akmal

akmal.ali · November 23, 2020, 9:47am

@WayneWWW @AastaLLL

I have just tried the latest 32.4.4 release and still have the problem. See the attached log:
"
[227078.417203] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 507
[227078.417209] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 506
[227078.417212] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 505
[227078.417216] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 504
[227078.417220] nvgpu: 17000000.gp10b gk20a_fifo_handle_sched_error:2531 [ERR] fifo sched ctxsw timeout error: engine=0, tsg=4, ms=3100
"
The GPU gets into a lockup state, where tegrastats reports it stuck at 99% usage.
Running a different application which launches on the GPU can then cause it to crash and print out the messages in the log.
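
Because the lockup takes days to appear, the stuck-at-99% symptom could be watched for automatically by scanning saved tegrastats output. The sketch below is in Python and rests on assumptions: it expects each tegrastats line to contain a field like `GR3D_FREQ 99%@1300` (or `GR3D 99%@1300` on older releases), and `gpu_load`, `looks_stuck`, and the thresholds are hypothetical. A legitimately busy GPU can also report high usage for a long time, so a match is only a hint that the device is worth checking.

```python
import re

# Assumed field format: "GR3D_FREQ 99%@1300" or "GR3D 99%@1300".
GR3D_RE = re.compile(r"GR3D(?:_FREQ)?\s+(\d+)%")


def gpu_load(line):
    """Return the GPU load percentage from one tegrastats line, or None if absent."""
    match = GR3D_RE.search(line)
    return int(match.group(1)) if match else None


def looks_stuck(lines, threshold=99, min_samples=5):
    """True if at least min_samples consecutive samples are at or above threshold."""
    run = 0
    for line in lines:
        load = gpu_load(line)
        if load is None:
            continue  # skip lines without a GPU load field
        run = run + 1 if load >= threshold else 0
        if run >= min_samples:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical samples, not taken from the attached logs.
    samples = ["RAM 2448/7846MB GR3D_FREQ 99%@1300"] * 6
    print(looks_stuck(samples))
```

In practice `min_samples` would be chosen from the tegrastats sampling interval and the longest kernel the application is expected to run.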

nvgpu_lockup_32_4_4.txt (14.3 KB)

WayneWWW · November 23, 2020, 9:51am

Could you tell us what kind of app you are running?

We would like to reproduce this problem with our devkit.

akmal.ali · November 23, 2020, 10:18am

I can discuss the app in private messages.

sylvain.fabre · January 18, 2022, 9:29am

Hello !

Is this bug solved in a specific release? Or is it still there?

kayccc · January 26, 2022, 5:34am

Hi sylvain.fabre,

We don’t have the app to reproduce the issue.
Please open a new topic if you run into a similar problem.

Thanks
