TX2 nvgpu lockup

Hi All.

I have a long-running CUDA application that runs on a TX2 module. The system is based on the R28.2 release, and I have been experiencing GPU lockups. It can take several days of running before the application locks up.

When this happens, CUDA API calls indicate that the device has not synchronised (a CUDA kernel is still running), and tegrastats reports the GPU stuck at 99% usage.
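A rough way to spot this state automatically is to watch the tegrastats output for a long unbroken run of high GPU load. The following is only a sketch: the function names are hypothetical, and it assumes the GPU load appears as a `GR3D_FREQ 99%@1300` field (or `GR3D 99%@1300` on older releases), which may differ between L4T versions. A legitimately busy GPU will also trip it, so the sample count needs tuning to the workload.

```python
import re

# Assumption: tegrastats prints GPU load as "GR3D_FREQ <pct>%@<MHz>"
# (or "GR3D <pct>%@<MHz>" on older releases). Adjust if your output differs.
GR3D_RE = re.compile(r"GR3D(?:_FREQ)?\s+(\d+)%")


def gpu_load(line):
    """Return the GPU load percentage found in one tegrastats line, or None."""
    match = GR3D_RE.search(line)
    return int(match.group(1)) if match else None


def looks_stuck(lines, threshold=99, min_samples=60):
    """Return True if `min_samples` consecutive samples are at or above `threshold`.

    Lines without a recognisable GPU load field are skipped and do not
    break a run. This is a heuristic, not proof of a lockup.
    """
    run = 0
    for line in lines:
        load = gpu_load(line)
        if load is None:
            continue
        run = run + 1 if load >= threshold else 0
        if run >= min_samples:
            return True
    return False
```

The lines could come from a saved tegrastats log or from the tool's output piped into the script. What to do when the check fires (restart the application, reboot, collect dmesg) is left to the caller.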

If I run a different CUDA application which resets the CUDA device, I then see errors in dmesg which indicate some sort of GPU lockup:

[212642.850906] nvgpu: 17000000.gp10b gk20a_fifo_handle_pbdma_intr_0:2689 [ERR] PBDMA intr PBENTRY invld Error
[212642.850917] nvgpu: 17000000.gp10b gk20a_fifo_handle_pbdma_intr_0:2705 [ERR] pbdma_intr_0(0):0x00040000 PBH: 2010000c SHADOW: 0000000f gp shadow0: 003a71a8 gp shadow1: 00036201M0: 80410185 00000000 00000000 00000000
[212642.850925] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 507
[212642.850928] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 506
[212642.850931] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 505
[212642.850934] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 504
[212642.851284] nvgpu: 17000000.gp10b gk20a_fifo_handle_pbdma_intr_0:2689 [ERR] PBDMA intr PBENTRY invld Error
[212642.851293] nvgpu: 17000000.gp10b gk20a_fifo_handle_pbdma_intr_0:2705 [ERR] pbdma_intr_0(0):0x00040000 PBH: 20500000 SHADOW: 00300000 gp shadow0: 003a71a8 gp shadow1: 00036201M0: 00000000 00000000 00000000 00000000
[212642.851298] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 507
[212642.851301] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 506
[212642.851304] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 505
[212642.851307] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 504

I then tried the 32.4.2 release, which had the same problem:

[495451.246959] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 507
[495451.246965] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 506
[495451.246968] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 505
[495451.246971] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 504
[495451.246976] nvgpu: 17000000.gp10b gk20a_fifo_handle_sched_error:2531 [ERR] fifo sched ctxsw timeout error: engine=0, tsg=4, ms=3100
[495451.247093] ---- mlocks ----
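When comparing logs across releases, it can help to summarise which channels were flagged and with which notifier value. This is a minimal sketch that assumes the line format shown in the dmesg excerpts above. The function name and regular expressions are illustrative, and it does not try to decode what the notifier values (32 and 8 here) mean.

```python
import re
from collections import defaultdict

# Matches e.g.:
# [212642.850925] nvgpu: ... [ERR] error notifier set to 32 for ch 507
NOTIFIER_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvgpu:.*"
    r"error notifier set to (?P<code>\d+) for ch (?P<ch>\d+)"
)

# Matches e.g.:
# [495451.246976] nvgpu: ... ctxsw timeout error: engine=0, tsg=4, ms=3100
CTXSW_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvgpu:.*"
    r"ctxsw timeout error: engine=(?P<engine>\d+), tsg=(?P<tsg>\d+), ms=(?P<ms>\d+)"
)


def summarise_nvgpu_errors(lines):
    """Group flagged channels by notifier value and collect ctxsw timeouts."""
    channels_by_code = defaultdict(set)
    timeouts = []
    for line in lines:
        match = NOTIFIER_RE.search(line)
        if match:
            channels_by_code[int(match.group("code"))].add(int(match.group("ch")))
            continue
        match = CTXSW_RE.search(line)
        if match:
            timeouts.append({
                "ts": float(match.group("ts")),
                "engine": int(match.group("engine")),
                "tsg": int(match.group("tsg")),
                "ms": int(match.group("ms")),
            })
    notifiers = {code: sorted(chs) for code, chs in channels_by_code.items()}
    return notifiers, timeouts
```

The lines can be read from a saved log file or from captured `dmesg` output. Lines that match neither pattern are ignored.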

Please find logs attached to this post:
GpuLockup_28_2.log (14.6 KB) GpuLockup_32_4_2.log (14.6 KB)

Is there a known issue which can cause this sort of lockup?
Are there any fixes / mitigations?

Hi,

This is a known issue in rel-28.2.
Please upgrade your system to rel-32.

Thanks.

I have tried rel-32.4.2 and had similar issues, as seen in GpuLockup_32_4_2.log. Given that it takes days to hit the issue, it is difficult to test and debug.

Does it make sense that I would run into the issue on 32.4.2 release?
Could you confirm that the cause of the issue has been determined and fixed in R32.4.4?

@AastaLLL @WayneWWW

Hi. We are facing this GPU lockup issue on our current product, which is based on 28.2. I can update to the newer 32.4.4 release with some difficulty (there seem to be some issues that I need to debug). However, I do need some confidence in the release.
Could you give me more information on the issue?
Is the cause of the lockup known?
Has the issue been fixed?
Is it possible to backport a fix to R28.2?

As you can see, I have hit a similar bug in R32.4.2. Is this the same issue?
Is the issue related to this issue? If so, is the best option to go to the R32.3.1 release?

Thanks,

Akmal

No, I think they are different issues.

Please go directly to rel-32.4.4.

I am in the process of creating a 32.4.4 based build.
Can you confirm that this issue has been fixed?

Thanks,

Akmal

@WayneWWW @AastaLLL

I have just tried the latest 32.4.4 release and still have the problem. See the attached log:
"
[227078.417203] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 507
[227078.417209] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 506
[227078.417212] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 505
[227078.417216] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 504
[227078.417220] nvgpu: 17000000.gp10b gk20a_fifo_handle_sched_error:2531 [ERR] fifo sched ctxsw timeout error: engine=0, tsg=4, ms=3100
"
The GPU gets into a lockup state, where tegrastats reports it stuck at 99% usage.
Running a different application which launches work on the GPU can then cause it to crash and print the messages in the log.

nvgpu_lockup_32_4_4.txt (14.3 KB)

Could you tell us what kind of app you are running?

We would like to reproduce this problem with our devkit.

I can discuss the app in private messages.
