Hi All.
I have a long-running CUDA application that I run on a TX2 module. The system is based on the R28.2 release, and I have been experiencing GPU lockups. It can take several days before the application locks up.
Once it does, the CUDA API calls report that the device has not synchronised (a CUDA kernel is still running), and tegrastats reports the GPU stuck at 99% usage.
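To illustrate how the hang shows up in the application (a minimal sketch, not the actual application code; launch_work() is a hypothetical placeholder for our kernels):

```cpp
// Minimal sketch of how the lockup manifests in the application.
// Once the GPU is wedged, cudaStreamQuery() returns cudaErrorNotReady
// forever and cudaDeviceSynchronize() never returns.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // launch_work(stream);  // hypothetical: the long-running kernels go here

    cudaError_t err = cudaStreamQuery(stream);
    while (err == cudaErrorNotReady) {
        // Normally this loop terminates; when the GPU locks up we spin
        // here indefinitely while tegrastats shows 99% GPU usage.
        err = cudaStreamQuery(stream);
    }
    printf("stream state: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(stream);
    return 0;
}
```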
If I run a different CUDA application that resets the CUDA device (sketched after the log below), I then see errors in dmesg which indicate some sort of GPU lockup:
[212642.850906] nvgpu: 17000000.gp10b gk20a_fifo_handle_pbdma_intr_0:2689 [ERR] PBDMA intr PBENTRY invld Error
[212642.850917] nvgpu: 17000000.gp10b gk20a_fifo_handle_pbdma_intr_0:2705 [ERR] pbdma_intr_0(0):0x00040000 PBH: 2010000c SHADOW: 0000000f gp shadow0: 003a71a8 gp shadow1: 00036201 M0: 80410185 00000000 00000000 00000000
[212642.850925] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 507
[212642.850928] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 506
[212642.850931] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 505
[212642.850934] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 504
[212642.851284] nvgpu: 17000000.gp10b gk20a_fifo_handle_pbdma_intr_0:2689 [ERR] PBDMA intr PBENTRY invld Error
[212642.851293] nvgpu: 17000000.gp10b gk20a_fifo_handle_pbdma_intr_0:2705 [ERR] pbdma_intr_0(0):0x00040000 PBH: 20500000 SHADOW: 00300000 gp shadow0: 003a71a8 gp shadow1: 00036201 M0: 00000000 00000000 00000000 00000000
[212642.851298] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 507
[212642.851301] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 506
[212642.851304] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 505
[212642.851307] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 32 for ch 504
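For reference, the reset utility is essentially just the following (a minimal sketch; the assumption is that creating a fresh CUDA context and then calling cudaDeviceReset() is enough to flush the errors into dmesg):

```cpp
// Minimal sketch of the separate "reset" application.
// cudaFree(0) forces context creation; cudaDeviceReset() tears it down.
// On the locked-up board, this is the point at which the PBDMA /
// ctxsw-timeout errors above appear in dmesg.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaError_t err = cudaFree(0);  // idiomatic way to initialise a context
    printf("context init: %s\n", cudaGetErrorString(err));

    err = cudaDeviceReset();
    printf("cudaDeviceReset: %s\n", cudaGetErrorString(err));
    return 0;
}
```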
I then tried the R32.4.2 release, which exhibited the same problem:
[495451.246959] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 507
[495451.246965] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 506
[495451.246968] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 505
[495451.246971] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 504
[495451.246976] nvgpu: 17000000.gp10b gk20a_fifo_handle_sched_error:2531 [ERR] fifo sched ctxsw timeout error: engine=0, tsg=4, ms=3100
[495451.247093] ---- mlocks ----
Please find the full logs attached to this post:
GpuLockup_28_2.log (14.6 KB)
GpuLockup_32_4_2.log (14.6 KB)
Is there a known issue that can cause this sort of lockup?
Are there any fixes or mitigations?