CUDA_ERROR_LAUNCH_TIMEOUT (702): the launch timed out and was terminated

Description

We observe these timeouts after porting our algorithms completely to CUDA; before that, only parts were ported to CUDA while most of the code ran on the CPU. The applications run for hours using the same input data over and over again, and then the timeouts suddenly occur. The timeouts are observable on both TX2 and TX2NX devices.

The error frequency seems to be random: sometimes the error appears after 5 minutes, and sometimes after 30 hours and millions of frames processed. It looks a bit like TX2NX has a higher error frequency than TX2, but this might be due to limited statistics. On average, a process runs normally for 25 hours.

We use multiple processes which concurrently use the GPU. The processes use hand-written CUDA kernels as well as functions from nppi, npps, cufft and thrust libraries. Power saving is disabled by using jetson_clocks and NV Power Mode MAXN.

Note that we don’t use Xorg, and we have made sure that the watchdog is not enabled, so these errors are unexpected. As documented, the affected processes are no longer able to use CUDA functions after the error has occurred.

Unfortunately, we have not yet found a minimal example that would be easy to share, but we hope that the provided information is already enough to help us. Is this a known issue, and are there known workarounds other than restarting the process?
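For reference, the "restart the process" workaround mentioned above can be sketched as a small supervisor. This is only a minimal sketch, assuming the worker exits with a non-zero status once it sees CUDA_ERROR_LAUNCH_TIMEOUT; the function name `run_with_restart` and its parameters are made up for illustration and are not part of our actual setup.

```python
import subprocess
import time


def run_with_restart(cmd, max_restarts=3, backoff_s=1.0):
    """Run cmd and start a fresh process whenever it exits with a non-zero status.

    Assumption: the worker terminates with a non-zero exit code after a CUDA
    error such as CUDA_ERROR_LAUNCH_TIMEOUT, since it can no longer use CUDA
    functions afterwards and only a new process gets a working context.

    Returns the list of exit codes observed, in order.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        if code == 0:
            break  # clean shutdown, nothing to restart
        if attempt < max_restarts:
            time.sleep(backoff_s)  # short pause before the next attempt
    return exit_codes
```

A call such as `run_with_restart(["./frame_worker"], max_restarts=10)` (with `./frame_worker` being a hypothetical binary) would keep the pipeline alive across sporadic timeouts, at the cost of losing the frames in flight.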

Environments where the error can be reproduced:

TX2 evaluation kit
# R32 (release), REVISION: 4.4, GCID: 23942405, BOARD: t186ref, EABI: aarch64, DATE: Fri Oct 16 19:37:08 UTC 2020

TX2 device integrated in our own hardware
# R32 (release), REVISION: 4.3, GCID: 21589087, BOARD: t186ref, EABI: aarch64, DATE: Fri Jun 26 04:34:27 UTC 2020

TX2NX device integrated in our own hardware
# R32 (release), REVISION: 7.3, GCID: 31982016, BOARD: t186ref, EABI: aarch64, DATE: Tue Nov 22 17:32:54 UTC 2022
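To make the three environments easier to compare, the release header lines above (they look like the first line of /etc/nv_tegra_release) can be turned into version tuples. This is a rough sketch; the helper name `parse_l4t_version` is hypothetical, and the regular expression only covers the header format shown here.

```python
import re

# Matches headers such as "# R32 (release), REVISION: 7.3, GCID: ..."
_RELEASE_RE = re.compile(r"#\s*R(\d+)\s*\(release\),\s*REVISION:\s*(\d+)\.(\d+)")


def parse_l4t_version(header_line):
    """Return (major, minor, patch) for a release header line.

    Example: "# R32 (release), REVISION: 7.3, ..." -> (32, 7, 3)
    """
    match = _RELEASE_RE.search(header_line)
    if match is None:
        raise ValueError("unrecognized release header: %r" % header_line)
    return tuple(int(group) for group in match.groups())
```

Because Python compares tuples element by element, a check such as `parse_l4t_version(line) < (32, 7, 4)` is enough to tell whether a device is older than a given release.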

Log entries related to the error

Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b   nvgpu_set_error_notifier_locked:136  [ERR]  error notifier set to 8 for ch 506
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b   nvgpu_set_error_notifier_locked:136  [ERR]  error notifier set to 8 for ch 501
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b   nvgpu_set_error_notifier_locked:136  [ERR]  error notifier set to 8 for ch 497
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b   nvgpu_set_error_notifier_locked:136  [ERR]  error notifier set to 8 for ch 493
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b     gk20a_fifo_handle_sched_error:2550 [ERR]  fifo sched ctxsw timeout error: engine=0, tsg=4, ms=3100
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: ---- mlocks ----
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: ---- syncpts ----
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id 23 (gp10b_503) min 8352 max 8352 refs 1 (previous client : )
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id 24 (gp10b_502) min 7064 max 7064 refs 1 (previous client : )
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id 25 (gp10b_501) min 8002 max 8002 refs 1 (previous client : )
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id 28 (gp10b_500) min 327208 max 327210 refs 1 (previous client : )
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id 29 (gp10b_499) min 341088 max 341090 refs 1 (previous client : )
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id 30 (gp10b_498) min 8372 max 8372 refs 1 (previous client : )
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id 31 (gp10b_497) min 371752 max 371754 refs 1 (previous client : )
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id 34 (gp10b_494) min 355994 max 355996 refs 1 (previous client : )
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id 42 (gp10b_486) min 6520 max 6520 refs 1 (previous client : )
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id 43 (gp10b_485) min 330068 max 330070 refs 1 (previous client : )
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id 47 (gp10b_481) min 7094 max 7094 refs 1 (previous client : )
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id 48 (gp10b_480) min 330646 max 330648 refs 1 (previous client : )
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: ---- channels ----
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: NvHost basic channel registers:
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CMDFIFO_STAT_0:  00002040
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CMDFIFO_RDATA_0: 00000002
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CMDP_OFFSET_0:   00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CMDP_CLASS_0:    00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CHANNELSTAT_0:   00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: The CDMA sync queue is empty.
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: NvHost basic channel registers:
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CMDFIFO_STAT_0:  00002040
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CMDFIFO_RDATA_0: 00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CMDP_OFFSET_0:   00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CMDP_CLASS_0:    00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CHANNELSTAT_0:   00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: The CDMA sync queue is empty.
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: NvHost basic channel registers:
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CMDFIFO_STAT_0:  00002040
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CMDFIFO_RDATA_0: 90000040
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CMDP_OFFSET_0:   00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CMDP_CLASS_0:    00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: CHANNELSTAT_0:   00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: The CDMA sync queue is empty.
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: sync_intc0mask = 0x00000001
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: sync_intmask = 0x50000003
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(0) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(1) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(2) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(3) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(4) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(5) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(6) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(7) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(8) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(9) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(10) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(11) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(12) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(13) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(14) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(15) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(16) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: syncpt_thresh_cpu0_int_status(17) = 0x00000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: gp10b pbdma 0: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id: 6 (tsg), next_id: 6 (tsg) chan status: valid
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: PBDMA_PUT: 00000001002effc0 PBDMA_GET: 00000001002efe9c GP_PUT: 0000085f GP_GET: 00000840 FETCH: 0000085f HEADER: 201101b0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: gp10b eng 0: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id: 4 (tsg), next_id: 6 (tsg), ctx status: switch 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: busy 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: gp10b eng 1: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: id: 9 (tsg), next_id: 9 (tsg), ctx status: valid 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 478-gp10b, pid 5950, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 000000010071f540 GET: 000000010071f540 FETCH: 000002010071f540
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 479-gp10b, pid 5950, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000100550294 GET: 0000000100550294 FETCH: 0000020100550294
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 480-gp10b, pid 5950, refs 3, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use pending busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 8000001f00148bf8 PUT: 0000001f00148bf8 GET: 0000001f00148bf8 FETCH: 0000001f00148bf8
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 481-gp10b, pid 5950, refs 3, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 8000001f000f4c88 PUT: 00000001003baa1c GET: 00000001003baa1c FETCH: 00000201003baa1c
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 482-gp10b, pid 5950, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use pending busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 000000010031fc2c GET: 000000010031fc2c FETCH: 000002010031fc2c
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 483-gp10b, pid 5979, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use pending busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000100708af8 GET: 0000000100708af8 FETCH: 0000020100708af8
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 484-gp10b, pid 5979, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000100550294 GET: 0000000100550294 FETCH: 0000020100550294
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 485-gp10b, pid 5979, refs 3, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use pending busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 8000001f001470e0 PUT: 0000001f001470e0 GET: 0000001f001470e0 FETCH: 0000001f001470e0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 486-gp10b, pid 5979, refs 3, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 8000001f000f31a0 PUT: 00000001003a28e0 GET: 00000001003a28e0 FETCH: 00000201003a28e0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 487-gp10b, pid 5979, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 00000001002f8f48 GET: 00000001002f8f48 FETCH: 00000201002f8f48
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 488-gp10b, pid 6026, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use pending busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 000000010072a580 GET: 000000010072a580 FETCH: 000002010072a580
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 489-gp10b, pid 6064, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use pending_acquire busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 00000001006c6274 GET: 00000001006c6218 FETCH: 00000201006c6274
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 490-gp10b, pid 6047, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use pending_acquire busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 00000001006d7dc4 GET: 00000001006d7d94 FETCH: 00000201006d7dc4
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 491-gp10b, pid 6026, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000100550294 GET: 0000000100550294 FETCH: 0000020100550294
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 492-gp10b, pid 5999, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 00000001007280b4 GET: 00000001007280b4 FETCH: 00000201007280b4
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 493-gp10b, pid 6064, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000100550294 GET: 0000000100550294 FETCH: 0000020100550294
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 494-gp10b, pid 6026, refs 3, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use pending busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 8000001f00153038 PUT: 0000001f00153038 GET: 0000001f00153038 FETCH: 0000001f00153038
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 495-gp10b, pid 6047, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000100550294 GET: 0000000100550294 FETCH: 0000020100550294
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 496-gp10b, pid 5999, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000100550294 GET: 0000000100550294 FETCH: 0000020100550294
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 497-gp10b, pid 6064, refs 3, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use pending_acquire busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 8000001f001412f0 PUT: 0000001f00141308 GET: 0000001f001412f0 FETCH: 0000001f00141308
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 498-gp10b, pid 6026, refs 3, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 8000001f000f8870 PUT: 00000001003a792c GET: 00000001003a792c FETCH: 00000201003a792c
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 499-gp10b, pid 6047, refs 3, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use pending busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 8000001f00147578 PUT: 00000001004beaa0 GET: 00000001004bea64 FETCH: 00000201004beaa0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 500-gp10b, pid 5999, refs 3, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use pending_acquire busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 8000001f0015eac8 PUT: 0000001f0015eae0 GET: 0000001f0015eac8 FETCH: 0000001f0015eae0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 501-gp10b, pid 6064, refs 3, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 8000001f000f7718 PUT: 00000001004159a0 GET: 00000001004159a0 FETCH: 00000201004159a0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 502-gp10b, pid 5999, refs 3, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 8000001f000f4b20 PUT: 00000001003c32dc GET: 00000001003c32dc FETCH: 00000201003c32dc
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 503-gp10b, pid 6047, refs 3, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 8000001f000f8780 PUT: 00000001003672d4 GET: 00000001003672d4 FETCH: 00000201003672d4
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 504-gp10b, pid 5999, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000100282ca8 GET: 0000000100282ca8 FETCH: 0000020100282ca8
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 505-gp10b, pid 6047, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use on_pbdma_and_eng busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 00000001002effc0 GET: 00000001002efe8c FETCH: 00000201002effc0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 506-gp10b, pid 6064, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use on_eng_pending busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 00000001002d0fa4 GET: 00000001002d0f28 FETCH: 00000201002d0fa4
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 507-gp10b, pid 6026, refs 2, deterministic: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000100251fc0 GET: 0000000100251fc0 FETCH: 0000020100251fc0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 508-gp10b, pid 4104, refs 2: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 509-gp10b, pid 4104, refs 2: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 510-gp10b, pid 4104, refs 2: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 511-gp10b, pid 4104, refs 2: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: channel status:  in use idle not busy
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: 
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b gk20a_fifo_handle_mmu_fault_locked:1710 [ERR]  fake mmu fault on engine 0, engine subid 0 (gpc), client 0 (l1 0), addr 0x841270f4a000, type 0 (pde), access_type 0x00000000,inst_ptr 0x1028880000
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:128  [ERR]  gr_fecs_os_r : 0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:130  [ERR]  gr_fecs_cpuctl_r : 0x40
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:132  [ERR]  gr_fecs_idlestate_r : 0x1
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:134  [ERR]  gr_fecs_mailbox0_r : 0x3ff
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:136  [ERR]  gr_fecs_mailbox1_r : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:138  [ERR]  gr_fecs_irqstat_r : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:140  [ERR]  gr_fecs_irqmode_r : 0x4
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:142  [ERR]  gr_fecs_irqmask_r : 0x8704
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:144  [ERR]  gr_fecs_irqdest_r : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:146  [ERR]  gr_fecs_debug1_r : 0x40
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:148  [ERR]  gr_fecs_debuginfo_r : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:150  [ERR]  gr_fecs_ctxsw_status_1_r : 0x140
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(0) : 0x4
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(1) : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(2) : 0x50009
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(3) : 0x20
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(4) : 0x1ffda0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(5) : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(6) : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(7) : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(8) : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(9) : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(10) : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(11) : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(12) : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(13) : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(14) : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:154  [ERR]  gr_fecs_ctxsw_mailbox_r(15) : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:158  [ERR]  gr_fecs_engctl_r : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:160  [ERR]  gr_fecs_curctx_r : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:162  [ERR]  gr_fecs_nxtctx_r : 0x0
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:168  [ERR]  FECS_FALCON_REG_IMB : 0xbadfbadf
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:174  [ERR]  FECS_FALCON_REG_DMB : 0xbadfbadf
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:180  [ERR]  FECS_FALCON_REG_CSW : 0xbadfbadf
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:186  [ERR]  FECS_FALCON_REG_CTX : 0xbadfbadf
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:192  [ERR]  FECS_FALCON_REG_EXCI : 0xbadfbadf
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:199  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:205  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:199  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:205  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:199  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
Apr 28 18:36:00 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:205  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
Apr 28 18:36:01 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:199  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
Apr 28 18:36:01 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b      gk20a_fecs_dump_falcon_stats:205  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
Apr 28 18:36:01 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b gk20a_fifo_handle_mmu_fault_locked:1726 [ERR]  gr_status_r : 0x1000081
Apr 28 18:36:01 ovp81x-3b-5f-7f kernel: nvgpu: 17000000.gp10b                    fifo_error_isr:2625 [ERR]  channel reset initiated from fifo_error_isr; intr=0x00000100
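Since the dump is long, here is a small sketch of how the key facts can be pulled out of it: the ctxsw timeout line, the channels whose error notifier was set, and the PIDs owning those channels. The function name `summarize_nvgpu_log` is hypothetical, and the patterns are only fitted to the message formats visible in this log, not to nvgpu output in general.

```python
import re

NOTIFIER_RE = re.compile(r"error notifier set to (\d+) for ch (\d+)")
TIMEOUT_RE = re.compile(
    r"fifo sched ctxsw timeout error: engine=(\d+), tsg=(\d+), ms=(\d+)"
)
CHANNEL_RE = re.compile(r"(\d+)-gp10b, pid (\d+)")


def summarize_nvgpu_log(lines):
    """Collect timeouts, flagged channels, and the PIDs that own them."""
    channel_pid = {}  # channel id -> owning pid, from the "---- channels ----" dump
    flagged = []      # channel ids whose error notifier was set
    timeouts = []     # one dict per ctxsw timeout line
    for line in lines:
        m = CHANNEL_RE.search(line)
        if m:
            channel_pid[int(m.group(1))] = int(m.group(2))
        m = NOTIFIER_RE.search(line)
        if m:
            flagged.append(int(m.group(2)))
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts.append(
                {"engine": int(m.group(1)), "tsg": int(m.group(2)), "ms": int(m.group(3))}
            )
    # The notifier lines come before the channel dump, so the channel -> pid
    # mapping is only resolved after all lines have been read.
    pids = sorted({channel_pid[ch] for ch in flagged if ch in channel_pid})
    return {"timeouts": timeouts, "flagged_channels": flagged, "affected_pids": pids}
```

Applied to the log above, the four flagged channels (506, 501, 497, 493) all appear in the channel dump with pid 6064, so in this instance a single process had its channels marked with the error.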

Hi,

Could you check the topic below to see whether you are hitting the same issue?

Thanks.

Hi,

Could you give r32.7.4 a try?

There is a similar issue that was fixed in March 2023.
So r32.7.4 should contain the fix while other branches don’t.
Thanks.

Thanks for the quick response! We are in the process of updating to r32.7.4 and will report back in the next few days.

Updating to 32.7.4 does indeed seem to fix the issue: we haven’t observed any timeouts since the update. Thanks for that!

Another note: With the pre-32.7.4 versions we seem to observe the timeouts when multiple processes are active on the TX2/TX2NX. When only one process is active, the error seems to show up as a stall instead.

May I ask for some more details about the fixed issue? Are there known workarounds in user code when using pre-32.7.4 versions?

Hi,

Some registers were not configured correctly, which leads to this issue.
The fix is included in r32.7.4.

If you want to use an earlier branch, please apply the patch below.
9f0f331.diff (3.5 KB)

Thanks.
