TX2 MMU fault during V4L2 dmabuf capture + CUDA analysis

We have been seeing MMU faults on multiple TX2 systems (running JetPack 4.6, L4T 32.6.1) during capture from a single MIPI camera with concurrent CUDA analysis. The fault takes many hours of use to trigger.
From reading other posts about this kind of error, I understand that there is not much you can do without example source that reproduces the fault on a devkit system. Still, I would greatly appreciate any documentation or tips on how to interpret the output of the fault handler (pasted below), e.g. what is “TSG 6”? I would also like to know how to enable more kernel tracing for this case. The pieces involved are CUDA, including managed memory buffers, and direct V4L2 calls using dmabufs, which are fed into CUDA via NvEGLImageFromFd and cudaGraphicsEGLRegisterImage.

Feb 02 19:55:32 localhost kernel: ---- mlocks ----
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: ---- syncpts ----
Feb 02 19:55:32 localhost kernel: id 2 (disp_a) min 96 max 96 refs 1 (previous client : )
Feb 02 19:55:32 localhost kernel: id 8 (vblank0) min 1704 max -2 refs 1 (previous client : )
Feb 02 19:55:32 localhost kernel: id 11 (dsi) min 197 max 0 refs 1 (previous client : )
Feb 02 19:55:32 localhost kernel: id 20 (gp10b_507) min 12 max 12 refs 1 (previous client : )
Feb 02 19:55:32 localhost kernel: id 21 (gp10b_506) min 338 max 338 refs 1 (previous client : )
Feb 02 19:55:32 localhost kernel: id 23 (gp10b_504) min 14088 max 14090 refs 1 (previous client : )
Feb 02 19:55:32 localhost kernel: id 29 (tegra-vi4) min 98 max 99 refs 1 (previous client : tegra-vi4)
Feb 02 19:55:32 localhost kernel: id 30 (tegra-vi4) min 97 max 99 refs 1 (previous client : tegra-vi4)
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: ---- channels ----
Feb 02 19:55:32 localhost kernel: 
                                  channel 2 - 15820000.se
Feb 02 19:55:32 localhost kernel: NvHost basic channel registers:
Feb 02 19:55:32 localhost kernel: CMDFIFO_STAT_0:  00002040
Feb 02 19:55:32 localhost kernel: CMDFIFO_RDATA_0: 24044a17
Feb 02 19:55:32 localhost kernel: CMDP_OFFSET_0:   00000000
Feb 02 19:55:32 localhost kernel: CMDP_CLASS_0:    00000000
Feb 02 19:55:32 localhost kernel: CHANNELSTAT_0:   00000000
Feb 02 19:55:32 localhost kernel: The CDMA sync queue is empty.
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 
                                  channel 3 - 15830000.se
Feb 02 19:55:32 localhost kernel: NvHost basic channel registers:
Feb 02 19:55:32 localhost kernel: CMDFIFO_STAT_0:  00002040
Feb 02 19:55:32 localhost kernel: CMDFIFO_RDATA_0: 18810001
Feb 02 19:55:32 localhost kernel: CMDP_OFFSET_0:   00000000
Feb 02 19:55:32 localhost kernel: CMDP_CLASS_0:    00000000
Feb 02 19:55:32 localhost kernel: CHANNELSTAT_0:   00000000
Feb 02 19:55:32 localhost kernel: The CDMA sync queue is empty.
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 
                                  channel 4 - 15840000.se
Feb 02 19:55:32 localhost kernel: NvHost basic channel registers:
Feb 02 19:55:32 localhost kernel: CMDFIFO_STAT_0:  00002040
Feb 02 19:55:32 localhost kernel: CMDFIFO_RDATA_0: 80090014
Feb 02 19:55:32 localhost kernel: CMDP_OFFSET_0:   00000000
Feb 02 19:55:32 localhost kernel: CMDP_CLASS_0:    00000000
Feb 02 19:55:32 localhost kernel: CHANNELSTAT_0:   00000000
Feb 02 19:55:32 localhost kernel: The CDMA sync queue is empty.
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 
                                  ---- host general irq ----
Feb 02 19:55:32 localhost kernel: sync_intc0mask = 0x00000001
Feb 02 19:55:32 localhost kernel: sync_intmask = 0x50000003
Feb 02 19:55:32 localhost kernel: 
                                  ---- host syncpt irq mask ----
Feb 02 19:55:32 localhost kernel: 
                                  ---- host syncpt irq status ----
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(0) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(1) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(2) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(3) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(4) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(5) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(6) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(7) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(8) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(9) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(10) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(11) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(12) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(13) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(14) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(15) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(16) = 0x00000000
Feb 02 19:55:32 localhost kernel: syncpt_thresh_cpu0_int_status(17) = 0x00000000
Feb 02 19:55:32 localhost kernel: gp10b pbdma 0: 
Feb 02 19:55:32 localhost kernel: id: 6 (tsg), next_id: 6 (tsg) chan status: invalid
Feb 02 19:55:32 localhost kernel: PBDMA_PUT: 0000000100393b18 PBDMA_GET: 0000000100393b04 GP_PUT: 0000070d GP_GET: 0000070c FETCH: 0000070d HEADER: 20100018
                                  HDR: 20022060 SHADOW0: 0026e2f0 SHADOW1: 00031601
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: gp10b eng 0: 
Feb 02 19:55:32 localhost kernel: id: 6 (tsg), next_id: 6 (tsg), ctx status: valid 
Feb 02 19:55:32 localhost kernel: faulted 
Feb 02 19:55:32 localhost kernel: busy 
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: gp10b eng 1: 
Feb 02 19:55:32 localhost kernel: id: 7 (tsg), next_id: 7 (tsg), ctx status: valid 
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 501-gp10b, pid 4565, refs 2, deterministic: 
Feb 02 19:55:32 localhost kernel: channel status:  in use idle not busy
Feb 02 19:55:32 localhost kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000100651330 GET: 0000000100651330 FETCH: 0000020100651330
                                  HEADER: 60400000 COUNT: 84000000
                                  SYNCPOINT 00000000 00000000 SEMAPHORE 00000001 00047fac 0000010e 00001004
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 502-gp10b, pid 4565, refs 2, deterministic: 
Feb 02 19:55:32 localhost kernel: channel status:  in use idle not busy
Feb 02 19:55:32 localhost kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000100550294 GET: 0000000100550294 FETCH: 0000020100550294
                                  HEADER: 60400000 COUNT: 84000000
                                  SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 503-gp10b, pid 4565, refs 2, deterministic: 
Feb 02 19:55:32 localhost kernel: channel status:  in use idle not busy
Feb 02 19:55:32 localhost kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000100450294 GET: 0000000100450294 FETCH: 0000020100450294
                                  HEADER: 60400000 COUNT: 84000000
                                  SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 504-gp10b, pid 4565, refs 3, deterministic: 
Feb 02 19:55:32 localhost kernel: channel status:  in use on_eng_pending busy
Feb 02 19:55:32 localhost kernel: RAMFC : TOP: 8000001f000e9468 PUT: 0000000100393b18 GET: 0000000100393b04 FETCH: 0000020100393b18
                                  HEADER: 20100018 COUNT: 04550002
                                  SYNCPOINT 00000000 00001701 SEMAPHORE 00000001 0002fff0 00004205 00080004
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 505-gp10b, pid 4565, refs 2, deterministic: 
Feb 02 19:55:32 localhost kernel: channel status:  in use pending busy
Feb 02 19:55:32 localhost kernel: RAMFC : TOP: 0000000000000000 PUT: 000000010026e604 GET: 000000010026e4d0 FETCH: 000002010026e920
                                  HEADER: 20110180 COUNT: 04550002
                                  SYNCPOINT 00000000 00000000 SEMAPHORE 00000001 00047fbc 00000112 00080004
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 506-gp10b, pid 4562, refs 2: 
Feb 02 19:55:32 localhost kernel: channel status:  in use idle not busy
Feb 02 19:55:32 localhost kernel: RAMFC : TOP: 8000001f001818c0 PUT: 0000001f001818c0 GET: 0000001f001818c0 FETCH: 0000001f001818c0
                                  HEADER: 60400000 COUNT: 80000000
                                  SYNCPOINT 00000000 00001501 SEMAPHORE 0000001e 00050aa0 00000241 00001004
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 507-gp10b, pid 4544, refs 2: 
Feb 02 19:55:32 localhost kernel: channel status:  in use idle not busy
Feb 02 19:55:32 localhost kernel: RAMFC : TOP: 8000001f000080c0 PUT: 0000001f000080c0 GET: 0000001f000080c0 FETCH: 0000001f000080c0
                                  HEADER: 60400000 COUNT: 80000000
                                  SYNCPOINT 00000000 00001401 SEMAPHORE 0000001e 00090aa0 00000000 00000002
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 508-gp10b, pid 4000, refs 2: 
Feb 02 19:55:32 localhost kernel: channel status:  in use idle not busy
Feb 02 19:55:32 localhost kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
                                  HEADER: 60400000 COUNT: 00000000
                                  SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 509-gp10b, pid 4000, refs 2: 
Feb 02 19:55:32 localhost kernel: channel status:  in use idle not busy
Feb 02 19:55:32 localhost kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
                                  HEADER: 60400000 COUNT: 00000000
                                  SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 510-gp10b, pid 4000, refs 2: 
Feb 02 19:55:32 localhost kernel: channel status:  in use idle not busy
Feb 02 19:55:32 localhost kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
                                  HEADER: 60400000 COUNT: 00000000
                                  SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: 511-gp10b, pid 4000, refs 2: 
Feb 02 19:55:32 localhost kernel: channel status:  in use idle not busy
Feb 02 19:55:32 localhost kernel: RAMFC : TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
                                  HEADER: 60400000 COUNT: 00000000
                                  SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000
Feb 02 19:55:32 localhost kernel: 
Feb 02 19:55:32 localhost kernel: nvgpu: 17000000.gp10b gk20a_fifo_handle_mmu_fault_locked:1722 [ERR]   mmu fault on engine 0, engine subid 0 (gpc), client 1 (t1 0), addr 0xd607cd000, type 0 (pde), access_type 0x00000000,inst_ptr 0x1ffeffb000
Feb 02 19:55:32 localhost kernel: nvgpu: 17000000.gp10b  gk20a_fifo_set_ctx_mmu_error_tsg:1542 [ERR]  TSG 6 generated a mmu fault
Feb 02 19:55:32 localhost kernel: nvgpu: 17000000.gp10b   gk20a_fifo_set_ctx_mmu_error_ch:1531 [ERR]  channel 505 generated a mmu fault
Feb 02 19:55:32 localhost kernel: nvgpu: 17000000.gp10b   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 31 for ch 505
Feb 02 19:55:32 localhost kernel: nvgpu: 17000000.gp10b   gk20a_fifo_set_ctx_mmu_error_ch:1531 [ERR]  channel 504 generated a mmu fault
Feb 02 19:55:32 localhost kernel: nvgpu: 17000000.gp10b   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 31 for ch 504
Feb 02 19:55:32 localhost kernel: nvgpu: 17000000.gp10b   gk20a_fifo_set_ctx_mmu_error_ch:1531 [ERR]  channel 503 generated a mmu fault
Feb 02 19:55:32 localhost kernel: nvgpu: 17000000.gp10b   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 31 for ch 503
Feb 02 19:55:32 localhost cuda_app[4544]: cuda_src.cu:2355: unspecified launch failure err 719
Feb 02 19:55:32 localhost kernel: nvgpu: 17000000.gp10b   gk20a_fifo_set_ctx_mmu_error_ch:1531 [ERR]  channel 502 generated a mmu fault
Feb 02 19:55:32 localhost kernel: nvgpu: 17000000.gp10b   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 31 for ch 502
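
For anyone trying to read a dump like this, here is a minimal Python sketch that pulls the fields out of the nvgpu fault lines and maps the faulted channels back to the pids that own them. The regular expressions are assumptions based only on the line format seen in the log above, not on any documented nvgpu output format, and the name parse_fault_log is made up for illustration. As far as I know, TSG stands for time slice group, a set of channels that nvgpu schedules together, which would explain why several channels get flagged for a single fault.

```python
import re

# Patterns inferred from the log lines in this post (an assumption, not a
# documented nvgpu format); they may need adjusting for other L4T releases.
FAULT_RE = re.compile(
    r"mmu fault on engine (?P<engine>\d+), "
    r"engine subid (?P<subid>\d+) \((?P<subid_name>\w+)\), "
    r"client (?P<client>\d+) \((?P<client_name>[^)]*)\), "
    r"addr (?P<addr>0x[0-9a-fA-F]+), "
    r"type (?P<type>\d+) \((?P<type_name>\w+)\), "
    r"access_type (?P<access_type>0x[0-9a-fA-F]+),\s*"
    r"inst_ptr (?P<inst_ptr>0x[0-9a-fA-F]+)"
)
TSG_RE = re.compile(r"TSG (\d+) generated a mmu fault")
CH_FAULT_RE = re.compile(r"channel (\d+) generated a mmu fault")
CH_OWNER_RE = re.compile(r"(\d+)-gp10b, pid (\d+)")


def parse_fault_log(text):
    """Collect fault fields, faulted TSG/channel ids, and channel -> pid owners."""
    summary = {"fault": None, "tsgs": [], "channels": [], "channel_pid": {}}
    for line in text.splitlines():
        m = FAULT_RE.search(line)
        if m:
            summary["fault"] = m.groupdict()
        m = TSG_RE.search(line)
        if m:
            summary["tsgs"].append(int(m.group(1)))
        m = CH_FAULT_RE.search(line)
        if m:
            summary["channels"].append(int(m.group(1)))
        m = CH_OWNER_RE.search(line)
        if m:
            summary["channel_pid"][int(m.group(1))] = int(m.group(2))
    return summary


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
Feb 02 19:55:32 localhost kernel: 504-gp10b, pid 4565, refs 3, deterministic:
Feb 02 19:55:32 localhost kernel: 505-gp10b, pid 4565, refs 2, deterministic:
Feb 02 19:55:32 localhost kernel: nvgpu: 17000000.gp10b gk20a_fifo_handle_mmu_fault_locked:1722 [ERR]   mmu fault on engine 0, engine subid 0 (gpc), client 1 (t1 0), addr 0xd607cd000, type 0 (pde), access_type 0x00000000,inst_ptr 0x1ffeffb000
Feb 02 19:55:32 localhost kernel: nvgpu: 17000000.gp10b  gk20a_fifo_set_ctx_mmu_error_tsg:1542 [ERR]  TSG 6 generated a mmu fault
Feb 02 19:55:32 localhost kernel: nvgpu: 17000000.gp10b   gk20a_fifo_set_ctx_mmu_error_ch:1531 [ERR]  channel 505 generated a mmu fault
Feb 02 19:55:32 localhost kernel: nvgpu: 17000000.gp10b   gk20a_fifo_set_ctx_mmu_error_ch:1531 [ERR]  channel 504 generated a mmu fault
"""

result = parse_fault_log(SAMPLE)
print("fault address:", result["fault"]["addr"])
print("fault type:", result["fault"]["type_name"])
for ch in result["channels"]:
    # Channels without an owner line in the dump are reported as None.
    print("channel", ch, "owned by pid", result["channel_pid"].get(ch))
```

Running this over the full journal output from several incidents makes it easier to compare the faulting address, fault type, and owning process across occurrences, for example to check whether the fault always lands in the same process.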

Hi,
Please check whether running sudo nvpmodel -m 0 and sudo jetson_clocks helps. After these commands the system runs at maximum performance.

Also, please try running the VIC engine at its maximum clock:
Nvvideoconvert issue, nvvideoconvert in DS4 is better than Ds5? - #3 by DaneLLL
If your application calls NvBufferTransform(), this should give that call maximum performance and might help with a possible race condition.
