V4L2 capture jitter problem

We’ve got our Jetson Xavier-based platform up and running with our custom GMSL2 cameras. We are successfully capturing image data from the cameras at 10 fps via V4L2 buffers, using an isolated CPU dedicated to image capture. Timing analysis of the captures, however, shows intermittent jitter on the order of 2 ms. We have a critical processing-time budget for this project, and the 2 ms jitter is causing us problems.

We’re looking for guidance on what might be inducing this and how we can best address it. Please read below for details.

In terms of setup, we’ve got jetson_clocks configured and are running nvpmodel 0 (MAXN):

nvidia@tegra-ubuntu:~/target$ sudo jetson_clocks --show
SOC family:tegra194  Machine:jetson-xavier
Online CPUs: 0-7
CPU Cluster Switching: Disabled
cpu0: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0 
cpu1: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0 
cpu2: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0 
cpu3: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0 
cpu4: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0 
cpu5: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0 
cpu6: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0 
cpu7: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0 
GPU MinFreq=1377000000 MaxFreq=1377000000 CurrentFreq=1377000000
EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=2133000000 FreqOverride=1
Fan: speed=255
NV Power Mode: MAXN

We’ve got high-resolution kernel timers in place that generate a sync signal to the cameras for image capture. Neither the sync-signal generation nor the SOF/EOF timing reported in the trace log exhibits any jitter.

Using ftrace (via the trace-cmd tool), we can see that the delta from “tegra_channel_capture_frame: eof:” to “vb2_dqbuf:” averages ~311us. When the jitter manifests, this delta can reach ~2.5ms. Over a 10000-frame capture, we have seen anywhere from 10 to 30 occurrences of these extended deltas.

We’ve dug into the ftrace data and compared a ‘normal’ (~245us) capture delta against a jitter (~2.45ms) capture delta. Below are some ftrace snippets showing very large execution deltas when a jitter issue was detected:

Snippet #1:

vi-outpu-18382   7...1  5932.813431: function:             vb2_buffer_done
vi-outpu-18382   7...1  5932.813432: function:                vb2_dc_finish
vi-outpu-18382   7...1  5932.813432: function:                   __iommu_sync_sg_for_cpu
vi-outpu-18382   7...1  5932.813432: function:                      __iommu_sync_sg_for_cpu.part.12
# the delta for the next line is 5932.814313 - 5932.813432 = 881us! 
vi-outpu-18382   7d..1  5932.814313: function:             __handle_domain_irq 
vi-outpu-18382   7d..1  5932.814316: function:                irq_enter
vi-outpu-18382   7d..1  5932.814316: function:                   rcu_irq_enter

Snippet #2:

vi-outpu-18382   7d.s1  5932.814385: function:             irqtime_account_irq
vi-outpu-18382   7d.s1  5932.814385: function:                irqtime_account_delta.isra.0
vi-outpu-18382   7d.s1  5932.814386: function:             __local_bh_enable
vi-outpu-18382   7d..1  5932.814386: function:             idle_cpu
vi-outpu-18382   7d..1  5932.814386: function:             rcu_irq_exit
# the delta for the next line is 5932.815317 - 5932.814386 = 931us!
vi-outpu-18382   7d..1  5932.815317: function:             __handle_domain_irq
vi-outpu-18382   7d..1  5932.815318: function:                irq_enter
vi-outpu-18382   7d..1  5932.815318: function:                   rcu_irq_enter
vi-outpu-18382   7d..1  5932.815318: function:                   irqtime_account_irq
vi-outpu-18382   7d.h1  5932.815318: function:                irq_find_mapping
vi-outpu-18382   7d.h1  5932.815319: function:                generic_handle_irq
vi-outpu-18382   7d.h1  5932.815319: function:                   handle_percpu_devid_irq
vi-outpu-18382   7d.h.  5932.815320: irq_handler_entry:    irq=2 name=arch_timer

Snippet #3:

vi-outpu-18382   7d.s1  5932.815378: function:             irqtime_account_irq
vi-outpu-18382   7d.s1  5932.815378: function:                irqtime_account_delta.isra.0
vi-outpu-18382   7d.s1  5932.815378: function:             __local_bh_enable
vi-outpu-18382   7d..1  5932.815379: function:             idle_cpu
vi-outpu-18382   7d..1  5932.815379: function:             rcu_irq_exit
# the delta for the next line is 5932.815777 - 5932.815379 = 398us!
vi-outpu-18382   7...1  5932.815777: function:             _raw_spin_lock_irqsave
vi-outpu-18382   7d..2  5932.815778: function:             _raw_spin_unlock_irqrestore
vi-outpu-18382   7....  5932.815783: vb2_buf_done:         owner = 0xffffffc3deaf7c80, queued = 10, owned_by_drv = 9, index = 1, type = 1, bytesused = 8294400, timestamp = 5932581240250

Summing the worst delta times from above gives: 881us + 931us + 398us = 2.21ms!

Are there any hints or guidance on what may be inducing this? At a guess, there might be some hardware-level operations going on (DMA, perhaps?) that are causing the delays seen in the ftrace. It could also be that we need to enable additional ftrace options in the kernel config to fill in what’s going on above. While ftrace is helping out greatly, it’s not painting the whole picture for us quite yet.

I’ve got a 600 frame capture trace that the snippets above were pulled from that I can provide. It is, however, too large for me to upload to the forum, or I’d just attach it here.

Thanks.

FYI–

Here’s the ftrace command that we’re using, if it helps:

sudo trace-cmd record -b 10000 -M 80 -p function -e all -o /tmp/trace.dat

This limits the trace to the isolated core, CPU7, that is doing the image capture.

Did you use v4l2-ctl to capture?

No, we are not using v4l2-ctl. We are using V4L2 APIs / ioctl to capture with a tool that was derived from the V4L2 API example found here: https://linuxtv.org/downloads/v4l-dvb-apis-new/userspace-api/v4l/capture.c.html

One additional data point – we are using the V4L2_MEMORY_USERPTR option in the capture tool.

Update:

I wrote a script that uses v4l2-ctl to do essentially what our capture tool does, with one important difference: I specified "--stream-mmap" instead of "--stream-user". (Specifying "--stream-user" initially failed for me; more on this below.) The script captured 600 samples of image data and then processed the ftrace output captured via trace-cmd. If the time between “tegra_channel_capture_frame: eof:” and “vb2_dqbuf:” was > 1000us, it terminated; otherwise, it started another 600-sample capture. The script also calculated each iteration’s min/max/avg times.

I let the script run for about an hour. It did not fail a single time. The most recent iteration’s min/max/avg was 70us/300us/97us. While watching it run I saw a max value of around 600us, but nothing exceeding 1000us.

Because we are using V4L2_MEMORY_USERPTR, I wanted to get v4l2-ctl working with --stream-user. I cloned the v4l-utils repo from https://git.linuxtv.org/v4l-utils.git and checked out the stable-1.20 branch, since “sudo apt install v4l-utils” only provides version 1.14.2-1. Even the latest v4l2-ctl did not work: --stream-user reports “VIDIOC_QBUF: Invalid argument”. It turns out there is a memory-alignment issue when you specify --stream-user. I corrected it by updating this section of code in common/v4l-helpers.h:

static inline int v4l_queue_alloc_bufs(struct v4l_fd *f,
		struct v4l_queue *q, unsigned from)
{
	unsigned b, p;

	if (q->memory != V4L2_MEMORY_USERPTR)
		return 0;
	for (b = from; b < v4l_queue_g_buffers(q); b++) {
		for (p = 0; p < v4l_queue_g_num_planes(q); p++) {
#if 0
			void *m = malloc(v4l_queue_g_length(q, p));

			if (m == NULL)
				return errno;
#else
			/* USERPTR buffers must be page-aligned:
			 * use posix_memalign() instead of malloc(). */
			void *m = NULL;
			int page_size = getpagesize();

			if (posix_memalign(&m, page_size, v4l_queue_g_length(q, p)))
				return -1;
			if (m == NULL)
				return -1;
#endif
			v4l_queue_s_userptr(q, b, p, m);
		}
	}
	return 0;
}

Now, running with this command in my script gives me a detectable error:

./v4l2-ctl -d /dev/video0 --set-fmt-video=width=1920,height=1080,pixelformat=RGB3 --stream-user=4 --stream-count=600 --stream-to=/dev/null

The jitter error I get is of the same magnitude as what I was seeing with our capture.c-derived tool.

I need to check with our applications team, but I don’t think we can use mmap as that causes other issues with our processing pipeline.

Any ideas what may be up with specifying V4L2_MEMORY_USERPTR and the jitter this is inducing? Is this pointing to a potential DMA issue?

Looking at the trace gathered from running v4l2-ctl shows a similar pattern to the trace I posted above:

vi-outpu-27650   7...1 18829.656975: function:             vb2_buffer_done
vi-outpu-27650   7...1 18829.656975: function:                vb2_dc_finish
vi-outpu-27650   7...1 18829.656975: function:                   __iommu_sync_sg_for_cpu
vi-outpu-27650   7...1 18829.656976: function:                      __iommu_sync_sg_for_cpu.part.12
# delta for the next line is 18829.659225 - 18829.656976 = 2249us!
vi-outpu-27650   7...1 18829.659225: function:                _raw_spin_lock_irqsave

There’s a significant time spike after __iommu_sync_sg_for_cpu.part.12 appears.

I performed some further analysis of the ftrace buffer: there are 448 occurrences of “__iommu_sync_sg_for_cpu.part.12”. Taking the timestamp delta between that function and the following one, I see an average of 178.25us. There are two occurrences in the trace where that delta was greater than 2200us. One is shown in my previous post; the other is:

vi-outpu-27650   7...1 18840.356571: function:             vb2_buffer_done
vi-outpu-27650   7...1 18840.356571: function:                vb2_dc_finish
vi-outpu-27650   7...1 18840.356571: function:                   __iommu_sync_sg_for_cpu
vi-outpu-27650   7...1 18840.356571: function:                      __iommu_sync_sg_for_cpu.part.12
# delta for the next line is 18840.358832 - 18840.356571 = 2261us!
vi-outpu-27650   7...1 18840.358832: function:                _raw_spin_lock_irqsave

Does this only happen with V4L2_MEMORY_USERPTR?

Yes, as described above, this seems to only happen with ‘V4L2_MEMORY_USERPTR’ (--stream-user option in v4l2-ctl). Specifying ‘V4L2_MEMORY_MMAP’ (--stream-mmap option in v4l2-ctl) does not seem to induce the jitter issue.

Why can’t you use V4L2_MEMORY_MMAP?

As mentioned above: “I need to check with our applications team, but I don’t think we can use mmap as that causes other issues with our processing pipeline.”

As I recall, we saw timing-delay problems later in the processing pipeline as a result of using MMAP. I still need to follow up on that.

What I’m hoping to get out of this thread is an explanation of what is causing this intermittent delay. Is it a DMA issue? Is it something else? Why does V4L2_MEMORY_USERPTR show this hit while V4L2_MEMORY_MMAP does not?

Are there any NVIDIA memory controller experts / designers available that can potentially weigh in?

This is still an open issue for us.

As an experiment to isolate the jitter, I updated dma-mapping.c in the kernel as follows:

static void __iommu_sync_sg_for_cpu(struct device *dev,
				    struct scatterlist *sgl, int nelems,
				    enum dma_data_direction dir)
{
#if 0	/* compiled out for the experiment */
	struct scatterlist *sg;
	int i;

	if (is_device_dma_coherent(dev))
		return;

	for_each_sg(sgl, sg, nelems, i)
		__dma_unmap_area_no_dsb(sg_virt(sg), sg->length, dir);
	dsb(sy);
#endif
}

static void __iommu_sync_sg_for_device(struct device *dev,
				       struct scatterlist *sgl, int nelems,
				       enum dma_data_direction dir)
{
#if 0	/* compiled out for the experiment */
	struct scatterlist *sg;
	int i;

	if (is_device_dma_coherent(dev))
		return;

	for_each_sg(sgl, sg, nelems, i)
		__dma_map_area_no_dsb(sg_virt(sg), sg->length, dir);
	dsb(sy);
#endif
}

I realize this almost certainly introduces coherency problems between the device and the CPU. However, the change completely eliminated the jitter. I enabled ftrace to monitor IOMMU events and see that my V4L2 buffer allocations generate quite a few map events. If I comment out the for_each_sg loop, the jitter is not present. If I enable the loop and comment out only the memory-barrier call dsb(sy);, the jitter returns.

I’m not certain whether the issue is the loop itself or the cache invalidation performed on each iteration of the loop. In other words, even with a smaller number of SG elements, is it the cache invalidation as a whole that causes the jitter? I’m wondering whether invalidating many small sections is equivalent in cost to invalidating one large section.

We ultimately allocate 4 V4L2 buffers, each sized to 1920x1080x4 bytes + 4096 bytes = 8298496 bytes, for a total of 33193984 bytes of memory. We do this using posix_memalign(). The IOMMU events show a large number of 4096-byte allocations, up to a few 4194304-byte allocations. FWIW, I ran my capture program several times and saw the number of IOMMU mapping events reach around 1000+.

Are there any ways to reduce the number of SG elements that the IOMMU ultimately allocates? Is there any way to determine the worst-case timing for invalidating 2026 cached pages (2026 pages x 4096 bytes/page = 8298496 bytes)? Are there any settings that can reduce the time it takes to invalidate the cache?
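For the record, the sizing arithmetic above can be sanity-checked with a small helper. The function names here (buffer_bytes, buffer_pages, alloc_capture_buffer) are illustrative, not code from our capture tool:

```c
#include <stdlib.h>

/* One V4L2 buffer as described above: a 1920x1080 frame at 4 bytes/pixel
 * plus one extra 4096-byte page. */
static size_t buffer_bytes(void)
{
	return 1920UL * 1080UL * 4UL + 4096UL;	/* 8298496 bytes */
}

/* Number of pages the sync-for-CPU path has to clean/invalidate. */
static size_t buffer_pages(size_t page_size)
{
	return (buffer_bytes() + page_size - 1) / page_size;
}

/* Page-aligned allocation, as our capture tool does it. */
static void *alloc_capture_buffer(void)
{
	void *m = NULL;

	if (posix_memalign(&m, 4096, buffer_bytes()) != 0)
		return NULL;
	return m;
}
```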

An MMAP’d configuration apparently hits these same functions, but the is_device_dma_coherent(dev) check evaluates to true and short-circuits the function altogether. That is apparently why we don’t see the jitter there.

A probable solution is to sort and coalesce (“club”) the pages at the time the scatter-gather list is created, so that fewer “nelems” are produced and fewer cache lines need to be cleaned and invalidated.
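If I understand the suggestion, it amounts to something like the following userspace model: sort the extents by address and merge physically contiguous neighbours so the sync loop walks fewer, larger elements. The kernel version would operate on struct scatterlist; this is only a sketch with names of our own choosing:

```c
#include <stdlib.h>

/* A page-granular extent, standing in for one scatterlist element. */
struct extent {
	unsigned long addr;
	unsigned long len;
};

static int cmp_extent(const void *a, const void *b)
{
	const struct extent *x = a, *y = b;

	return (x->addr > y->addr) - (x->addr < y->addr);
}

/* Sort the extents, then merge adjacent ones that are physically
 * contiguous. Returns the new (smaller or equal) element count. */
static int coalesce(struct extent *e, int n)
{
	int i, out = 0;

	if (n == 0)
		return 0;
	qsort(e, n, sizeof(*e), cmp_extent);
	for (i = 1; i < n; i++) {
		if (e[out].addr + e[out].len == e[i].addr)
			e[out].len += e[i].len;		/* contiguous: club together */
		else
			e[++out] = e[i];
	}
	return out + 1;
}
```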

I added timing stats to the __iommu_sync_sg_for_cpu() function. Whether an SG list is 11 nelems long or 300+ nelems long, I see the jitter occurring. So I do not think it is related to the number of elements; rather, it’s the cache invalidation itself that incurs the time penalty.

Here’s a snippet that outputs stats whenever the execution time of a single __dma_unmap_area_no_dsb call exceeded 1ms. It reports the min/max/avg time of that call, followed by the overall time consumed by the for_each_sg() loop:

[  110.061058] __iommu_sync_sg_for_cpu():1052: __iommu_sync_sg_for_cpu:1052: nelems: 11, dir: 2
[  110.061068] __iommu_sync_sg_for_cpu():1054: __iommu_sync_sg_for_cpu:1054: min=416 ns, max=1193707 ns, avg=179856 ns
[  110.061072] __iommu_sync_sg_for_cpu():1058: __iommu_sync_sg_for_cpu:1058: overall delta: 0.001979346 s
[  110.461293] __iommu_sync_sg_for_cpu():1052: __iommu_sync_sg_for_cpu:1052: nelems: 11, dir: 2
[  110.461305] __iommu_sync_sg_for_cpu():1054: __iommu_sync_sg_for_cpu:1054: min=512 ns, max=2000850 ns, avg=201235 ns
[  110.461309] __iommu_sync_sg_for_cpu():1058: __iommu_sync_sg_for_cpu:1058: overall delta: 0.002214708 s
[  129.961258] __iommu_sync_sg_for_cpu():1052: __iommu_sync_sg_for_cpu:1052: nelems: 23, dir: 2
[  129.961316] __iommu_sync_sg_for_cpu():1054: __iommu_sync_sg_for_cpu:1054: min=1312 ns, max=1101059 ns, avg=96510 ns
[  129.961329] __iommu_sync_sg_for_cpu():1058: __iommu_sync_sg_for_cpu:1058: overall delta: 0.002225190 s
[  195.759023] __iommu_sync_sg_for_cpu():1052: __iommu_sync_sg_for_cpu:1052: nelems: 309, dir: 2
[  195.759032] __iommu_sync_sg_for_cpu():1054: __iommu_sync_sg_for_cpu:1054: min=160 ns, max=1095188 ns, avg=7626 ns
[  195.759034] __iommu_sync_sg_for_cpu():1058: __iommu_sync_sg_for_cpu:1058: overall delta: 0.002369611 s

I did an experiment where I performed the allocations in our capture program using cudaMallocManaged(). I monitored the IOMMU operations and the jitter: both the IOMMU events and the 2ms jitter were similar to memory allocated via posix_memalign(). Memory allocated via cudaMallocManaged() is not DMA-coherent and hits the same sections of the __iommu_sync_sg_for_cpu() function as posix_memalign() memory does.

Could you try this patch to check if it helps:

diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 94aa049..3d39662 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -996,6 +996,7 @@ static void __iommu_sync_sg_for_cpu(struct device *dev,
                                    struct scatterlist *sgl, int nelems,
                                    enum dma_data_direction dir)
 {
+#if 0
        struct scatterlist *sg;
        int i;

@@ -1005,12 +1006,15 @@ static void __iommu_sync_sg_for_cpu(struct device *dev,
        for_each_sg(sgl, sg, nelems, i)
                __dma_unmap_area_no_dsb(sg_virt(sg), sg->length, dir);
        dsb(sy);
+#endif
+       __flush_dcache_all();
 }

 static void __iommu_sync_sg_for_device(struct device *dev,
                                       struct scatterlist *sgl, int nelems,
                                       enum dma_data_direction dir)
 {
+#if 0
        struct scatterlist *sg;
        int i;

@@ -1020,6 +1024,8 @@ static void __iommu_sync_sg_for_device(struct device *dev,
        for_each_sg(sgl, sg, nelems, i)
                __dma_map_area_no_dsb(sg_virt(sg), sg->length, dir);
        dsb(sy);
+#endif
+       __flush_dcache_all();
 }