nvvidconv x-raw(memory:NVMM) to x-raw conversion performance

I’m curious if a performance bottleneck we are seeing with nvvidconv is expected or not.

We have six cameras and are running six gstreamer pipelines in our application that look like:

nvcamerasrc sensor-id=X fpsRange="30 30" ! "video/x-raw(memory:NVMM), width=(int)1280, height=(int)1080, format=(string)I420, framerate=(fraction)30/1" ! nvvidconv ! "video/x-raw, width=(int)1280, height=(int)1080, format=(string)I420, framerate=(fraction)30/1" ! appsink

With all six cameras running, we are only getting ~20-21 fps. That’s with the appsink callback doing nothing but pulling samples. However, with a pipeline that stays in NVMM memory (for example recording with omxh264enc), all six cameras are able to stream at 30fps.
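For reference, “pulling samples” here means an appsink new-sample callback that is roughly equivalent to this minimal sketch (frame counting omitted; names are illustrative):

#include <gst/gst.h>
#include <gst/app/gstappsink.h>

/* Minimal new-sample callback: pull the sample and release it immediately.
 * This is roughly all the measurement does per frame. */
static GstFlowReturn on_new_sample(GstAppSink *sink, gpointer user_data)
{
    GstSample *sample = gst_app_sink_pull_sample(sink);
    if (!sample)
        return GST_FLOW_ERROR;

    /* fps bookkeeping would go here */

    gst_sample_unref(sample);
    return GST_FLOW_OK;
}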

Interestingly, if I have nvvidconv resize the images to half size:

nvcamerasrc sensor-id=X fpsRange="30 30" ! "video/x-raw(memory:NVMM), width=(int)1280, height=(int)1080, format=(string)I420, framerate=(fraction)30/1" ! nvvidconv ! "video/x-raw, width=(int)640, height=(int)540, format=(string)I420, framerate=(fraction)30/1" ! appsink

Then our application receives frames at 30fps in the appsink callbacks. Running only 4 cameras instead of all 6 also results in 30fps (at full res) in the appsink callbacks. This seems to point to overhead in the NVMM-to-CPU buffer conversion.

Should nvvidconv be able to convert six 1280x1080 streams at 30fps? If not, is there an alternative method we should try?

Hi,
You should get better performance by pulling video/x-raw(memory:NVMM) buffers in appsink. Please check the sample at
https://devtalk.nvidia.com/default/topic/1037450/jetson-tx2/use-gstreamer-or-tegra_multimedia_api-to-decode-video-would-be-more-efficient-and-increase-throughpu-/post/5270860/#5270860
Please note that calling NvReleaseFd() is not required on r32.1

Hi DaneLLL. Thanks for the reply.

To clarify a bit, our application is not currently using the Multimedia API, hence the need to have nvvidconv in the pipeline to do the conversion to x-raw so that we can access CPU buffers in the callback. If I modify the gstreamer pipeline so that x-raw(memory:NVMM) is passed to the appsink, we do indeed get 30fps for all six cameras at full resolution. However, we can’t read those samples since they are not CPU accessible.

Based on your feedback, it seems that we should use the Multimedia API for better performance. From looking at the sample code and documentation, we could use either:

Option 1: ExtractFdFromNvBuffer → NvBufferMemMap → NvBufferMemSyncForCpu → memcpy_to_our_cpu_buffer → NvBufferMemUnMap
Option 2: ExtractFdFromNvBuffer → NvBuffer2Raw
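For the first option, here is roughly what I plan to try in the appsink callback. This is only a sketch: it assumes the appsink caps stay in video/x-raw(memory:NVMM), maps plane 0 only, and omits error handling.

#include <gst/gst.h>
#include <gst/app/gstappsink.h>
#include "nvbuf_utils.h"

static GstFlowReturn on_new_sample(GstAppSink *sink, gpointer user_data)
{
    GstSample *sample = gst_app_sink_pull_sample(sink);
    GstBuffer *buffer = gst_sample_get_buffer(sample);
    GstMapInfo map;
    gst_buffer_map(buffer, &map, GST_MAP_READ);

    /* The mapped data of an NVMM buffer holds the NvBuffer handle,
     * from which we extract the dmabuf fd. */
    int dmabuf_fd = -1;
    ExtractFdFromNvBuffer((void *)map.data, &dmabuf_fd);

    /* Map plane 0 for CPU reads and make sure the cache is coherent. */
    void *virt = NULL;
    NvBufferMemMap(dmabuf_fd, 0, NvBufferMem_Read, &virt);
    NvBufferMemSyncForCpu(dmabuf_fd, 0, &virt);

    /* memcpy_to_our_cpu_buffer(virt, ...) or process 'virt' in place */

    NvBufferMemUnMap(dmabuf_fd, 0, &virt);

    gst_buffer_unmap(buffer, &map);
    gst_sample_unref(sample);
    return GST_FLOW_OK;
}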

I’ll test this tomorrow.

Do you know how the above compares to what nvvidconv does internally? I guess the best option would be to use NvBufferMemMap and then use that pointer instead of adding the additional copy to our own buffer. Is it safe to keep that buffer mapped for a long period of time?

Hi,
Using NvBuffer, which is a DMA buffer, gives better performance. If you must execute memcpy_to_our_cpu_buffer to get frames into a CPU buffer, your original pipeline is the solution; one more thing you can try is running ‘sudo jetson_clocks.sh’.

We may not need to copy the data to our own buffer. If I use NvBufferMemMap on a sample in the appsink callback, is it safe to keep that mapping for a long period of time, or does it need to be unmapped before returning from the callback? In other words, if we don’t unmap those buffers, will it cause a problem for nvcamerasrc or other parts of the pipeline? Would it be better to create new buffers with createNvBuffer and copy into those for long-term storage? Thanks for your help.

Hi,
It should be fine to keep the buffer in the NvBufferMemMap state. After CPU processing, you have to call NvBufferMemSyncForDevice() or the buffer can get out of sync.

We would suggest allocating local buffers in appsink through NvBufferCreate(), copying the nvvidconv buffers through NvBufferTransform(), and returning the nvvidconv buffers directly.
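Something like the following rough sketch. The 1280x1080 I420 pitch-linear parameters are assumptions for your use case; the source dmabuf fd would come from ExtractFdFromNvBuffer() on the buffer pulled in appsink.

#include "nvbuf_utils.h"

/* Allocate one local buffer to own long-term. */
static int create_local_buffer(void)
{
    int local_fd = -1;
    NvBufferCreate(&local_fd, 1280, 1080,
                   NvBufferLayout_Pitch, NvBufferColorFormat_YUV420);
    return local_fd;
}

/* Hardware-accelerated copy from the pipeline's buffer into the local one,
 * so the pipeline's buffer can be returned immediately. */
static void copy_into_local_buffer(int src_dmabuf_fd, int local_fd)
{
    NvBufferTransformParams params = {0};
    params.transform_flag = NVBUFFER_TRANSFORM_FILTER;
    params.transform_filter = NvBufferTransform_Filter_Nearest;
    NvBufferTransform(src_dmabuf_fd, local_fd, &params);
}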

Did some testing, and found a few interesting things:

Passing maxperf=true to nvarguscamerasrc (32.1) makes a huge difference! What exactly does this option do? The only reference I can find to it in the user guide just says it will increase power consumption.

I tested five different gstreamer pipelines, with two different appsink handling paths:

A = ExtractFdFromNvBuffer → NvBufferMemMap → NvBufferMemSyncForCpu → memcpy_to_our_cpu_buffer
B = memcpy_to_our_cpu_buffer (data already in a CPU buffer)

0.) nvarguscamerasrc (1280x1080, nvmm, 30fps) → appsink (A)
1.) nvarguscamerasrc (640x540, nvmm, 30fps) → appsink (A)
2.) nvarguscamerasrc (1280x1080, nvmm, 30fps) → nvvidconv (1280x1080, x-raw) → appsink (B)
3.) nvarguscamerasrc (1280x1080, nvmm, 30fps) → nvvidconv (640x540, x-raw) → appsink (B)
4.) nvarguscamerasrc (640x540, nvmm, 30fps) → nvvidconv (640x540, x-raw) → appsink (B)

With maxperf=false and six cameras, the actual fps is:

0 → 25fps
1 → 30fps
2 → 15fps
3 → 15fps
4 → 30fps

With maxperf=true all the pipelines achieve 30fps. Why does this default to false? Was there an equivalent to maxperf for nvcamerasrc?
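For reference, maxperf is a property on the source element, so the full-resolution test pipeline with it enabled looks roughly like this (NVMM format caps omitted):

nvarguscamerasrc sensor-id=X maxperf=true ! "video/x-raw(memory:NVMM), width=(int)1280, height=(int)1080, framerate=(fraction)30/1" ! nvvidconv ! "video/x-raw, width=(int)1280, height=(int)1080, format=(string)I420, framerate=(fraction)30/1" ! appsink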

NvBuffer2Raw is much slower (3-4x) than doing ExtractFdFromNvBuffer → NvBufferMemMap → NvBufferMemSyncForCpu → memcpy. What exactly is the use case for NvBuffer2Raw?
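For comparison, the NvBuffer2Raw variant was along these lines (simplified; plane 0 and 1280x1080 are the luma-plane assumptions):

#include "nvbuf_utils.h"

/* Copy one plane of the NVMM dmabuf straight into a CPU buffer. */
static void copy_plane_with_nvbuffer2raw(void *nvmm_buf_data, unsigned char *dst)
{
    int dmabuf_fd = -1;
    ExtractFdFromNvBuffer(nvmm_buf_data, &dmabuf_fd);
    NvBuffer2Raw(dmabuf_fd, 0, 1280, 1080, dst);
}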

maxperf boosts the vi/csi/isp clocks, which helps for multi-camera use cases.
In your case, it could be that max_pixel_rate is too small or num_csi_lanes is not correct.

num_csi_lanes = <2>;
max_lane_speed = <1500000>;
min_bits_per_pixel = <10>;
vi_peak_byte_per_pixel = <2>;
vi_bw_margin_pct = <25>;
max_pixel_rate = <160000>;
isp_peak_byte_per_pixel = <5>;
isp_bw_margin_pct = <25>;

We’ve been transitioning to libargus instead of gstreamer, and I was curious whether there is a setting (or settings) similar to maxperf. In my tests so far, libargus achieves 30fps at full resolution with all six cameras. Are the csi/isp clocks boosted by default when using libargus? How exactly is nvarguscamerasrc controlling them? Via a public API?

The ISO bandwidth is calculated based on the values in #8 above.

Thanks for the reply. If my understanding is correct, the clock rates are calculated from the various configuration parameters listed above (num_csi_lanes, max_pixel_rate, etc…). The maxperf option just overrides that and boosts the vi/isp/csi clocks to their max rates? Basically this:

echo 1 > /sys/kernel/debug/bpmp/debug/clk/vi/mrq_rate_locked
echo 1 > /sys/kernel/debug/bpmp/debug/clk/isp/mrq_rate_locked
echo 1 > /sys/kernel/debug/bpmp/debug/clk/nvcsi/mrq_rate_locked
echo ${max_rate} > /sys/kernel/debug/bpmp/debug/clk/vi/rate
echo ${max_rate} > /sys/kernel/debug/bpmp/debug/clk/isp/rate
echo ${max_rate} > /sys/kernel/debug/bpmp/debug/clk/nvcsi/rate

So if full frame rate is achieved only with maxperf=true, it indicates some error in the configuration values in the dtb? Sorry, I’m not a hardware guy.

You can cat those values back with maxperf=true and maxperf=false to confirm it.
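For example:

cat /sys/kernel/debug/bpmp/debug/clk/vi/rate
cat /sys/kernel/debug/bpmp/debug/clk/isp/rate
cat /sys/kernel/debug/bpmp/debug/clk/nvcsi/rate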

I looked into this, and it’s actually the vic clk that’s changing when I set maxperf=true with nvarguscamerasrc. The isp/vi/nvcsi clks stay the same. The vic clk goes from 115200000 to 1024000000.

When using libargus, the vic clk is always 1024000000. isp/vi/nvcsi clks are the same as when using gstreamer + nvarguscamerasrc. How is the vic clk rate determined when using gstreamer + nvarguscamerasrc?