cudaMemcpyAsync low throughput

So I’m working on a live video stitching gstreamer pipeline on TX2, and having performance issue.

DISPLAY=:0 gst-launch-1.0 \
v4l2src device=/dev/video0 ! 'video/x-raw,width=1920,height=1080,format=UYVY,framerate=30/1' ! queue ! st. \
v4l2src device=/dev/video1 ! 'video/x-raw,width=1920,height=1080,format=UYVY,framerate=30/1' ! queue ! st. \
v4l2src device=/dev/video2 ! 'video/x-raw,width=1920,height=1080,format=UYVY,framerate=30/1' ! queue ! st. \
v4l2src device=/dev/video3 ! 'video/x-raw,width=1920,height=1080,format=UYVY,framerate=30/1' ! queue ! st. \
videostitcher ! \
'video/x-raw, width=(int)3840, height=(int)1920, format=(string)I420, framerate=(fraction)30/1' ! \
nvoverlaysink sync=false

videostitcher is an element that uses cuda to stitch multiple inputs. It uses cudaMemcpyAsync and streams for concurrency. The problem here is that cudaMemcpyAsync has very low throughput, 400~700MB/s (from nvprof and nv visual profiler). Then I use oprofile to see what happend, and found that memcpy takes most of the cpu time:

CPU: ARM Cortex-A57, speed 2035.2 MHz (estimated)
Counted CPU_CYCLES events (Cycle) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        linenr info                 image name               symbol name
160549   67.0779  memcpy.S:50                 libc-2.23.so             memcpy
72433    30.2628  (no location information)   no-vmlinux               /no-vmlinux
2525      1.0550  (no location information)   libcuda.so.1.1           /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1
344       0.1437  (no location information)   libgobject-2.0.so.0.4800.2 /usr/lib/aarch64-linux-gnu/libgobject-2.0.so.0.4800.2
281       0.1174  (no location information)   libglib-2.0.so.0.4800.2  /lib/aarch64-linux-gnu/libglib-2.0.so.0.4800.2

Then I gdb it a bit to see the callstack, and it seems all the memcpy is from inside cudaMemcpyAsync. My question is on what circumstance cudaMemcpyAsync will use memcpy so heavily? How to avoid this.

PS.
We uses Toshiba TC358743 to capture video. Seems the src element has a lot to do with this. Because if I replace all the v4l2src with videotestsrc, this heavy use of memcpy disappear (below).

DISPLAY=:0 gst-launch-1.0 \
videotestsrc  pattern=2 ! 'video/x-raw, width=(int)1920, height=(int)1080, format=(string)UYVY, framerate=(fraction)30/1' ! queue ! st. \
videotestsrc  pattern=2 ! 'video/x-raw, width=(int)1920, height=(int)1080, format=(string)UYVY, framerate=(fraction)30/1' ! queue ! st. \
videotestsrc  pattern=2 ! 'video/x-raw, width=(int)1920, height=(int)1080, format=(string)UYVY, framerate=(fraction)30/1' ! queue ! st. \
videotestsrc  pattern=2 ! 'video/x-raw, width=(int)1920, height=(int)1080, format=(string)UYVY, framerate=(fraction)30/1' ! queue ! st. \
videostitcher ! \
'video/x-raw, width=(int)3840, height=(int)1920, format=(string)I420, framerate=(fraction)30/1' ! \
nvoverlaysink sync=false
CPU: ARM Cortex-A57, speed 2035.2 MHz (estimated)
Counted CPU_CYCLES events (Cycle) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        linenr info                 image name               symbol name
107261   24.9035  tmp-orc.c:2194              libgstvideo-1.0.so.0.1203.0 video_orc_pack_UYVY
99342    23.0649  video-chroma.c:691          libgstvideo-1.0.so.0.1203.0 video_chroma_down_h2_cs_u8
98056    22.7663  memcpy.S:50                 libc-2.23.so             memcpy
63740    14.7989  (no location information)   no-vmlinux               /no-vmlinux
49138    11.4087  tmp-orc.c:159               libgstvideotestsrc.so    video_test_src_orc_splat_u32
5098      1.1836  (no location information)   libcuda.so.1.1           /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1

Hi,

Due to some hardware issue, the parallelism of asynchronous engine on Jetson is limited.
As a result, the memcpy will be executed sequentially and may cause the low throughput you mentioned here.

Thanks.

hi AastaLLL,

What is this limitation in detail? What should we do on the software end to avoid this?

Please share some docs or pointers.

Formerly, we tested TX2 capturing video with CSI-2 bayer sensor, which has no such memcpy issue. Considering the data path inside TX2 is the same between capturing with a bayer sensor and a TC358743 (CSI-2 & VI), it’s reasonable to think that we can avoid this issue just like that, right?

Thanks.

Hi,

Due to different hardware architecture from desktop GPU, there are some CUDA function is not supported on Jetson.
You can find detail in our document: https://docs.nvidia.com/cuda/pdf/NVIDIA_CUDA_for_Tegra_Application_Note.pdf

Thanks.

hi AastaLLL,

This is very helpful, thanks.

However, there’s something I can’t find the answer from these doc. We have 2 demo systems running the same application. Both systems use TX2, while one uses Magewell Pro-Capture card, and the other uses TC358743. The buffers used in capture (before feeded to cuda) are both pagable. They showed different throughput when memcpy H2D.

In the picture, left side is from the system with Magewell card, right side is from the one with TC358743.

I wonder if there’s more properties than Pagable/Unpagable that eluded me, that is making the throughput difference. What is that?

Hi,

Suppose that Magewell is a 4K capture card while TC358743 only supports 1080p.

Could you confirm this with the spec of your capture card?
If yes, the bandwidth of Magewell should be four times larger than the TC358743.

Thanks.

The capture card is Pro Capture Quad HDMI:
http://www.magewell.com/pro-capture-quad-hdmi

From the website, each port can capture resolution up to 2048x2160, and it has 4 ports.

Our system with TC358743, has 6 TC358743 connected to CSI-2 ports of TX2, each one uses 2 lanes.

I think both systems are capable of capturing 1080P30 videos. This can be confirmed by running a gstreamer pipeline below, and see if it’s 30fps (It is, for both systems).

DISPLAY=:0 gst-launch-1.0 \
v4l2src device=/dev/video0 ! 'video/x-raw, width=1920, height=1080, format=UYVY, framerate=30/1' ! \
nvvidconv ! 'video/x-raw(memory:NVMM), width=320, height=240, format=I420, framerate=30/1' ! \
nvvidconv ! 'video/x-raw, width=320, height=240, format=I420, framerate=30/1' ! mix.sink_0 \
v4l2src device=/dev/video1 ! 'video/x-raw, width=1920, height=1080, format=UYVY, framerate=30/1' ! \
nvvidconv ! 'video/x-raw(memory:NVMM), width=320, height=240, format=I420, framerate=30/1' ! \
nvvidconv ! 'video/x-raw, width=320, height=240, format=I420, framerate=30/1' ! mix.sink_1 \
v4l2src device=/dev/video2 ! 'video/x-raw, width=1920, height=1080, format=UYVY, framerate=30/1' ! \
nvvidconv ! 'video/x-raw(memory:NVMM), width=320, height=240, format=I420, framerate=30/1' ! \
nvvidconv ! 'video/x-raw, width=320, height=240, format=I420, framerate=30/1' ! mix.sink_2 \
v4l2src device=/dev/video3 ! 'video/x-raw, width=1920, height=1080, format=UYVY, framerate=30/1' ! \
nvvidconv ! 'video/x-raw(memory:NVMM), width=320, height=240, format=I420, framerate=30/1' ! \
nvvidconv ! 'video/x-raw, width=320, height=240, format=I420, framerate=30/1' ! mix.sink_3 \
videomixer name=mix \
sink_0::xpos=0   sink_0::ypos=0 \
sink_1::xpos=320 sink_1::ypos=0 \
sink_2::xpos=640 sink_2::ypos=0 \
sink_3::xpos=0   sink_3::ypos=240 \
sink_4::xpos=320 sink_4::ypos=240 \
sink_5::xpos=640 sink_5::ypos=240 \
! queue ! nvoverlaysink sync=false

The bottleneck appears only when I add the cuda accelerated component, which use cudaMemcpyAsync to do the H2D copying.

Hi,

Thanks for your information.
We will discuss this issue with our internal team and let you know the results.

Thanks.

Hi,

There are three possible input types of v4l2.
Could you help to check the input type of videotestsrc / Magewell / TC358743?

Thanks.

I’m not sure what do you mean by “three possible input types of v4l2”.

Do you mean the 3 types of io-mode of v4l2 (RW/MMAP/USERPTR)?

If so, I can confirm that I used MMAP, because this seems to be the only and default mode available for a gst-launch test. When setting to other mode, the gstreamer will throw some error.

As for using videotestsrc instead of v4l2src, throughput is much higher. A H2D async copy’s throughput is 2~5GB/s

Hi xliu,

Formerly, we tested TX2 capturing video with CSI-2 bayer sensor, which has no such memcpy issue. Considering the data path inside TX2 is the same between capturing with a bayer sensor and a TC358743 (CSI-2 & VI), it’s reasonable to think that we can avoid this issue just like that, right?
We have 2 demo systems running the same application. Both systems use TX2, while one uses Magewell Pro-Capture card, and the other uses TC358743. The buffers used in capture (before feeded to cuda) are both pagable. They showed different throughput when memcpy H2D.

If executing the same gstreamer pipeline, with only one v4l2src device, on the three systems, how is the throughput comparison? Thanks!

hi Vickyy,

We don’t have the TX2-bayer_sensor system at this time. So I’ll only provide test results for the later 2. The pipeline command:

DISPLAY=:0 \
nvprof -o csi.nvprof \
gst-launch-1.0 -e \
v4l2src device=/dev/video0 ! 'video/x-raw, width=(int)1920, height=(int)1080, format=(string)UYVY, framerate=(fraction)30/1' ! queue ! st. \
videostitcher name=st map=./stitch_erect.1.map pts=./template.1.pts width=3840 height=1920 ! \
'video/x-raw, width=(int)3840, height=(int)1920, format=(string)I420, framerate=(fraction)30/1' ! \
nvoverlaysink sync=false

For both systems, I ran jetson_clocks.sh to fixate the max performance before running the pipeline. You can see the H2D throughput difference is still there in the nvprof files.

Also, I tested tegra_multimedia_api/samples/v4l2cuda on both systems, with MMAP as io mode, and the throughput difference again exists (this time 800MB/s vs 3GB/s).
pci.zip (84.9 KB)
csi.zip (106 KB)

just noticed “Pro Capture Quad HDMI” is via PCIe instead of CSI. suspect it’s because of different memory allocator operations used in v4l2 drivers of the two.

Besides clarify the issue, will “cuda_zero_copy” implementation in tegra_multimedia_api/samples/v4l2cuda be a better for you?

Yes, zero_copy would be great if I can use without changing the pipeline too much. Unfortunately, using userptr as io-mode of v4l2src is not as simple as it seems to be. In another thread (https://devtalk.nvidia.com/default/topic/1032167/?offset=10), this was discussed a bit, but without a solution. So, in order to do zero_copy, I have to give up using gstreamer at least for the capture and stitching part.

My plan now, is first to give a try to find the root cause of low throughput (with your help of course), then fix it and continue using gstreamer. If this prove to be impossible or too hard, I’ll then rewrite the capture and stitching part with zero_copy and hook it to the rest of the pipeline with appsrc.

As you mentioned “different memory allocator operations used in v4l2 drivers”, we didn’t meddle with that part of driver code. For the CSI system, we only changed the i2c driver that configurate TC358743, which has nothing to do with frame buffer allocation. For the PCIe system, we installed the driver package downloaded from Magewell’s website http://www.magewell.com/files/drivers/ProCaptureForLinuxTX2_3773.tar.gz

Does source memory is cacheable or not have direct impact on your cudaMemcpyAsync() throughput?
CSI one will call dma_alloc_coherent() and which seems uncacheable.

I don’t know if the source mem is cacheable. I think it’s ultimately decided by what cudaMemcpyAsync does. If it’s a DMA transfer, then being cacheable or non-cacheable should not matter.

I’m reading the relavent driver code of both systems, and trying to understand. Just to be aligned with you, is it vi4_fops.c for the CSI one?

Yes, the driver is under https://nv-tegra.nvidia.com/gitweb/?p=linux-4.4.git;a=tree;f=drivers/media/platform/tegra/camera/vi;hb=l4t/l4t-r28.2 .

But if the throughput difference is caused by different memory allocator operations in the v4l2 drivers, I’m still curious why you didn’t be aware of the issue with your “CSI-2 bayer sensor”.