I’ve been running CUDA and NPP (NVIDIA Performance Primitives) on the Jetson TK1 for a few weeks now, and I’ve been seeing slight, but certainly not great, performance improvements.
For instance, the nppiYCrCb422ToYCbCr420_8u_C2P3R packed YUV (422) to planar YUV (420) conversion primitive yields only a 10% performance improvement over a naive C++ implementation compiled using G++ 4.8 and -O3.
The CUDA version reads from device memory and writes to host (pinned) memory, which I assume shouldn’t account for much performance degradation on the Tegra K1 architecture… are these numbers normal?
Thanks in advance,
You should be able to reach at least 50% utilization of the GPU with such a simple problem IMO.
By utilization I mean bandwidth utilization (this is definitely a bandwidth-bound kernel).
How to answer “are these numbers normal”:
How much data are you passing in?
-> YCbCr 4:2:2 at 720p is 2 bytes per pixel, so 1280 * 2 = 2560 bytes per row * 720 rows ≈ 1.84 MB, right?
How much data are you writing?
->I.e. how large is the output YCbCr420 ?
Compute [(data_in + data_out) / total_kernel_time] * 1e-9 => X GB/s.
Theoretical max on the TK1 is 14.7 GB/s; what is your X / 14.7?
About measuring time:
-> Make sure to use nvprof to achieve accurate timing on your kernel.
-> Run the kernel several times and average, to get accurate timing and avoid “warm-up” effects.
-> Set your GPU and memory clocks (gbus and emc) to max, or another preferred value, to avoid fluctuations (downclocking when idle).
-> Make sure you are not measuring initialization time (cudaMalloc(…), cudaFree(…), etc.); the first call to the CUDA API can often stall significantly.
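For the CPU-side baseline the same discipline applies. A minimal sketch of such a timing harness (hypothetical helper, not from the thread) could look like:

```cpp
#include <chrono>

// Minimal timing harness: a few discarded warm-up runs, then an
// average over many iterations. fn stands in for whatever routine
// you are benchmarking (e.g. the naive C++ conversion).
template <typename F>
double average_ms(F&& fn, int warmup = 5, int iters = 100) {
    for (int i = 0; i < warmup; ++i) fn();          // discard warm-up runs
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) fn();
    auto t1 = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> dt = t1 - t0;
    return dt.count() / iters;                      // mean ms per call
}
```

On the GPU side, nvprof (or cudaEvent timers) remains the right tool; this is only for timing the software implementation fairly.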
thanks for your comprehensive response. Here are more details about my scenario:
I’m running the pixel format conversion on 1280x720 images @ 60 Hz. That’s about 1.75 MB per image, and about 105 MB per second at 60 FPS.
The output format is I420, so writes are 1.31 MB per image, 79 MB per second at 60 FPS.
That amounts to a total data throughput of 105+79=184 MB per second.
Considering the 14.7 GB/s peak, the ~3.06 MB moved per conversion should take only about 0.22 ms at full bandwidth (roughly 4500 conversions per second). That makes the numbers even stranger to me, since nvprof reports an average time of 4.4 ms per ImageFormatConversionKernel_4XX_8u execution… please explain :(
Answering your questions:
- I’m effectively measuring kernel execution time using nvprof (great tool btw!)
- All measurements are taken after 5-10 seconds running @ 60 FPS (so, 300-600 samples)
- I haven’t tried the memory clock boost, will do and post results here.
- I’m using the nvprof “Avg” column for kernel execution time, which I assume only reflects bare kernel running times, not previous allocations, etc.
My specific measurements are these:
- nppiYCrCb422ToYCbCr420_8u_C2P3R: ~4.4 milliseconds (nvprof avg)
- Naive software implementation: ~5 milliseconds (measured using OS timing syscalls)
Perhaps my software implementation is too naive compared with NPP’s, in terms of chroma subsampling?
Thanks for your help,
OK, so given your data points we can now tell a little bit more.
total_data_in = 1843200 bytes => 0.0018432 GB
total_data_out = 1382400 bytes => 0.0013824 GB
total_data = 0.0032256 GB
And you were processing that at an average of 4.4 ms, which yields a bandwidth usage of:
0.0032256 GB / 0.0044 s => 0.733 GB/s
That means you are using about 0.733 GB/s / 14.7 GB/s => 5% of the available bandwidth.
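The arithmetic can be double-checked with exact byte counts (a small sketch; the thread’s figures are rounded from the MB values, so the result may differ in the third decimal):

```cpp
// Bandwidth formula from above: (data_in + data_out) / kernel_time, in GB/s.
double bandwidth_gbs(double bytes_in, double bytes_out, double seconds) {
    return (bytes_in + bytes_out) / seconds * 1e-9;
}

// For this thread: 1280x720 4:2:2 packed input (2 B/px), I420 planar
// output (1.5 B/px), 4.4 ms average kernel time:
//   bandwidth_gbs(1280*720*2, 1280*720*1.5, 4.4e-3) is roughly 0.73 GB/s,
//   i.e. about 5% of the TK1's 14.7 GB/s theoretical peak.
```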
Conclusion: You should be able to run much faster, probably 10x faster, something is not optimal here.
For instance with higher / normal utilization levels you should see timings around:
-> 20% of peak: 1.09 ms
-> 40% of peak: 0.54 ms
-> 60% of peak: 0.36 ms
-> Can you try switching from pinned memory to a plain cudaMalloc(…) buffer, to help rule out pinned-memory pitfalls?
-> Login as root user and set the shader and memory clocks to max, http://elinux.org/Jetson/Performance
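The overrides on that page look roughly like the following; treat the exact debugfs paths and maximum rates as assumptions, since they vary between L4T releases, and verify them against the linked page:

```shell
# Run as root on the Jetson TK1. Paths/rates recalled from the
# elinux.org Jetson/Performance page; confirm for your L4T release.
echo 852000000 > /sys/kernel/debug/clock/override.gbus/rate   # GPU core clock
echo 1         > /sys/kernel/debug/clock/override.gbus/state
echo 924000000 > /sys/kernel/debug/clock/override.emc/rate    # memory (EMC) clock
echo 1         > /sys/kernel/debug/clock/override.emc/state
```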
-> Can you also dump the nvprof output and open it in the CUDA Visual Profiler on a host system?
>> nvprof -o nvprof_dump ./your_application
-> What does the timeline look like when you open nvprof_dump in the Visual Profiler?
Disclaimer: I’m on vacation, hope I didn’t fat-finger my calculator.
Increasing the shader and memory clocks to max did the trick: the pixel conversion now takes about 0.6 milliseconds, and my whole pipeline runs in about 1.5 milliseconds.
Switching from pinned to device memory is something I tried a few days ago, but had no performance effect whatsoever, as expected in a unified memory architecture such as the Tegra K1.
In the visual profiler, my timeline looks pretty much as expected, with various kernels running sequentially for each frame (can’t parallelise much because each kernel output is the next kernel’s input, sadly).
In any case, the numbers now really match my expectations for the Tegra K1 GPGPU performance in image processing, beating the $%&@ out of software implementations.
Thanks again and enjoy your vacation!!!
Ok, nice to hear! Those numbers seem much more reasonable.