I’ve been running CUDA and NPP (NVIDIA Performance Primitives) on the Jetson TK1 for a few weeks now, and I’ve been seeing slight, but certainly not great, performance improvements.
For instance, the nppiYCrCb422ToYCbCr420_8u_C2P3R packed YUV (422) to planar YUV (420) conversion primitive yields only a 10% performance improvement over a naive C++ implementation compiled using G++ 4.8 and -O3.
The CUDA version reads from device memory and writes to host (pinned) memory, which I assume shouldn’t account for much performance degradation on the Tegra K1 architecture… are these numbers normal?
Thanks in advance,
You should be able to reach at least 50% utilization of the GPU with such a simple problem IMO.
By utilization I mean bandwidth utilization (this is definitely a bandwidth-bound kernel).
How to answer “are these numbers normal”:
How much data are you passing in?
-> YCbCr 4:2:2 at 720p is 2 bytes per pixel, so 1280 * 2 = 2560 bytes per row * 720 rows ≈ 1.84 MB, right?
How much data are you writing?
->I.e. how large is the output YCbCr420 ?
Compute [(data_in + data_out) / total_kernel_time] * 1e-9 => X GB/s.
Theoretical max on the TK1 is 14.7 GB/s; what is your X / 14.7?
About measuring time:
-> Make sure to use nvprof to achieve accurate timing on your kernel.
-> Run the kernel several times and average, to get accurate timing and avoid “warm-up” effects.
-> Set your GPU and memory clocks (gbus and emc) to max, or another preferred value, to avoid fluctuations (downclocking when idle).
-> Make sure you are not measuring initialization time (cudaMalloc(…), cudaFree(…), etc.); the first call to the CUDA API can often stall significantly.
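For the CPU-side baseline the same discipline applies. A minimal sketch of such a timing harness (hypothetical helper, not from the thread) could look like:

```cpp
#include <chrono>

// Minimal timing harness: a few discarded warm-up runs, then an
// average over many iterations. fn stands in for whatever routine
// you are benchmarking (e.g. the naive C++ conversion).
template <typename F>
double average_ms(F&& fn, int warmup = 5, int iters = 100) {
    for (int i = 0; i < warmup; ++i) fn();          // discard warm-up runs
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) fn();
    auto t1 = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> dt = t1 - t0;
    return dt.count() / iters;                      // mean ms per call
}
```

On the GPU side, nvprof (or cudaEvent timers) remains the right tool; this is only for timing the software implementation fairly.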
thanks for your comprehensive response. Here are more details about my scenario:
I’m running the pixel format conversion on 1280x720 images @ 60 Hz. That’s about 1.75 MB per image, and about 105 MB per second at 60 FPS.
The output format is I420, so writes are 1.31 MB per image, 79 MB per second at 60 FPS.
That amounts to a total data throughput of 105+79=184 MB per second.
Considering the 14.7 GB/s peak, the ~3.06 MB moved per conversion should take only about 0.22 ms at full bandwidth (roughly 4500 conversions per second). That makes the numbers even stranger to me, since nvprof reports an average time of 4.4 ms per ImageFormatConversionKernel_4XX_8u execution… please explain :(
Answering your questions:
- I’m effectively measuring kernel execution time using nvprof (great tool btw!)
- All measurements are taken after 5-10 seconds running @ 60 FPS (so, 300-600 samples)
- I haven’t tried the memory clock boost, will do and post results here.
- I’m using the nvprof “Avg” column for kernel execution time, which I assume only reflects bare kernel running times, not previous allocations, etc.
My specific measurements are these:
- nppiYCrCb422ToYCbCr420_8u_C2P3R: ~4.4 milliseconds (nvprof avg)
- Naive software implementation: ~5 milliseconds (measured using OS timing syscalls)
Perhaps my software implementation is too naive compared with NPP’s, in terms of chroma subsampling?
Thanks for your help,
OK, so given your data points we can now tell a little bit more.
total_data_in = 1843200 bytes => 0.0018432 GB
total_data_out = 1382400 bytes => 0.0013824 GB
total_data = 0.0032256 GB
And you were processing that at an average of 4.4 ms, which yields a bandwidth usage of:
0.0032256 GB / 0.0044 s => 0.733 GB/s
That means you are using about 0.733 GB/s / 14.7 GB/s => 5% of the available bandwidth.
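The arithmetic can be double-checked with exact byte counts (a small sketch; the thread’s figures are rounded from the MB values, so the result may differ in the third decimal):

```cpp
// Bandwidth formula from above: (data_in + data_out) / kernel_time, in GB/s.
double bandwidth_gbs(double bytes_in, double bytes_out, double seconds) {
    return (bytes_in + bytes_out) / seconds * 1e-9;
}

// For this thread: 1280x720 4:2:2 packed input (2 B/px), I420 planar
// output (1.5 B/px), 4.4 ms average kernel time:
//   bandwidth_gbs(1280*720*2, 1280*720*1.5, 4.4e-3) is roughly 0.73 GB/s,
//   i.e. about 5% of the TK1's 14.7 GB/s theoretical peak.
```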
Conclusion: You should be able to run much faster, probably 10x faster, something is not optimal here.
For instance with higher / normal utilization levels you should see timings around:
-> 20% of peak: 1.09 ms
-> 40% of peak: 0.54 ms
-> 60% of peak: 0.36 ms
-> Can you try switching from pinned memory to a plain cudaMalloc(…) buffer, to help rule out pinned-memory pitfalls?
-> Login as root user and set the shader and memory clocks to max, http://elinux.org/Jetson/Performance
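The overrides on that page look roughly like the following; treat the exact debugfs paths and maximum rates as assumptions, since they vary between L4T releases, and verify them against the linked page:

```shell
# Run as root on the Jetson TK1. Paths/rates recalled from the
# elinux.org Jetson/Performance page; confirm for your L4T release.
echo 852000000 > /sys/kernel/debug/clock/override.gbus/rate   # GPU core clock
echo 1         > /sys/kernel/debug/clock/override.gbus/state
echo 924000000 > /sys/kernel/debug/clock/override.emc/rate    # memory (EMC) clock
echo 1         > /sys/kernel/debug/clock/override.emc/state
```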
-> Can you also dump the nvprof output and open it in the CUDA Visual Profiler on a host system?
>> nvprof -o nvprof_dump ./your_application
-> What does the timeline look like when you open nvprof_dump in the Visual Profiler?
Disclaimer: I’m on vacation, hope I didn’t fat-finger my calculator.
Increasing the shader and memory clocks to max did the trick: the pixel conversion now takes about 0.6 milliseconds, and my whole pipeline runs in about 1.5 milliseconds.
Switching from pinned to device memory is something I tried a few days ago, but had no performance effect whatsoever, as expected in a unified memory architecture such as the Tegra K1.
In the visual profiler, my timeline looks pretty much as expected, with various kernels running sequentially for each frame (can’t parallelise much because each kernel output is the next kernel’s input, sadly).
In any case, the numbers now really match my expectations for the Tegra K1 GPGPU performance in image processing, beating the $%&@ out of software implementations.
Thanks again and enjoy your vacation!!!
Ok, nice to hear! Those numbers seem much more reasonable.