Using the Jetson TK1 and its custom NVIDIA-built OpenCV library, I've written a piece of code that demosaics a Bayer image and resizes it, using two methods:
1. ordinary device memory, with an explicit upload to a cv::gpu::GpuMat
2. cv::gpu::CudaMem with ALLOC_ZEROCOPY
I would expect the second method to be much faster, since it makes use of the unified memory architecture in the TK1, but it's 35% worse! Can anyone explain why that is? Also, when I disable both the resize and the demosaic, the second method performs faster.
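For reference, here's a rough sketch of what the two paths look like with the OpenCV 2.4 gpu module. The demosaic call (gpu::cvtColor with a Bayer code), the Bayer pattern, and the scale factor are placeholders, since the real code isn't shown here:

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>

// Method 1: ordinary device memory with explicit upload/download.
void run_discrete(const cv::Mat& bayer, cv::Mat& out)
{
    cv::gpu::GpuMat d_bayer, d_color, d_small;
    d_bayer.upload(bayer);                                  // host -> device copy
    cv::gpu::cvtColor(d_bayer, d_color, CV_BayerBG2BGR);    // demosaic (assumed call)
    cv::gpu::resize(d_color, d_small, cv::Size(), 0.5, 0.5);
    d_small.download(out);                                  // device -> host copy
}

// Method 2: zero-copy host memory mapped straight into the GPU address space.
void run_zerocopy(const cv::Mat& bayer, cv::Mat& out)
{
    cv::gpu::CudaMem src(bayer.rows, bayer.cols, bayer.type(),
                         cv::gpu::CudaMem::ALLOC_ZEROCOPY);
    bayer.copyTo(src.createMatHeader());                    // fill the shared buffer
    cv::gpu::GpuMat d_bayer = src.createGpuMatHeader();     // no copy, same memory

    cv::gpu::GpuMat d_color;                                // intermediate in device memory
    cv::gpu::cvtColor(d_bayer, d_color, CV_BayerBG2BGR);

    cv::gpu::CudaMem dst(bayer.rows / 2, bayer.cols / 2, d_color.type(),
                         cv::gpu::CudaMem::ALLOC_ZEROCOPY);
    cv::gpu::GpuMat d_small = dst.createGpuMatHeader();
    cv::gpu::resize(d_color, d_small, d_small.size());

    out = dst.createMatHeader().clone();                    // result already visible to the CPU
}
```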
Did you max out the CPU and GPU clocks before doing the performance tests? By default they are adjusted dynamically based on load, and in short tests they can be basically anything between 100 MHz and 2000 MHz.
NVIDIA is considering posting a more detailed answer, but for now I can give you some hints.
"Zero-copy" on the Tegra K1 (UVM-Lite) makes some CUDA kernels faster and some kernels slower. My guess is that regular color conversions are typically faster with zero-copy, whereas Bayer demosaicing would be slower since it accesses pixels in an irregular pattern.
Zero-copy removes the delay of transferring memory between CPU & GPU, but in Tegra K1 the zero-copy memory won’t be cached as well as regular GPU memory. So zero-copy is more likely to be faster in small simple kernels that don’t access the same group of pixels more than once, while the traditional method is more likely to be faster in large complex kernels that access the same pixels many times.
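To make that concrete, here's an illustrative pair of kernels (not the ones OpenCV actually uses): the first reads each input pixel exactly once, the pattern where mapped zero-copy memory tends to hold up, while the second re-reads each pixel from a 3x3 neighbourhood, which leans on the GPU caches that zero-copy memory doesn't benefit from as much:

```cpp
// Illustrative only.
__global__ void scale_pixel(const unsigned char* in, unsigned char* out,
                            int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = in[y * w + x] / 2;      // one read, one write per pixel
}

__global__ void box3x3(const unsigned char* in, unsigned char* out,
                       int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;

    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy)             // 9 reads per output pixel;
        for (int dx = -1; dx <= 1; ++dx)         // each input pixel is read by ~9 threads
            sum += in[(y + dy) * w + (x + dx)];
    out[y * w + x] = sum / 9;
}
```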
I don’t have much experience with zero-copy myself, but here are some notes that might help tweak the memory performance for Tegra K1.
Using UVM-Lite on Tegra K1 (i.e. allocating memory with cudaMallocManaged() from the UVM-Lite API, to get a memory pointer that works on both the CPU and GPU):
Don't modify the same memory on the CPU and GPU at the same time. One potential strategy is to use two buffers: while the CPU is processing one buffer, the GPU processes the other, so they can run in parallel (see the sketch after these notes).
Launching a CUDA kernel will flush ALL caches used by both the CPU and GPU.
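Here's a rough sketch of that double-buffer idea with managed memory. The buffer names and the dummy kernel are made up, but the key calls are cudaMallocManaged() and cudaStreamAttachMemAsync(), which let the CPU work on one buffer while the GPU works on the other without violating the rule above:

```cpp
#include <cuda_runtime.h>

__global__ void process(unsigned char* buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 255 - buf[i];               // placeholder GPU work
}

int main()
{
    const int n = 1 << 20;
    unsigned char *bufA, *bufB;
    cudaMallocManaged((void**)&bufA, n);            // one pointer valid on CPU and GPU
    cudaMallocManaged((void**)&bufB, n);

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaStreamAttachMemAsync(s, bufA, 0, cudaMemAttachSingle); // GPU may touch bufA only via stream s
    cudaStreamAttachMemAsync(s, bufB, 0, cudaMemAttachHost);   // bufB stays CPU-accessible
    cudaStreamSynchronize(s);

    process<<<(n + 255) / 256, 256, 0, s>>>(bufA, n);  // GPU works on bufA ...

    for (int i = 0; i < n; ++i)                        // ... while the CPU fills bufB in parallel
        bufB[i] = (unsigned char)i;

    cudaStreamSynchronize(s);                          // wait before the CPU touches bufA again
    cudaStreamDestroy(s);
    cudaFree(bufA);
    cudaFree(bufB);
    return 0;
}
```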
Try using pinned memory pages instead of regular (pageable) memory, as this is often faster in Tegra K1.
I haven't done a detailed empirical study, but in my experience it's faster to use pinned memory at the input and output of your processing chain, and to allocate all intermediate buffers with cudaMalloc( … ); a sketch follows below.
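A minimal sketch of that layout, with made-up sizes and the kernel calls only indicated in comments:

```cpp
#include <cuda_runtime.h>

void setup_buffers(size_t in_bytes, size_t mid_bytes, size_t out_bytes)
{
    unsigned char *h_in, *h_out;                 // host side, pinned
    unsigned char *d_in, *d_mid, *d_out;         // device side

    cudaHostAlloc((void**)&h_in,  in_bytes,  cudaHostAllocDefault);  // pinned input
    cudaHostAlloc((void**)&h_out, out_bytes, cudaHostAllocDefault);  // pinned output

    cudaMalloc((void**)&d_in,  in_bytes);        // intermediates stay in device memory
    cudaMalloc((void**)&d_mid, mid_bytes);
    cudaMalloc((void**)&d_out, out_bytes);

    // Typical frame loop (kernels omitted):
    //   cudaMemcpyAsync(d_in, h_in, in_bytes, cudaMemcpyHostToDevice, stream);
    //   stage1<<<..., stream>>>(d_in, d_mid, ...);
    //   stage2<<<..., stream>>>(d_mid, d_out, ...);
    //   cudaMemcpyAsync(h_out, d_out, out_bytes, cudaMemcpyDeviceToHost, stream);
    //   cudaStreamSynchronize(stream);

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    cudaFree(d_in);
    cudaFree(d_mid);
    cudaFree(d_out);
}
```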
So far, managed memory seems to be more about convenience than raw performance. Perhaps that will change, or perhaps it isn't true for certain use cases.