NVIDIA is considering posting a more detailed answer, but for now I can give you some hints.
“Zero-copy” in Tegra K1 (UVM-Lite) makes some CUDA kernels faster and some kernels slower. My guess is that regular color conversions are typically faster with zero-copy, whereas Bayer format conversions would be slower since they access pixels in an irregular pattern.
Zero-copy removes the delay of transferring memory between CPU & GPU, but in Tegra K1 the zero-copy memory won’t be cached as well as regular GPU memory. So zero-copy is more likely to be faster in small simple kernels that don’t access the same group of pixels more than once, while the traditional method is more likely to be faster in large complex kernels that access the same pixels many times.
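To make the two approaches concrete, here is a rough sketch (my own illustration, not NVIDIA's code) that runs the same trivial kernel both ways: once on zero-copy buffers allocated with cudaHostAlloc() and the cudaHostAllocMapped flag, and once the traditional way with cudaMalloc() plus cudaMemcpy(). The kernel name invert, the helper countMismatches and the image size are made up for the example. It only checks that both paths produce the same result; you would need to add your own timing to see which one is faster for your kernel.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                            \
    do {                                                       \
        cudaError_t err_ = (call);                             \
        if (err_ != cudaSuccess) {                             \
            fprintf(stderr, "%s failed: %s\n", #call,          \
                    cudaGetErrorString(err_));                 \
            exit(EXIT_FAILURE);                                \
        }                                                      \
    } while (0)

// Hypothetical "small simple kernel": every pixel is read once and written once.
__global__ void invert(const unsigned char* in, unsigned char* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];
}

// Runs the kernel via zero-copy and via explicit copies, then returns the
// number of pixels where the two results differ (expected to be zero).
int countMismatches(int n)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Zero-copy path: pinned host memory that is mapped into the GPU's address space.
    unsigned char *hIn = nullptr, *hOut = nullptr;
    CHECK(cudaHostAlloc((void**)&hIn, n, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&hOut, n, cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) hIn[i] = (unsigned char)(i % 256);

    unsigned char *dIn = nullptr, *dOut = nullptr;
    CHECK(cudaHostGetDevicePointer((void**)&dIn, hIn, 0));
    CHECK(cudaHostGetDevicePointer((void**)&dOut, hOut, 0));
    invert<<<blocks, threads>>>(dIn, dOut, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());  // wait before the CPU reads hOut

    // Traditional path: separate GPU buffers plus explicit transfers.
    unsigned char *gIn = nullptr, *gOut = nullptr;
    std::vector<unsigned char> result(n);
    CHECK(cudaMalloc((void**)&gIn, n));
    CHECK(cudaMalloc((void**)&gOut, n));
    CHECK(cudaMemcpy(gIn, hIn, n, cudaMemcpyHostToDevice));
    invert<<<blocks, threads>>>(gIn, gOut, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpy(result.data(), gOut, n, cudaMemcpyDeviceToHost));

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (result[i] != hOut[i]) ++mismatches;

    CHECK(cudaFree(gIn));
    CHECK(cudaFree(gOut));
    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut));
    return mismatches;
}

int main()
{
    // Ask for host-mapped allocations before any other CUDA call creates the context.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));
    printf("mismatching pixels: %d\n", countMismatches(640 * 480));
    return 0;
}
```

Wrapping each path in cudaEvent timers (or running it under a profiler) is probably the easiest way to compare them on your own board.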
I don’t have much experience with zero-copy myself, but here are some notes that might help tweak the memory performance for Tegra K1.
Using UVM-Lite on Tegra K1 (i.e. allocating memory with “cudaMallocManaged()” from the UVM-Lite API, to get a memory pointer that works on both CPU & GPU):
- Don't modify the same memory on the CPU & GPU at the same time. One potential strategy is to copy the data into a second buffer, so that while the CPU is processing one buffer the GPU can process the other, and the two run in parallel.
- Launching a CUDA kernel will flush ALL caches used by both the CPU & GPU.
- Try using pinned memory pages instead of regular (pageable) memory, as this is often faster in Tegra K1.
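As a starting point, here is a minimal sketch of the cudaMallocManaged() usage described in the notes above. It is only an illustration under my own assumptions: the kernel scale and the helper countWrongAfterScale are made-up names, and it uses the simplest safe pattern (the CPU only touches the buffer while the GPU is idle) rather than the two-buffer overlap, which needs more care.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiply every element by a constant factor.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Fills a managed buffer on the CPU, doubles it on the GPU, and returns the
// number of wrong elements seen by the CPU afterwards (-1 on a CUDA error).
int countWrongAfterScale(int n)
{
    float* data = nullptr;

    // One pointer that is valid on both the CPU and the GPU.
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return -1;

    // CPU writes while no kernel is running.
    for (int i = 0; i < n; ++i) data[i] = (float)i;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(data, n, 2.0f);

    // Wait for the GPU to finish before the CPU touches the buffer again,
    // so the two never access the same memory at the same time.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(data);
        return -1;
    }

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (data[i] != 2.0f * (float)i) ++wrong;

    cudaFree(data);
    return wrong;
}

int main()
{
    printf("wrong elements: %d\n", countWrongAfterScale(1 << 20));
    return 0;
}
```

The cudaDeviceSynchronize() call is the important part: it is what keeps the CPU reads from overlapping with the GPU writes to the same managed buffer.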