I’ve decided to assess the zero-copy and managed-memory modes on a Jetson board with an optimized CUDA matrix multiplication program that uses shared memory.
During my experiments, I noticed that both of them perform significantly slower than when the buffers are allocated with the standard cudaMalloc. Since the CPU and GPU share the same physical memory, I found this result extremely counterintuitive and decided to investigate. That led me to a GTC 2014 talk by Amit Rao, where he said that in zero-copy mode the caches of the GPU and CPU are disabled. He didn’t explain why, so I can only assume there is no hardware to ensure coherence between the CPU and GPU caches — am I correct?
Secondly, I couldn’t find anything explaining how managed memory works on Tegra devices. I noticed that, in managed-memory mode, the CPU accesses the allocated buffers just as fast as if CPU caching were enabled; however, the matrix multiplication kernel is still just as slow as in zero-copy mode. Is there any material that explains in detail what’s going on here?
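For reference, a minimal sketch of the three allocation paths I’m comparing (the buffer size is a placeholder, and the kernel launch is elided):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1024 * 1024 * sizeof(float);
    float *devBuf, *zcHost, *zcDev, *umBuf;

    // 1) Ordinary device memory: needs explicit cudaMemcpy to/from the host.
    cudaMalloc(&devBuf, bytes);

    // 2) Zero-copy: pinned host memory mapped into the GPU's address space.
    cudaHostAlloc(&zcHost, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&zcDev, zcHost, 0);

    // 3) Managed (unified) memory: one pointer valid on both CPU and GPU.
    cudaMallocManaged(&umBuf, bytes);

    // ... launch the same matrix multiplication kernel on devBuf, zcDev,
    //     and umBuf, and time each variant ...

    cudaFree(devBuf);
    cudaFreeHost(zcHost);
    cudaFree(umBuf);
    return 0;
}
```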
Thanks for the response. Unfortunately, the programming guide leaves a few questions unanswered:
1-In zero-copy mode, which specific CPU and GPU caches are disabled on a Tegra device?
2-On Tegra devices (e.g. Jetson), how can I force the CPU and GPU to use their caches with zero-copy? (I’ll ensure memory coherence manually myself by blocking CPU execution while the GPU is busy, and perhaps flushing the CPU cache when the GPU is done.)
3-Does unified memory perform zero-copy under the hood on Tegra devices? Or does it ever duplicate data, even in small quantities?
4-Similarly to question (2), assuming that the UM mode performs zero-copy internally, how can I make sure it enables the GPU and CPU caches?
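To make question (2) concrete, this is the manual-synchronization pattern I have in mind (the kernel and sizes are placeholders; whether the caches can actually stay enabled under this pattern is exactly what I’m asking):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the matrix multiplication.
__global__ void myKernel(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *hostPtr, *devPtr;
    cudaHostAlloc(&hostPtr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&devPtr, hostPtr, 0);

    for (int i = 0; i < n; ++i) hostPtr[i] = 1.0f;  // CPU writes before launch

    myKernel<<<(n + 255) / 256, 256>>>(devPtr, n);
    cudaDeviceSynchronize();  // block the CPU until the GPU is done

    float first = hostPtr[0];  // CPU reads only after the synchronize
    (void)first;

    cudaFreeHost(hostPtr);
    return 0;
}
```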
Both CPU and GPU caches are bypassed for zero-copy memory. This is likely why the matrix multiply is running slower with zero-copy memory.
Unified memory does the cache management to ensure data coherence.
The driver on Tegra does not move data for unified memory; it only performs cache operations.
Unified memory maps the same pages to both the CPU and GPU, and both caches are enabled.
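A minimal sketch of that model, assuming the stream-attachment behavior described for Tegra-class devices: the driver performs its cache-maintenance operations at synchronization boundaries, so the CPU should only touch a managed buffer after synchronizing (the kernel and sizes below are placeholders):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel for illustration.
__global__ void scale(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *buf;
    cudaMallocManaged(&buf, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);
    // Attach the allocation to one stream so the GPU owns it only
    // while work is pending on that stream.
    cudaStreamAttachMemAsync(s, buf, 0, cudaMemAttachSingle);
    cudaStreamSynchronize(s);

    for (int i = 0; i < n; ++i) buf[i] = 1.0f;  // CPU initializes

    scale<<<(n + 255) / 256, 256, 0, s>>>(buf, n);
    cudaStreamSynchronize(s);  // driver's cache ops happen at this boundary;
                               // the CPU may now read buf again

    cudaStreamDestroy(s);
    cudaFree(buf);
    return 0;
}
```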
For Tegra devices, where memory is shared between the GPU and CPU, is unified/managed memory effectively zero-copy (with caches enabled, etc.), and is there any drawback to using it as opposed to device memory?
i.e. can I DMA into managed memory and use it directly on the GPU as if it were device memory, without any penalty?
I feel the first question is still not being answered!!
I am facing the same issue: the kernel execution time is larger when I use a unified-memory buffer compared with a buffer allocated with cudaMalloc. If the cache is not disabled in the unified-memory case, what other factors might lead to this degradation in performance?
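For anyone reproducing this, a minimal timing sketch with CUDA events comparing the two buffer types (the kernel and sizes are placeholders, not the actual workload):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel standing in for the real workload.
__global__ void scale(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

// Time one kernel launch on the given buffer with CUDA events.
static float timeKernel(float *buf, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(buf, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 22;
    float *devBuf, *umBuf;
    cudaMalloc(&devBuf, n * sizeof(float));
    cudaMallocManaged(&umBuf, n * sizeof(float));

    printf("cudaMalloc:        %.3f ms\n", timeKernel(devBuf, n));
    printf("cudaMallocManaged: %.3f ms\n", timeKernel(umBuf, n));

    cudaFree(devBuf);
    cudaFree(umBuf);
    return 0;
}
```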