Zero-Copy and Managed memory on Jetson

I’ve decided to assess the zero-copy and managed memory modes on a Jetson board with an optimized matrix multiplication CUDA program that uses shared memory.
During my experiments, I noticed that both of them perform significantly slower than when the buffers are allocated with standard cudaMalloc. Since the CPU and GPU share the same physical memory, I found this result extremely counter-intuitive and decided to investigate, which led me to a GTC 2014 talk by Amit Rao, where he said that in zero-copy mode the CPU and GPU caches are disabled. He didn’t explain why, so I can only assume there is no hardware to ensure coherence between the CPU and GPU caches. Am I correct?

Secondly, I couldn’t find anything explaining how managed memory works on Tegra devices. I noticed that, in managed memory mode, accessing the allocated buffers from the CPU is just as fast as if CPU caching were enabled; however, the matrix multiplication kernel is still just as slow as in zero-copy mode. Is there some material that explains in detail what’s going on here?
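For reference, the three allocation paths I’m comparing look roughly like this (a simplified sketch; error checking omitted, and `N` stands in for my matrix dimension):

```cuda
#include <cuda_runtime.h>

// Sketch of the three allocation modes under comparison.
void allocate_buffers(size_t N) {
    size_t bytes = N * N * sizeof(float);
    float *d_buf, *zc_host, *zc_dev, *um_buf;

    // 1. Standard device memory: explicit cudaMemcpy needed to/from host.
    cudaMalloc((void **)&d_buf, bytes);

    // 2. Zero-copy: host allocation mapped into the GPU's address space.
    cudaHostAlloc((void **)&zc_host, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&zc_dev, zc_host, 0);

    // 3. Managed (unified) memory: one pointer valid on both CPU and GPU.
    cudaMallocManaged((void **)&um_buf, bytes, cudaMemAttachGlobal);
}
```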

Hi csantos.

Please check the CUDA C Programming Guide; you should be able to find the answer there.


Hello kayccc,
Thanks for the response. Unfortunately, the programming guide leaves a few questions unanswered:

1. In zero-copy mode, which specific CPU and GPU caches are disabled on a Tegra device?
2. On Tegra devices (e.g. Jetson), how can I force the CPU and GPU to use their caches with zero-copy? (I’ll ensure memory coherence manually myself by blocking CPU execution while the GPU is busy, and perhaps flushing the CPU cache when the GPU is done.)
3. Does unified memory perform zero-copy under the hood on Tegra devices? Or does it ever duplicate data, even in small quantities?
4. Similarly to question (2), assuming that the UM mode performs zero-copy internally, how can I make sure it enables the GPU and CPU caches?

Hi csantos,

Both CPU and GPU caches are bypassed for zero-copy memory. This is likely why the matrix multiplication runs slower with zero-copy memory.

Unified memory does the cache management needed to ensure data coherence.
The driver on Tegra does not move data for unified memory; it only performs cache operations.
Unified memory maps the same pages to both the CPU and the GPU, and both caches are enabled.
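As a sketch of the usage pattern this implies (hypothetical kernel and helper names; the point is that the driver performs cache maintenance around launch and sync, so the CPU must synchronize before touching the data again):

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(float *a, size_t n);  // hypothetical kernel

void run(size_t n) {
    float *a;
    size_t bytes = n * sizeof(float);
    cudaMallocManaged((void **)&a, bytes, cudaMemAttachGlobal);

    for (size_t i = 0; i < n; ++i) a[i] = 1.0f;  // CPU writes, cached

    // Driver handles cache flush/invalidate around the launch.
    my_kernel<<<(n + 255) / 256, 256>>>(a, n);

    // Required before the CPU reads the results again.
    cudaDeviceSynchronize();

    float first = a[0];  // CPU reads are cached, hence the fast access observed
    (void)first;
    cudaFree(a);
}
```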


Dear kayccc,

For Tegra devices, on which memory is shared between GPU and CPU, is unified/managed memory effectively zero-copy (with caches enabled, etc.), and is there any drawback to using it as opposed to device memory?

i.e. Can I DMA into managed memory and then use it directly on the GPU, as if it were device memory, without any penalty?



On Tegra, the GPU and CPU allocate memory from the same physical memory.
The main difference is in synchronization and cache handling.

Unified: automatic synchronization via the GPU driver.
Zero-copy: pinned memory, but access may be slow from some locations.

Caches enabled:
Unified: YES
Zero-copy: NO

We recommend Jetson users use unified memory; more information can be found here:



Just confirming: do the GPU and CPU map the same physical pages in memory? Can these pages be remapped for DMA, allowing DMA into these pages and then access via the GPU (after invalidating the cache)?



For zero-copy memory, the GPU and CPU map the same physical location in memory.
For unified memory, the GPU and CPU each have their own physical location, and the CUDA driver will ensure consistency.

You can allocate pinned memory from a DMA buffer.
There are some examples in our MMAPI package.
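Outside the MMAPI samples, pinning an existing (e.g. DMA-able) host buffer for zero-copy access can be sketched with `cudaHostRegister`; this is an illustrative example, not the MMAPI path, and the allocation/alignment details may differ on your platform:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

void register_dma_buffer(size_t bytes) {
    // Stand-in for a DMA-able allocation; page alignment assumed here.
    void *host_buf = aligned_alloc(4096, bytes);

    // Pin the buffer and map it into the GPU's address space.
    cudaHostRegister(host_buf, bytes, cudaHostRegisterMapped);

    void *dev_ptr;
    cudaHostGetDevicePointer(&dev_ptr, host_buf, 0);

    // ... launch kernels using dev_ptr (zero-copy access) ...

    cudaHostUnregister(host_buf);
    free(host_buf);
}
```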



I feel the first question has still not been answered!

I am facing the same issue: kernel execution time is larger when I use a unified memory buffer compared with a buffer allocated with cudaMalloc. If the cache is not disabled in the unified-memory case, then what other factors might lead to this degradation in performance?



There is another identical topic:

Let us track this on the dedicated topic.