Zero-Copy and Managed memory on Jetson

csantos · April 26, 2016, 9:51pm

Hello,
I’ve decided to assess the zero-copy and managed memory modes on the Jetson board with an optimized matrix multiplication CUDA program that uses shared memory.
During my experiments, I noticed that both modes perform significantly slower than when the buffers are allocated with the standard cudaMalloc. Since the CPU and GPU share the same memory, I found this result extremely counterintuitive and decided to investigate. That led me to a GTC 2014 talk by Amit Rao, where he said that the GPU and CPU caches are disabled in zero-copy mode. He didn’t explain why, so I can only assume that there’s no hardware to ensure coherence between the CPU and GPU. Am I correct?

Secondly, I couldn’t find anything explaining how managed memory works on Tegra devices. I noticed that, in managed memory mode, accessing the allocated buffers by the CPU is just as fast as if CPU caching was enabled; however, the matrix multiplication kernel is still just as slow as when using the zero-copy mode. Is there some material that could explain in detail what’s going on here?
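
A minimal sketch of the kind of comparison described above, for readers who want to reproduce it. It is not the original program: the tiled kernel, the `CHECK` macro, the `Mode` enum, `runMode`, `timeKernel`, and the 512 x 512 matrix size are all illustrative assumptions. It times the same shared-memory matrix multiply against buffers from cudaMalloc, mapped pinned (zero-copy) memory, and cudaMallocManaged.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16
// Abort with a message if a CUDA runtime call fails (illustrative helper).
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) {      \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    exit(EXIT_FAILURE); } } while (0)

// Tiled C = A * B for square n x n matrices; n must be a multiple of TILE.
__global__ void matmul(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

// Run one warm-up launch, then return the time of a second launch in ms.
float timeKernel(const float *A, const float *B, float *C, int n) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    matmul<<<grid, block>>>(A, B, C, n);          // warm-up, not timed
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaEventRecord(start));
    matmul<<<grid, block>>>(A, B, C, n);          // timed launch
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

enum Mode { DEVICE, ZERO_COPY, MANAGED };

// Allocate A, B, C in the given mode, fill the inputs, time the kernel.
float runMode(Mode mode, int n) {
    size_t count = (size_t)n * n, bytes = count * sizeof(float);
    float *h[3], *d[3];  // CPU-visible and GPU-visible pointers per buffer
    for (int i = 0; i < 3; ++i) {
        if (mode == DEVICE) {
            h[i] = (float *)malloc(bytes);
            CHECK(cudaMalloc((void **)&d[i], bytes));
        } else if (mode == ZERO_COPY) {
            CHECK(cudaHostAlloc((void **)&h[i], bytes, cudaHostAllocMapped));
            CHECK(cudaHostGetDevicePointer((void **)&d[i], h[i], 0));
        } else {
            CHECK(cudaMallocManaged((void **)&h[i], bytes));
            d[i] = h[i];  // one pointer is valid on both CPU and GPU
        }
    }
    for (size_t j = 0; j < count; ++j) { h[0][j] = 1.0f; h[1][j] = 2.0f; }
    if (mode == DEVICE) {  // only the cudaMalloc path needs explicit copies
        CHECK(cudaMemcpy(d[0], h[0], bytes, cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(d[1], h[1], bytes, cudaMemcpyHostToDevice));
    }
    float ms = timeKernel(d[0], d[1], d[2], n);
    for (int i = 0; i < 3; ++i) {
        if (mode == DEVICE) { free(h[i]); CHECK(cudaFree(d[i])); }
        else if (mode == ZERO_COPY) CHECK(cudaFreeHost(h[i]));
        else CHECK(cudaFree(h[i]));
    }
    return ms;
}

int main() {
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));  // allow mapped pinned memory
    const int n = 512;
    const char *names[] = {"cudaMalloc", "zero-copy", "managed"};
    for (int m = DEVICE; m <= MANAGED; ++m)
        printf("%-10s kernel time: %.3f ms\n", names[m], runMode((Mode)m, n));
    return 0;
}
```

The sketch measures kernel time only; the host-to-device copies on the cudaMalloc path are deliberately left outside the timed region, so a full comparison would also need to account for them.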

kayccc · April 29, 2016, 3:17am

Hi csantos.

Please check the CUDA C Programming Guide; you should be able to find the answer there.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-c-runtime

Thanks

csantos · April 29, 2016, 7:11pm

Hello kayccc,
Thanks for the response. Unfortunately, the programming guide leaves a few questions unanswered:

1. In zero-copy mode, which specific CPU and GPU caches are disabled on a Tegra device?
2. On Tegra devices (e.g. Jetson), how can I force the CPU and GPU to use their caches with zero-copy? (I’ll ensure memory coherence manually by blocking CPU execution while the GPU is busy, and maybe flushing the CPU cache when the GPU is done.)
3. Does unified memory perform zero-copy under the hood on Tegra devices? Or does it ever duplicate data, even in small quantities?
4. Similarly to question (2), assuming that the UM mode performs zero-copy internally, how can I make sure it enables the GPU and CPU caches?

kayccc · May 5, 2016, 5:15am

Hi csantos,

Both CPU and GPU caches are bypassed for zero-copy memory. This is likely why the matrix multiply is running slower with zero-copy memory.

Unified memory does the cache management to ensure data coherence.
The driver on Tegra does not move data for unified memory; it only performs cache operations.
Unified memory maps the same pages to both the CPU and GPU, and both caches are enabled.
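
A minimal sketch of the usage pattern this implies, assuming a simple element-wise kernel. The names `scale`, `scaleManaged`, and `concurrentManagedAccess` are hypothetical helpers, not part of any NVIDIA sample. The CPU and GPU share one pointer from cudaMallocManaged, and the CPU waits on cudaDeviceSynchronize before reading the result. Treating that synchronization as the point where the driver gets to do its cache maintenance is an assumption; the thread does not spell out the mechanism.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Multiply every element by `factor`.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns 1 if the CPU and GPU may access managed memory concurrently, else 0.
int concurrentManagedAccess(int device) {
    int value = 0;
    cudaDeviceGetAttribute(&value, cudaDevAttrConcurrentManagedAccess, device);
    return value;
}

// Fill n floats with 1.0 on the CPU, scale them on the GPU, and return the
// first element as read back by the CPU. Returns -1.0f if allocation fails.
float scaleManaged(int n, float factor) {
    float *data = nullptr;
    if (cudaMallocManaged((void **)&data, n * sizeof(float)) != cudaSuccess)
        return -1.0f;

    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // CPU writes via the same pointer

    scale<<<(n + 255) / 256, 256>>>(data, n, factor);

    // Without concurrent managed access, the CPU must not touch `data`
    // until the GPU has finished, so wait here before reading.
    cudaDeviceSynchronize();

    float first = data[0];
    cudaFree(data);
    return first;
}

int main() {
    printf("concurrentManagedAccess = %d\n", concurrentManagedAccess(0));
    printf("first element = %f\n", scaleManaged(1 << 20, 3.0f));
    return 0;
}
```

No cudaMemcpy appears anywhere in the sketch; whether the runtime copies anything internally depends on the platform and driver.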

Thanks

akmal.ali · November 3, 2017, 10:26am

Dear kayccc,

For Tegra devices, on which memory is shared between the GPU and CPU, is Unified Memory / Managed memory effectively zero-copy (with caches enabled, etc.), and is there any drawback to using it as opposed to device memory?

That is, can I DMA into managed memory and use it directly with the GPU as if it were device memory, without any penalty?

Thanks

AastaLLL · November 9, 2017, 9:35am

Hi,

On Tegra, GPU and CPU allocate memory from the same hardware.
The main difference is in sync and cache handling.

Sync:
Unified: auto-sync via GPU driver
Zero-copy: pinned memory, but access may be slow for some locations.

Cache:
Unified: YES
Zero-copy: NO
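
A minimal sketch of the zero-copy side of this comparison, assuming a trivial increment kernel. `addOne` and `zeroCopySum` are hypothetical names. It allocates pinned host memory with cudaHostAllocMapped, obtains the GPU-visible alias with cudaHostGetDevicePointer, and lets the kernel work on the host buffer in place. The device-property query at the start only reports what the runtime says about the platform.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Add 1 to every element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Fill a mapped pinned buffer with 0..n-1 on the CPU, increment it on the
// GPU through the mapped pointer, and return the sum computed by the CPU.
// Returns -1 if allocation or mapping fails.
long long zeroCopySum(int n) {
    int *host = nullptr, *dev = nullptr;
    if (cudaHostAlloc((void **)&host, n * sizeof(int), cudaHostAllocMapped)
            != cudaSuccess)
        return -1;
    for (int i = 0; i < n; ++i) host[i] = i;

    // Device-side alias of the same pinned allocation; no copy is issued.
    if (cudaHostGetDevicePointer((void **)&dev, host, 0) != cudaSuccess) {
        cudaFreeHost(host);
        return -1;
    }
    addOne<<<(n + 255) / 256, 256>>>(dev, n);
    cudaDeviceSynchronize();  // wait before the CPU reads the results

    long long sum = 0;
    for (int i = 0; i < n; ++i) sum += host[i];
    cudaFreeHost(host);
    return sum;
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);  // request host-memory mapping
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("integrated=%d canMapHostMemory=%d managedMemory=%d\n",
           prop.integrated, prop.canMapHostMemory, prop.managedMemory);
    printf("sum = %lld\n", zeroCopySum(1000));
    return 0;
}
```

The unified-memory equivalent would replace the cudaHostAlloc and cudaHostGetDevicePointer pair with a single cudaMallocManaged call and release the buffer with cudaFree.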

We recommend that Jetson users use unified memory. More information can be found here:
Programming Guide :: CUDA Toolkit Documentation

Thanks.

akmalali · March 2, 2018, 10:01am

@kayccc

Just to confirm: do the GPU and CPU map the same physical pages in memory? Can these pages be remapped for DMA, so that data can be DMA'd into them and then accessed by the GPU (after invalidating the cache)?

Akmal

AastaLLL · March 12, 2018, 7:08am

Hi,

For zero-copy memory, GPU and CPU map to the same physical location in memory.
For unified memory, the GPU and CPU each have their own physical location, and the CUDA driver ensures consistency between them.

You can allocate pinned memory from a DMA buffer.
Some examples can be found in our MMAPI package.

Thanks.

narendra.v · August 14, 2018, 2:58pm

Hi,

I feel the first question has still not been answered.

I am facing the same issue: the kernel execution time is longer when I use a unified memory buffer than when I use a buffer allocated with cudaMalloc. If the cache is not disabled in the unified memory case, what other factors might lead to this performance degradation?

Thanks.

AastaLLL · August 20, 2018, 7:25am

Hi,

There is another identical topic:
https://devtalk.nvidia.com/default/topic/1038545

Let us track this on the dedicated topic.
Thanks.
