Memory-type questions

Hi,

I read about the different memory types used for data exchange between host and GPU. I have trouble seeing the differences between these types:

  • pinned memory
  • page locked memory
  • zero copy memory

From what I have read so far they all behave the same - is it possible that all these types are just names for a single memory type? Different names for the same thing?

Another question is regarding the async memory functions. If I want to use an async memcpy function I have to use ‘page locked memory’. But I think that page-locked memory can be read/written by both host and GPU. They use the same memory. So what is the point of using a memcpy function if it is the same memory? Why should I copy memory that is already accessible to the GPU?

Thanks,
Daniel

page locked/pinned: guaranteed not to be paged out by the virtual memory system. This guarantee speeds up host-to-device and device-to-host memory copies, as DMA can be used for the entire transfer. This memory type does not automatically have an address on the device unless that is explicitly requested at allocation time (the cudaHostAllocMapped flag to cudaHostAlloc, or the cudaHostRegisterMapped flag when pinning existing memory with cudaHostRegister)
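As a rough sketch of the two variants described above, using the runtime API (error checking omitted for brevity; requires a CUDA-capable GPU to actually run):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Plain pinned (page-locked) host memory: enables fast DMA copies,
    // but has no device-side address by default.
    float *h_pinned;
    cudaHostAlloc(&h_pinned, 1024 * sizeof(float), cudaHostAllocDefault);

    // Pinned AND mapped into the device address space ("zero copy"):
    float *h_mapped, *d_mapped;
    cudaHostAlloc(&h_mapped, 1024 * sizeof(float), cudaHostAllocMapped);
    // Retrieve the device-side pointer for the same physical host memory.
    cudaHostGetDevicePointer(&d_mapped, h_mapped, 0);

    printf("host ptr %p, device ptr %p\n", (void *)h_mapped, (void *)d_mapped);

    cudaFreeHost(h_pinned);
    cudaFreeHost(h_mapped);
    return 0;
}
```

On devices with unified addressing the two pointers may even be numerically identical; the point is that only the mapped variant is dereferenceable from device code.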

unified: the same unified address space applies to host and device memory. Driver+hardware take care of paging in and out the memory automatically to host or device. Does have some extra overhead.

zero copy: Here I believe the device would directly write data to host memory via PCIe transfers. But I am not 100% sure. May suffer from high latencies due to the PCIe overhead.

Hi,

thanks a lot for your fast reply. From the documentation of ‘cuMemAllocHost’ I understand that the device and the host can access the allocated memory directly ( CUDA Driver API :: CUDA Toolkit Documentation (nvidia.com)):

“Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc().”

As far as I understand the docs, I can use this memory on the host side as well as on the device side, and any modification will be visible on the ‘other side’, meaning: if the host modifies the memory, the device will see this modification without any call to a memcpy function. And vice versa.
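That understanding can be checked with a small sketch: the host writes through the mapped pointer, a kernel modifies the same memory, and the host reads the result back without any explicit memcpy (hypothetical kernel name `bump`; needs a GPU that supports mapped host memory):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Kernel writes directly through the mapped (zero-copy) pointer.
__global__ void bump(int *p) { *p += 1; }

int main() {
    int *h_val, *d_val;
    cudaHostAlloc(&h_val, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_val, h_val, 0);

    *h_val = 41;                 // host write, no cudaMemcpy
    bump<<<1, 1>>>(d_val);       // device accesses the same memory over PCIe
    cudaDeviceSynchronize();     // make sure the kernel has finished

    printf("%d\n", *h_val);      // host sees the device's update
    cudaFreeHost(h_val);
    return 0;
}
```

Note that every device access here crosses the PCIe bus, which is the latency concern mentioned earlier in the thread.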

But I’m not sure about this, and I see some performance impact on Pascal GPUs when accessing this memory - no impact noticed on 30xx GPUs so far.

Yes, if the host modifies the memory, the device will see this modification. However, there will be a memcpy done by the driver over the PCIe bus after a page fault on the device side. The driver performs on-demand paging between host and device memory.

Considering that Pascal was the first NVIDIA GPU architecture supporting on-demand page migration for unified memory (hardware page faulting), there are likely differences between Pascal GPUs and the 3000 series that make operation with unified memory less efficient on Pascal.

Thanks again, but this is even more confusing.

The docs list a function to allocate unified memory: cuMemAllocManaged
CUDA Driver API :: CUDA Toolkit Documentation (nvidia.com)

I use cuMemAllocHost, which allocates ‘page locked’ memory. Is unified memory = page-locked memory?

Sorry to bother you, but this is exactly my problem in understanding. There are a lot of different functions and also a lot of names for these memory types, but I do not see any clear assignment or relation:

  1. page locked → cuMemAllocHost (maybe same as pinned? maybe same as zero copy memory???)
  2. unified memory → cuMemAllocManaged ? Memory that is managed by the ‘Unified memory System’ - does it work/behave like page locked?
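Since the question is phrased in terms of the driver API, the two allocations in the list above can be sketched side by side like this (a minimal sketch; context setup shown explicitly because the driver API requires it, error checking omitted):

```cpp
#include <cuda.h>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // 1. Page-locked (pinned) host memory: physically lives in host RAM,
    //    never paged out; returns an ordinary host pointer.
    void *host_ptr;
    cuMemAllocHost(&host_ptr, 1 << 20);

    // 2. Unified (managed) memory: the driver migrates pages between
    //    host and device on demand; one pointer is valid on both sides.
    CUdeviceptr managed_ptr;
    cuMemAllocManaged(&managed_ptr, 1 << 20, CU_MEM_ATTACH_GLOBAL);

    cuMemFreeHost(host_ptr);
    cuMemFree(managed_ptr);
    cuCtxDestroy(ctx);
    return 0;
}
```

The key difference: cuMemAllocHost memory stays put in host RAM, while cuMemAllocManaged memory can migrate to device memory behind your back.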

No, cbuchner introduced unified memory into this discussion. For the purpose of the discussion I see here, pinned memory == page locked memory == zero copy memory. Unified memory is quite different.

However, both pinned/pagelocked/zero-copy memory and unified memory (UM) have the characteristic that the allocation is accessible directly from host and device code. This is generally not true for allocations done using an ordinary host allocator such as new or malloc and also not true for ordinary device memory allocations using cudaMalloc or cuMemAlloc.

Therefore, people who are interested in using pinned/pagelocked/zero-copy memory because they want to directly access it using the same pointer from either host or device code, may also be interested in UM.

Apart from the above discussion, if you want to use an async transfer from/to host memory, and you want that async transfer to be able to overlap with other async operations (such as e.g. kernels), then the host memory participating in the async transfer must be pinned/pagelocked/zero-copy. This is really a separate idea from the characteristic of being accessed directly from host or device. pinned/pagelocked/zero-copy memory serves multiple distinct purposes in CUDA.
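A minimal sketch of that separate use case, where pinned memory exists purely so the copy can run asynchronously and overlap with other work (buffer names are illustrative):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;
    float *h_buf, *d_buf;

    // Pinned host memory is required for the copy to be truly asynchronous;
    // with pageable memory the driver falls back to a staged, blocking copy.
    cudaHostAlloc(&h_buf, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_buf, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Enqueues the transfer and returns immediately; it can overlap
    // with kernels running in other streams.
    cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... launch independent kernels here ...

    cudaStreamSynchronize(stream);  // wait for the copy to complete
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

Here the data genuinely moves from host RAM to device memory, which is why a copy makes sense even though pinned memory could also have been mapped for direct access.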

Yes, thanks a lot. This is the information I was looking for. Sometimes it is very confusing to read different expressions for the same thing.

Thanks again to all.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.