Memory-type questions

Hi,

I read about the different memory types used for data exchange between host and GPU. I have trouble seeing the differences between these types:

  • pinned memory
  • page locked memory
  • zero copy memory

From what I have read so far they all behave the same - is it possible that all these types are just names for a single memory type? Different names for the same thing?

Another question is regarding the async memory functions. If I want to use an async memcpy function I have to use ‘page locked memory’. But I think that page-locked memory can be read/written by both host and GPU. They use the same memory. So what is the point of using a memcpy function if it is the same memory? Why should I copy memory that is already accessible to the GPU?

Thanks,
Daniel

page locked/pinned: guaranteed not to be paged out by the virtual memory system. This guarantee speeds up host-to-device and device-to-host memory copies, as DMA can be used for the entire transfer. This memory type does not automatically have an address on the device unless that is explicitly requested at allocation time (the cudaHostAllocMapped flag to cudaHostAlloc, or the cudaHostRegisterMapped flag when pinning existing memory with cudaHostRegister)
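As a rough sketch of the two variants described above, using the runtime API (error checking omitted for brevity; requires a CUDA-capable GPU to actually run):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Plain pinned (page-locked) host memory: enables fast DMA copies,
    // but has no device-side address by default.
    float *h_pinned;
    cudaHostAlloc(&h_pinned, 1024 * sizeof(float), cudaHostAllocDefault);

    // Pinned AND mapped into the device address space ("zero copy"):
    float *h_mapped, *d_mapped;
    cudaHostAlloc(&h_mapped, 1024 * sizeof(float), cudaHostAllocMapped);
    // Retrieve the device-side pointer for the same physical host memory.
    cudaHostGetDevicePointer(&d_mapped, h_mapped, 0);

    printf("host ptr %p, device ptr %p\n", (void *)h_mapped, (void *)d_mapped);

    cudaFreeHost(h_pinned);
    cudaFreeHost(h_mapped);
    return 0;
}
```

On devices with unified addressing the two pointers may even be numerically identical; the point is that only the mapped variant is dereferenceable from device code.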

unified: the same unified address space applies to host and device memory. Driver+hardware take care of paging in and out the memory automatically to host or device. Does have some extra overhead.

zero copy: Here I believe the device would directly write data to host memory via PCIe transfers. But I am not 100% sure. May suffer from high latencies due to the PCIe overhead.

Hi,

thanks a lot for your fast reply. From the documentation of ‘cuMemAllocHost’ I understand that the device and the host can access the allocated memory directly ( CUDA Driver API :: CUDA Toolkit Documentation (nvidia.com)):

“Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc().”

As far as I understand the docs, I can use this memory on the host side as well as on the device side, and any modification will be visible on the ‘other side’, meaning: if the host modifies the memory, the device will see this modification without any call to a memcpy function. And vice versa.
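That understanding can be checked with a small sketch: the host writes through the mapped pointer, a kernel modifies the same memory, and the host reads the result back without any explicit memcpy (hypothetical kernel name `bump`; needs a GPU that supports mapped host memory):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Kernel writes directly through the mapped (zero-copy) pointer.
__global__ void bump(int *p) { *p += 1; }

int main() {
    int *h_val, *d_val;
    cudaHostAlloc(&h_val, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_val, h_val, 0);

    *h_val = 41;                 // host write, no cudaMemcpy
    bump<<<1, 1>>>(d_val);       // device accesses the same memory over PCIe
    cudaDeviceSynchronize();     // make sure the kernel has finished

    printf("%d\n", *h_val);      // host sees the device's update
    cudaFreeHost(h_val);
    return 0;
}
```

Note that every device access here crosses the PCIe bus, which is the latency concern mentioned earlier in the thread.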

But I’m not sure about this, and I see some performance impact on Pascal GPUs when accessing this memory - no impact noticed on 30xx GPUs so far.

Yes, if the host modifies the memory, the device will see this modification. However, there will be a memcpy done by the driver over the PCIe bus after a page fault on the device side. The driver performs on-demand paging between host and device memory.

Considering that Pascal was the first NVIDIA GPU architecture supporting on-demand page migration for unified memory (hardware page faulting), there are likely differences between Pascal GPUs and the 3000 series that make operation with unified memory less efficient on Pascal.

Thanks again, but this is even more confusing.

The docs list a function to allocate unified memory: cuMemAllocManaged
CUDA Driver API :: CUDA Toolkit Documentation (nvidia.com)

I use cuMemAllocHost, which allocates ‘page locked’ memory. Is unified memory = page-locked memory?

Sorry to bother you, but this is exactly my problem in understanding. There are a lot of different functions and also a lot of names for these memory types, but I do not see any clear assignment or relation:

  1. page locked → cuMemAllocHost (maybe same as pinned? maybe same as zero copy memory???)
  2. unified memory → cuMemAllocManaged ? Memory that is managed by the ‘Unified memory System’ - does it work/behave like page locked?
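Since the question is phrased in terms of the driver API, the two allocations in the list above can be sketched side by side like this (a minimal sketch; context setup shown explicitly because the driver API requires it, error checking omitted):

```cpp
#include <cuda.h>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // 1. Page-locked (pinned) host memory: physically lives in host RAM,
    //    never paged out; returns an ordinary host pointer.
    void *host_ptr;
    cuMemAllocHost(&host_ptr, 1 << 20);

    // 2. Unified (managed) memory: the driver migrates pages between
    //    host and device on demand; one pointer is valid on both sides.
    CUdeviceptr managed_ptr;
    cuMemAllocManaged(&managed_ptr, 1 << 20, CU_MEM_ATTACH_GLOBAL);

    cuMemFreeHost(host_ptr);
    cuMemFree(managed_ptr);
    cuCtxDestroy(ctx);
    return 0;
}
```

The key difference: cuMemAllocHost memory stays put in host RAM, while cuMemAllocManaged memory can migrate to device memory behind your back.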

No, cbuchner introduced unified memory into this discussion. For the purpose of the discussion I see here, pinned memory == page locked memory == zero copy memory. Unified memory is quite different.

However, both pinned/pagelocked/zero-copy memory and unified memory (UM) have the characteristic that the allocation is accessible directly from host and device code. This is generally not true for allocations done using an ordinary host allocator such as new or malloc and also not true for ordinary device memory allocations using cudaMalloc or cuMemAlloc.

Therefore, people who are interested in using pinned/pagelocked/zero-copy memory because they want to directly access it using the same pointer from either host or device code, may also be interested in UM.

Apart from the above discussion, if you want to use an async transfer from/to host memory, and you want that async transfer to be able to overlap with other async operations (such as e.g. kernels), then the host memory participating in the async transfer must be pinned/pagelocked/zero-copy. This is really a separate idea from the characteristic of being accessed directly from host or device. pinned/pagelocked/zero-copy memory serves multiple distinct purposes in CUDA.
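A minimal sketch of that separate use case, where pinned memory exists purely so the copy can run asynchronously and overlap with other work (buffer names are illustrative):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;
    float *h_buf, *d_buf;

    // Pinned host memory is required for the copy to be truly asynchronous;
    // with pageable memory the driver falls back to a staged, blocking copy.
    cudaHostAlloc(&h_buf, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_buf, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Enqueues the transfer and returns immediately; it can overlap
    // with kernels running in other streams.
    cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... launch independent kernels here ...

    cudaStreamSynchronize(stream);  // wait for the copy to complete
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

Here the data genuinely moves from host RAM to device memory, which is why a copy makes sense even though pinned memory could also have been mapped for direct access.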

Yes, thanks a lot. This is the information I was looking for. Sometimes it is very confusing to read different expressions for the same thing.

Thanks again to all.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.