Maximizing Unified Memory Performance in CUDA

Originally published at: Maximizing Unified Memory Performance in CUDA | NVIDIA Technical Blog

Many of today’s applications process large volumes of data. While GPU architectures have very fast HBM or GDDR memory, they have limited capacity. Making the most of GPU performance requires the data to be as close to the GPU as possible. This is especially important for applications that iterate over the same data multiple times…

Nice! I have been waiting many years for this. Do you also intend to make this hardware feature available via OpenCL?

Thanks Nikolay! This article is very useful. I had a question. To clarify: pushing device-to-host prefetches into a busy stream helps because it allows the host-to-device prefetches to use the CPU first, since the CPU is only required for unmapping page table entries on the CPU side. Is that understanding correct? Secondly, you mentioned that the Linux UVM driver is open source - could you please provide a pointer to it?

You said "the input data (ptr) is allocated with cudaMallocManaged or cudaMallocHost and initially populated on the CPU." But the CUDA programming guide said the data allocated via cudaMallocManaged is hosted in physical GPU storage. Any conflicts?

Currently we do not have any plans to add support for this feature in OpenCL.

Data allocated via cudaMallocManaged is hosted initially in physical GPU memory only on pre-Pascal devices. On Pascal and beyond, when cudaMallocManaged is called the data is not populated until first touch, so it could end up on the CPU or the GPU. In my setup I write the data on the CPU to make sure it's resident in system memory before running any experiments. Thanks for reporting the issue with the programming guide, we'll fix the documentation.
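Here is a minimal sketch of that setup (hypothetical allocation size), assuming a Pascal or later GPU: the managed allocation is first touched on the CPU, so the pages start out resident in system memory before any kernels or prefetches run.

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 26;                       // hypothetical element count
    float *ptr = nullptr;
    cudaMallocManaged((void **)&ptr, n * sizeof(float));

    // First touch on the CPU: on Pascal+ the pages are populated here,
    // so the data begins resident in system memory.
    for (size_t i = 0; i < n; ++i)
        ptr[i] = 1.0f;

    // ... run kernels / prefetch experiments on ptr ...

    cudaFree(ptr);
    return 0;
}
```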

Yes, your understanding is correct. When you push the device-to-host prefetches into a busy stream they will be deferred, which releases the CPU sooner and allows you to submit the host-to-device prefetches right away. In this case the two prefetches can be overlapped. If submitted to an idle stream, the device-to-host prefetches will use the CPU for the whole duration of the prefetch call, and it won't be possible to launch prefetches in the other direction.
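For illustration, a minimal sketch of this pattern (hypothetical kernel, buffers and sizes): stream s1 is kept busy by a kernel, so the device-to-host prefetch enqueued behind it returns quickly, and the host-to-device prefetch on s2 can be issued immediately so the two migrations overlap.

```cpp
#include <cuda_runtime.h>

__global__ void busy_kernel(float *p, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    const size_t n = 1 << 26, bytes = n * sizeof(float);
    int device;
    cudaGetDevice(&device);

    float *a, *b;
    cudaMallocManaged((void **)&a, bytes);
    cudaMallocManaged((void **)&b, bytes);
    for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }  // first touch on the CPU
    cudaMemPrefetchAsync(a, bytes, device);                        // make a resident on the GPU
    cudaDeviceSynchronize();

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Keep s1 busy so the device-to-host prefetch of a is deferred
    // and the call below returns without blocking on CPU work.
    const unsigned blocks = (unsigned)((n + 255) / 256);
    busy_kernel<<<blocks, 256, 0, s1>>>(a, n);
    cudaMemPrefetchAsync(a, bytes, cudaCpuDeviceId, s1);  // deferred D2H prefetch
    cudaMemPrefetchAsync(b, bytes, device, s2);           // H2D prefetch can start right away

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```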

Regarding Unified Memory sources - please go here to download a public driver package: http://www.nvidia.com/objec.... You can then extract it via "sh NV*.run -x". Go to the kernel subdirectory within the extracted files directory, and you'll find all of the Unified Memory source code.

This article is very helpful : ) Are we able to do this with regular malloc calls (heterogeneous memory management)? Is there a performance difference?

You said in this blog: "Note that the Linux Unified Memory driver is open source, so keen developers can review what happens under the hood". Could you give me a link to find the source? Many thanks!

When trying to overlap a one-way host-to-device prefetch, you recommended that we launch kernels first and then call the prefetch, because the host-to-device prefetch is not completely asynchronous (the part that changes the TLB on the CPU is synchronous). But isn't this under the assumption that the prefetch will complete before the kernel accesses the unified memory region? How can we assume this? If we don't assume this, then there would be no point in prefetching, because if the kernel accesses the unified memory region before the prefetch is complete, it will raise a page fault. Please share with me what you think.
Thanks!

Glad to hear that you found this useful. The HMM driver is not production-ready yet, so I don't have any performance numbers to report, but stay tuned for updates!

Please see the first comment http://disq.us/p/1p7hvwb for instructions on how to obtain the sources.

Hi Joong, your understanding is correct - if the kernel tries to access memory that has not been prefetched, it will generate page faults. The idea is to set up a pipeline that chunks the dataset into parts A1, A2, ...: 1) prefetch A1 and wait until it's done, then 2) submit the kernel working on A1 and right after that issue the prefetch of A2, and continue for all the remaining chunks. Note that with cudaMemcpyAsync you can submit the copy before or after the kernel, since it's completely asynchronous, but with the prefetches you need to build your pipeline in such a way that you're not blocked by the CPU work in the prefetching calls. Hope it helps!
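A minimal sketch of that kind of pipeline (hypothetical kernel, chunk count and sizes - not the exact code from the post): the prefetch of the next chunk is issued right after the kernel launch on the current chunk, so the migration overlaps with the kernel instead of blocking the CPU.

```cpp
#include <cuda_runtime.h>

__global__ void process(float *p, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 26, num_chunks = 8;
    const size_t chunk = n / num_chunks, chunk_bytes = chunk * sizeof(float);
    int device;
    cudaGetDevice(&device);

    float *a;
    cudaMallocManaged((void **)&a, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) a[i] = 1.0f;     // populate on the CPU

    cudaStream_t s1, s2;                             // s1: kernels, s2: prefetches
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Step 1: prefetch the first chunk and wait for it before any kernel runs.
    cudaMemPrefetchAsync(a, chunk_bytes, device, s2);
    cudaStreamSynchronize(s2);

    const unsigned blocks = (unsigned)((chunk + 255) / 256);
    for (size_t c = 0; c < num_chunks; ++c) {
        // Step 2: kernel on the current chunk ...
        process<<<blocks, 256, 0, s1>>>(a + c * chunk, chunk);
        // ... and right after it, issue the prefetch of the next chunk,
        // which overlaps with the kernel running in s1.
        if (c + 1 < num_chunks)
            cudaMemPrefetchAsync(a + (c + 1) * chunk, chunk_bytes, device, s2);
        cudaStreamSynchronize(s2);  // next chunk resident before its kernel is launched
    }
    cudaStreamSynchronize(s1);
    cudaFree(a);
    return 0;
}
```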

Hello Nikolay, thanks for your reply! Your explanations, especially the details about the return behavior and stream behavior of the different prefetches, were REALLY helpful! Do you think you could explain such details for cudaMemcpyAsync (host-to-device, device-to-host, device-to-device) and how many memory transfers can occur in parallel (with respect to the copy engines)? I have read all the posts from NVIDIA regarding memcpy, but they don't go into as much detail as you do with prefetch.
Thanks again!

Hi Joong, sorry for the delayed response. cudaMemcpyAsync works differently than cudaMemPrefetchAsync, since the former does not require any CPU work if the source and the destination are pinned/mapped buffers (in system memory or GPU memory). In that case, the CPU just submits a task for the copy engine to execute the copy. Depending on how many copy engines are available on the corresponding GPU, you can run multiple copies in parallel. Hopefully this clarifies your question!
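A minimal sketch of that case (hypothetical buffer sizes): both host buffers are pinned with cudaMallocHost, so cudaMemcpyAsync just enqueues work for the copy engines, and with separate streams the two transfers can run in parallel on a GPU that has at least two copy engines.

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 24, bytes = n * sizeof(float);

    // Pinned (page-locked) host buffers: the copy engines can DMA directly,
    // so cudaMemcpyAsync returns after enqueueing, without the CPU copying data.
    float *h_in, *h_out, *d_a, *d_b;
    cudaMallocHost((void **)&h_in, bytes);
    cudaMallocHost((void **)&h_out, bytes);
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // H2D and D2H transfers in separate streams can overlap if the GPU has at
    // least two copy engines. (The contents of d_b don't matter for this sketch.)
    cudaMemcpyAsync(d_a, h_in, bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(h_out, d_b, bytes, cudaMemcpyDeviceToHost, s2);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```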

Hello Nikolay, thanks a lot! yes, it really helped.

Hi Nikolay, I have a concern about the warp-per-page approach: it seems that this reduces the number of active warps, and thus reduces the degree of parallelism.

Actually, in my experience I observed that the gld_throughput of the warp-per-page approach is lower than that of the normal kernel (without prefetching), though the execution time of the normal kernel is still longer due to more page faults. But I'm still concerned that the warp-per-page approach may not generalize well to other applications.

Is my concern reasonable?

I'm new to CUDA, so my thought may not make sense.

Each copy engine on the GPU can execute a separate memory transfer (registered host-to-device, or device-to-registered host). Note that you need to register (page-lock and set up mappings for) the host buffer to use the copy engines with cudaMemcpy (see cudaHostRegister); otherwise the CUDA driver will create a pipeline and stage the cudaMemcpy transfers through a small temporary pinned buffer. The latter may have an impact on performance, since the CPU will be involved in copying the memory, and for small sizes there can be overhead from setting up the pipeline. If you're copying from or to registered host buffers, it should be fairly easy to achieve good overlap by using separate CUDA streams for cudaMemcpyAsync. Sorry for the delayed reply, hope this helps!
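For reference, a minimal sketch of registering an existing host allocation (hypothetical sizes): after cudaHostRegister the copy engines can access the buffer directly, so cudaMemcpyAsync does not need to be staged through the driver's internal pinned buffer.

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 24, bytes = n * sizeof(float);

    // A plain malloc'd buffer: without registration, cudaMemcpyAsync from it
    // would be staged through a small internal pinned buffer by the driver.
    float *h = (float *)malloc(bytes);

    // Page-lock the buffer and set up the mappings so the copy engines
    // can DMA to/from it directly.
    cudaHostRegister(h, bytes, cudaHostRegisterDefault);

    float *d;
    cudaMalloc((void **)&d, bytes);

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);  // true async copy
    cudaStreamSynchronize(s);

    cudaHostUnregister(h);
    cudaFree(d);
    cudaStreamDestroy(s);
    free(h);
    return 0;
}
```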

Hi Harry, I agree that this is a valid concern and the warp-per-page approach may not be generally applicable to many applications. The point is to reduce the number of page faults if it's impossible to eliminate them completely. We're constantly working on improving the internal prefetching mechanism in the driver to support the most common application patterns, so hopefully that will alleviate the need to change the parallelization strategy. The warp-per-page approach I described is just an interim solution that also serves an educational purpose to showcase the profiling stats.
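For readers who want the general shape of the pattern, here is a minimal sketch of a warp-per-page access scheme (hypothetical kernel and launch configuration; it assumes 64 KB pages and float data, and is not the exact code from the post): each warp walks one page at a time, so faults are grouped per page rather than scattered across all warps.

```cpp
#include <cuda_runtime.h>

// Each warp iterates over whole 64 KB pages of the array, with the lanes
// striding within a page, so page faults are taken per page per warp.
__global__ void warp_per_page(float *p, size_t n) {
    const size_t elems_per_page = 65536 / sizeof(float);   // 16384 floats per page
    const size_t warps_per_grid = ((size_t)gridDim.x * blockDim.x) / warpSize;
    const size_t warp_id = ((size_t)blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    const size_t lane = threadIdx.x % warpSize;

    for (size_t page = warp_id; page * elems_per_page < n; page += warps_per_grid) {
        const size_t base = page * elems_per_page;
        for (size_t i = lane; i < elems_per_page && base + i < n; i += warpSize)
            p[base + i] += 1.0f;
    }
}

int main() {
    const size_t n = 1 << 26;
    float *a;
    cudaMallocManaged((void **)&a, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) a[i] = 1.0f;   // populate on the CPU

    warp_per_page<<<256, 256>>>(a, n);            // hypothetical launch config
    cudaDeviceSynchronize();
    cudaFree(a);
    return 0;
}
```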