Maximizing Unified Memory Performance in CUDA

Originally published at: Maximizing Unified Memory Performance in CUDA | NVIDIA Technical Blog

Many of today’s applications process large volumes of data. While GPU architectures have very fast HBM or GDDR memory, they have limited capacity. Making the most of GPU performance requires the data to be as close to the GPU as possible. This is especially important for applications that iterate over the same data multiple times…

Nice! I have been waiting many years for this. Do you also intend to make this hardware feature available via OpenCL?

Thanks Nikolay! This article is very useful. I had a question. To clarify: pushing device-to-host prefetches into a busy stream helps because it allows the host-to-device prefetches to use the CPU first, since the CPU is only required for unmapping page table entries on the CPU side. Is that understanding correct? Secondly, you mentioned that the Linux UVM driver is open source - could you please provide a pointer to it?

You said "the input data (ptr) is allocated with cudaMallocManaged or cudaMallocHost and initially populated on the CPU." But the CUDA programming guide said the data allocated via cudaMallocManaged is hosted in physical GPU storage. Any conflicts?

Currently we do not have any plans to add support for this feature in OpenCL.

Data allocated via cudaMallocManaged is hosted initially in physical GPU memory only on pre-Pascal devices. On Pascal and beyond, when cudaMallocManaged is called the data is not populated until first touch, so it could end up on the CPU or the GPU. In my setup I write the data on the CPU to make sure it's resident in system memory before running any experiments. Thanks for reporting the issue with the programming guide, we'll fix the documentation.
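Here is a minimal sketch of that setup (hypothetical allocation size), assuming a Pascal or later GPU: the managed allocation is first touched on the CPU, so the pages start out resident in system memory before any kernels or prefetches run.

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 26;                       // hypothetical element count
    float *ptr = nullptr;
    cudaMallocManaged((void **)&ptr, n * sizeof(float));

    // First touch on the CPU: on Pascal+ the pages are populated here,
    // so the data begins resident in system memory.
    for (size_t i = 0; i < n; ++i)
        ptr[i] = 1.0f;

    // ... run kernels / prefetch experiments on ptr ...

    cudaFree(ptr);
    return 0;
}
```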

Yes, your understanding is correct. When you push the device-to-host prefetches into a busy stream they will be deferred, which releases the CPU sooner and allows you to submit the host-to-device prefetches right away. In this case the two prefetches can be overlapped. If submitted to an idle stream, the device-to-host prefetches will use the CPU for the whole duration of the prefetch call, and it won't be possible to launch prefetches in the other direction.
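For illustration, a minimal sketch of this pattern (hypothetical kernel, buffers and sizes): stream s1 is kept busy by a kernel, so the device-to-host prefetch enqueued behind it returns quickly, and the host-to-device prefetch on s2 can be issued immediately so the two migrations overlap.

```cpp
#include <cuda_runtime.h>

__global__ void busy_kernel(float *p, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    const size_t n = 1 << 26, bytes = n * sizeof(float);
    int device;
    cudaGetDevice(&device);

    float *a, *b;
    cudaMallocManaged((void **)&a, bytes);
    cudaMallocManaged((void **)&b, bytes);
    for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }  // first touch on the CPU
    cudaMemPrefetchAsync(a, bytes, device);                        // make a resident on the GPU
    cudaDeviceSynchronize();

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Keep s1 busy so the device-to-host prefetch of a is deferred
    // and the call below returns without blocking on CPU work.
    const unsigned blocks = (unsigned)((n + 255) / 256);
    busy_kernel<<<blocks, 256, 0, s1>>>(a, n);
    cudaMemPrefetchAsync(a, bytes, cudaCpuDeviceId, s1);  // deferred D2H prefetch
    cudaMemPrefetchAsync(b, bytes, device, s2);           // H2D prefetch can start right away

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```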

Regarding Unified Memory sources - please go here to download a public driver package: http://www.nvidia.com/objec.... You can then extract it via "sh NV*.run -x". Go to the kernel subdirectory within the extracted files directory, and you'll find all of the Unified Memory source code.

This article is very helpful : ) Are we able to do this with regular malloc calls (heterogeneous memory management)? Is there a performance difference?

You said in this blog: "Note that the Linux Unified Memory driver is open source, so keen developers can review what happens under the hood". Could you give me a link to find the source? Many thanks!

When trying to overlap a one-way host-to-device prefetch, you recommended that we launch kernels first and then call the prefetch, because the host-to-device prefetch is not completely asynchronous (the part that changes the TLB on the CPU is synchronous). But isn't this under the assumption that the prefetch will complete before the kernel accesses the unified memory region? How can we assume this? If we don't assume this, then there would be no point in prefetching, because if the kernel accesses the unified memory region before the prefetch is complete, it will raise a page fault. Please share with me what you think.
Thanks!

Glad to hear that you found this useful. The HMM driver is not production-ready yet, so I don't have any performance numbers to report, but stay tuned for updates!

Please see the first comment http://disq.us/p/1p7hvwb for instructions on how to obtain the sources.

Hi Joong, your understanding is correct - if the kernel tries to access memory that has not been prefetched, it will generate page faults. The idea is to set up a pipeline that chunks the dataset into parts A1, A2, ...: 1) prefetch A1 and wait until it's done, then 2) submit the kernel working on A1 and right after that issue the prefetch of A2, and continue for all the remaining chunks. Note that with cudaMemcpyAsync you can submit the copy before or after the kernel, since it's completely asynchronous, but with the prefetches you need to build your pipeline in such a way that you're not blocked by the CPU work in the prefetching calls. Hope it helps!
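A minimal sketch of that kind of pipeline (hypothetical kernel, chunk count and sizes - not the exact code from the post): the prefetch of the next chunk is issued right after the kernel launch on the current chunk, so the migration overlaps with the kernel instead of blocking the CPU.

```cpp
#include <cuda_runtime.h>

__global__ void process(float *p, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 26, num_chunks = 8;
    const size_t chunk = n / num_chunks, chunk_bytes = chunk * sizeof(float);
    int device;
    cudaGetDevice(&device);

    float *a;
    cudaMallocManaged((void **)&a, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) a[i] = 1.0f;     // populate on the CPU

    cudaStream_t s1, s2;                             // s1: kernels, s2: prefetches
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Step 1: prefetch the first chunk and wait for it before any kernel runs.
    cudaMemPrefetchAsync(a, chunk_bytes, device, s2);
    cudaStreamSynchronize(s2);

    const unsigned blocks = (unsigned)((chunk + 255) / 256);
    for (size_t c = 0; c < num_chunks; ++c) {
        // Step 2: kernel on the current chunk ...
        process<<<blocks, 256, 0, s1>>>(a + c * chunk, chunk);
        // ... and right after it, issue the prefetch of the next chunk,
        // which overlaps with the kernel running in s1.
        if (c + 1 < num_chunks)
            cudaMemPrefetchAsync(a + (c + 1) * chunk, chunk_bytes, device, s2);
        cudaStreamSynchronize(s2);  // next chunk resident before its kernel is launched
    }
    cudaStreamSynchronize(s1);
    cudaFree(a);
    return 0;
}
```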

Hello Nikolay, thanks for your reply! Your explanations, especially the details about the return behavior and stream behavior of the different prefetches, were REALLY helpful! Do you think you could explain such details for cudaMemcpyAsync (host-to-device, device-to-host, device-to-device) and how many memory transfers can occur in parallel (with respect to the copy engines)? I have read all the posts from NVIDIA regarding memcpy, but they don't go into as much detail as you do with prefetch.
Thanks again!

Hi Joong, sorry for the delayed response. cudaMemcpyAsync works differently than cudaMemPrefetchAsync, since the former does not require any CPU work if the source and the destination are pinned/mapped buffers (in system memory or GPU memory). In that case, the CPU just submits a task for the copy engine to execute the copy. Depending on how many copy engines are available on the corresponding GPU, you can run multiple copies in parallel. Hopefully this clarifies your question!
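A minimal sketch of that case (hypothetical buffer sizes): both host buffers are pinned with cudaMallocHost, so cudaMemcpyAsync just enqueues work for the copy engines, and with separate streams the two transfers can run in parallel on a GPU that has at least two copy engines.

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 24, bytes = n * sizeof(float);

    // Pinned (page-locked) host buffers: the copy engines can DMA directly,
    // so cudaMemcpyAsync returns after enqueueing, without the CPU copying data.
    float *h_in, *h_out, *d_a, *d_b;
    cudaMallocHost((void **)&h_in, bytes);
    cudaMallocHost((void **)&h_out, bytes);
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // H2D and D2H transfers in separate streams can overlap if the GPU has at
    // least two copy engines. (The contents of d_b don't matter for this sketch.)
    cudaMemcpyAsync(d_a, h_in, bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(h_out, d_b, bytes, cudaMemcpyDeviceToHost, s2);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```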

Hello Nikolay, thanks a lot! yes, it really helped.

Hi Nikolay, I have a concern about the warp-per-page approach: it seems that this reduces the number of active warps, and thus reduces the degree of parallelism.

Actually, in my experience I observed that the gld_throughput of the warp-per-page approach is lower than that of the normal kernel (without prefetching), though the execution time of the normal kernel is still longer due to more page faults. But I'm still concerned that the warp-per-page approach may not generalize well to other applications.

Is my concern reasonable?

I'm new to CUDA, so my thought may not make sense.

Each copy engine on the GPU can execute a separate memory transfer (registered host-to-device, or device-to-registered host). Note that you need to register (page-lock and set up mappings for) the host buffer to use the copy engines with cudaMemcpy (see cudaHostRegister); otherwise the CUDA driver will create a pipeline and stage the cudaMemcpy transfers through a small temporary pinned buffer. The latter may have an impact on performance, since the CPU will be involved in copying the memory, and for small sizes there can be overhead from setting up the pipeline. If you're copying from or to registered host buffers, it should be fairly easy to achieve good overlap by using separate CUDA streams for cudaMemcpyAsync. Sorry for the delayed reply, hope this helps!
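For reference, a minimal sketch of registering an existing host allocation (hypothetical sizes): after cudaHostRegister the copy engines can access the buffer directly, so cudaMemcpyAsync does not need to be staged through the driver's internal pinned buffer.

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 24, bytes = n * sizeof(float);

    // A plain malloc'd buffer: without registration, cudaMemcpyAsync from it
    // would be staged through a small internal pinned buffer by the driver.
    float *h = (float *)malloc(bytes);

    // Page-lock the buffer and set up the mappings so the copy engines
    // can DMA to/from it directly.
    cudaHostRegister(h, bytes, cudaHostRegisterDefault);

    float *d;
    cudaMalloc((void **)&d, bytes);

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);  // true async copy
    cudaStreamSynchronize(s);

    cudaHostUnregister(h);
    cudaFree(d);
    cudaStreamDestroy(s);
    free(h);
    return 0;
}
```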

Hi Harry, I agree that this is a valid concern and the warp-per-page approach may not be generally applicable to many applications. The point is to reduce the number of page faults if it's impossible to eliminate them completely. We're constantly working on improving the internal prefetching mechanism in the driver to support the most common application patterns, so hopefully that will alleviate the need to change the parallelization strategy. The warp-per-page approach I described is just an interim solution that also serves an educational purpose to showcase the profiling stats.
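For readers who want the general shape of the pattern, here is a minimal sketch of a warp-per-page access scheme (hypothetical kernel and launch configuration; it assumes 64 KB pages and float data, and is not the exact code from the post): each warp walks one page at a time, so faults are grouped per page rather than scattered across all warps.

```cpp
#include <cuda_runtime.h>

// Each warp iterates over whole 64 KB pages of the array, with the lanes
// striding within a page, so page faults are taken per page per warp.
__global__ void warp_per_page(float *p, size_t n) {
    const size_t elems_per_page = 65536 / sizeof(float);   // 16384 floats per page
    const size_t warps_per_grid = ((size_t)gridDim.x * blockDim.x) / warpSize;
    const size_t warp_id = ((size_t)blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    const size_t lane = threadIdx.x % warpSize;

    for (size_t page = warp_id; page * elems_per_page < n; page += warps_per_grid) {
        const size_t base = page * elems_per_page;
        for (size_t i = lane; i < elems_per_page && base + i < n; i += warpSize)
            p[base + i] += 1.0f;
    }
}

int main() {
    const size_t n = 1 << 26;
    float *a;
    cudaMallocManaged((void **)&a, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) a[i] = 1.0f;   // populate on the CPU

    warp_per_page<<<256, 256>>>(a, n);            // hypothetical launch config
    cudaDeviceSynchronize();
    cudaFree(a);
    return 0;
}
```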