Unified Memory for CUDA Beginners

anon77772973 · August 7, 2018, 5:09pm

Thanks a lot for taking the time for this :)
Just to be clear, I have one more question. When memory is allocated with cudaMallocManaged() and I want to initialize its values on the CPU using some function:
--> Do I still need to call cudaMemcpy(), and then call cudaMemPrefetchAsync() before the kernel launch to make use of unified memory?
--> Or can I skip cudaMemcpy() and just prefetch before the kernel launch? If so, how do I initialize the values behind the pointer?

anon95180265 · August 7, 2018, 8:59pm

No, you don't need cudaMemcpy, and cudaMemPrefetchAsync is strictly optional. Even if you don't call it, page faults will cause the data to migrate to the accessing processor/device automatically on demand. Prefetching is an optimization you can choose to make.
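The flow described in this answer can be sketched roughly as follows: allocate with cudaMallocManaged(), initialize through the same pointer in ordinary host code, optionally prefetch, launch, synchronize, then read the result on the CPU. This is only a minimal sketch under a few assumptions: the add kernel and the helper name run_prefetch_flow are illustrative, the initial values are arbitrary, and the four-argument cudaMemPrefetchAsync(ptr, bytes, device, stream) form is the one from the CUDA 8 to 12 toolkits (newer toolkits may expect a different signature). Prefetching is treated as best effort, since it can fail on configurations without concurrent managed access.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: y[i] = x[i] + y[i], using a grid-stride loop.
__global__ void add(int n, const float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

// Hypothetical helper: runs the whole flow and returns the maximum error
// against the expected result, or -1.0f if a required CUDA call fails.
float run_prefetch_flow(int n)
{
  size_t bytes = static_cast<size_t>(n) * sizeof(float);
  float *x = nullptr, *y = nullptr;

  // 1. Allocate managed memory; the same pointers are valid on CPU and GPU.
  if (cudaMallocManaged(&x, bytes) != cudaSuccess) return -1.0f;
  if (cudaMallocManaged(&y, bytes) != cudaSuccess) { cudaFree(x); return -1.0f; }

  // 2. Initialize on the CPU with ordinary host code. No cudaMemcpy is needed.
  for (int i = 0; i < n; i++) {
    x[i] = static_cast<float>(i % 16);  // arbitrary non-constant values
    y[i] = 2.0f;
  }

  // 3. Optional optimization: move the pages to the GPU before the kernel runs.
  int device = 0;
  cudaGetDevice(&device);
  if (cudaMemPrefetchAsync(x, bytes, device, 0) != cudaSuccess ||
      cudaMemPrefetchAsync(y, bytes, device, 0) != cudaSuccess) {
    cudaGetLastError();  // prefetch not supported here; clear the error and continue
  }

  // 4. Launch the kernel on the managed pointers.
  int blockSize = 256;
  int numBlocks = (n + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(n, x, y);

  // 5. Wait for the GPU before touching the data on the CPU again.
  if (cudaDeviceSynchronize() != cudaSuccess) {
    cudaFree(x);
    cudaFree(y);
    return -1.0f;
  }

  // 6. Read the results directly on the host.
  float maxError = 0.0f;
  for (int i = 0; i < n; i++) {
    float expected = static_cast<float>(i % 16) + 2.0f;
    maxError = std::fmax(maxError, std::fabs(y[i] - expected));
  }

  cudaFree(x);
  cudaFree(y);
  return maxError;
}
```

Calling run_prefetch_flow(1 << 20) from main() and printing the returned value is enough to try it. Removing step 3 should leave the result unchanged, because the pages then migrate on demand instead.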

anon77772973 · August 13, 2018, 8:57pm

Hello Mark

I would like to use cudaMemPrefetchAsync().

Can you briefly describe the flow of the program when cudaMemPrefetchAsync() is used?

I am trying to work with a float vector, for example. This vector is going to be used by a cuDNN function, and I do not want the overhead added by cudaMemcpy() because this operation occurs very often in my program.

If I use cudaMallocManaged() for a vector and then call cudaMemPrefetchAsync(), where is the data initialized? (Note that I do not want to initialize the data to constants.)

If you can point out the steps in brief, it would be really helpful.

Thanks in advance :)

Lakshay

anon50878943 · August 22, 2018, 1:50am

Hello Mark Harris,

I am trying to run these examples on my Surface Book 2, and I have an issue: none of the three methods improves my results.
When I run nvprof, I get the following warning:
"Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda..."

My laptop does have two GPUs (Intel UHD Graphics 620 and GeForce GTX 1050), but only one is CUDA-compatible. I tried using CUDA_VISIBLE_DEVICES, as I read on some forums, but since CUDA only detects one GPU it has no effect.
You mentioned in a previous reply that this error might be due to a bug in CUDA 8; however, I am using CUDA v9.2.

Is the bug still present, or should I keep digging to solve this problem?
Best

anon88276607 · September 18, 2018, 2:29pm

I have the same problem.
I have Windows 10 and a
GTX 1060.

anon95180265 · September 21, 2018, 1:30am

Can you try the new driver (410.xx) that was just released? This may help.
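For anyone hitting the same warning, one way to see what the runtime itself reports is to query each CUDA device's managed-memory attributes. The sketch below uses the cudaDevAttrManagedMemory and cudaDevAttrConcurrentManagedAccess attributes; the function name report_managed_memory_support is made up for illustration. If concurrentManagedAccess comes back as 0 (which, as far as I know, is what Windows setups of that era report even on Pascal GPUs), on-demand page migration and cudaMemPrefetchAsync are not available, which would be consistent with the three methods making no difference.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the Unified Memory attributes of every CUDA
// device and returns the number of devices found (0 if none or on error).
int report_managed_memory_support()
{
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess) {
    printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    return 0;
  }

  for (int dev = 0; dev < count; dev++) {
    cudaDeviceProp prop;
    int managed = 0;
    int concurrent = 0;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;

    // 1 if the device can allocate managed memory at all.
    cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, dev);
    // 1 if CPU and GPU can access managed memory concurrently
    // (needed for page-fault migration and prefetching).
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);

    printf("Device %d: %s (compute capability %d.%d)\n",
           dev, prop.name, prop.major, prop.minor);
    printf("  managedMemory           = %d\n", managed);
    printf("  concurrentManagedAccess = %d\n", concurrent);
  }
  return count;
}
```

Only CUDA-capable devices are listed, so an integrated Intel GPU will not appear in this output.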

anon89175139 · September 29, 2018, 10:05am

Hello,
I just tested the prefetching method and I got an average time of 27.682 us on an RTX 2080. I didn't expect such a jump from Pascal to Turing.
Thank you for the tutorial.

anon24719381 · November 10, 2018, 2:42pm

Hi Mark,

I tried to test this code with double precision. It worked with the original code as well as the prefetching code. However, the GPU init function does not work with double precision. Even if I simply put the __global__ void init(int n, double *x, double *y) definition in the code without calling it in main(), the program crashes with a segmentation fault when run. If I change it back to __global__ void init(int n, float *x, float *y) (again without calling it in main()) but keep all other parts of the code in double precision, it runs without a problem. Any idea what the problem is? Thanks!

anon95180265 · November 11, 2018, 10:05pm

Can you provide a link to your modified code so I can take a quick look?
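For reference, a double-precision variant of the GPU-initialized example might look like the sketch below. It is not the poster's code and does not claim to reproduce or explain the reported crash. It only shows the double version of init and add with every CUDA call checked, so that a failed allocation or launch is reported as an error message instead of surfacing later as a host-side segmentation fault. The helper names ok and run_double_precision are made up for illustration.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Prints a message and returns false when a CUDA call fails.
static bool ok(cudaError_t err, const char *what)
{
  if (err == cudaSuccess) return true;
  fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
  return false;
}

// Double-precision initialization on the GPU.
__global__ void init(int n, double *x, double *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride) {
    x[i] = 1.0;
    y[i] = 2.0;
  }
}

// Double-precision element-wise add: y[i] = x[i] + y[i].
__global__ void add(int n, const double *x, double *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

// Hypothetical helper: returns the maximum error (expected value is 3.0),
// or -1.0 if any CUDA call fails.
double run_double_precision(int n)
{
  size_t bytes = static_cast<size_t>(n) * sizeof(double);
  double *x = nullptr, *y = nullptr;
  if (!ok(cudaMallocManaged(&x, bytes), "cudaMallocManaged(x)")) return -1.0;
  if (!ok(cudaMallocManaged(&y, bytes), "cudaMallocManaged(y)")) {
    cudaFree(x);
    return -1.0;
  }

  int blockSize = 256;
  int numBlocks = (n + blockSize - 1) / blockSize;
  init<<<numBlocks, blockSize>>>(n, x, y);
  add<<<numBlocks, blockSize>>>(n, x, y);

  // Check both the launch and the execution before reading on the host.
  bool finished = ok(cudaGetLastError(), "kernel launch") &&
                  ok(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

  double maxError = -1.0;
  if (finished) {
    maxError = 0.0;
    for (int i = 0; i < n; i++)
      maxError = std::fmax(maxError, std::fabs(y[i] - 3.0));
  }

  cudaFree(x);
  cudaFree(y);
  return maxError;
}
```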

robert.lugg · December 17, 2020, 10:24am

The Udacity link to CS344 is broken. Is the class still available? Thanks for the many nice write-ups.

jwitsoe · December 17, 2020, 6:57pm

Robert, unfortunately that course is no longer offered by Udacity. However, you might be interested in the self-paced Fundamentals of Accelerated Computing with CUDA C/C++ course, which covers Unified Memory. Hope that helps!

robert.lugg · December 18, 2020, 11:29pm

For future readers, the videos are still on YouTube (not sure how much they reflect the actual course): "Using NVVP Part1 - Intro to Parallel Programming".

user34605 · November 9, 2021, 5:57am

Hi Mark,
It was a wonderful article about unified memory.
I would just like one clarification. Let's say a kernel running on the GPU accesses a page that is not resident in GPU memory, so it faults. It then tries to get the corresponding page from the CPU. Let's assume that the corresponding CPU page is not available in CPU RAM, i.e., it was previously swapped out to disk. There are two possibilities in this scenario:

1. The CPU swaps in the corresponding page, which is then migrated to the GPU.
2. Since the corresponding page is not present in CPU RAM, the GPU request is not satisfied.

My question is: which of these two possibilities occurs in the scenario described above?

user34605 · November 9, 2021, 5:59am

Continuing from the previous question, I would also like to ask whether the CPU memory backing unified memory is page-locked (pinned) or not.

user34605 · November 9, 2021, 9:53am

Pascal and post-Pascal GPUs are capable of hardware page faulting. I would like to ask whether the GPU page size is the same as the CPU page size.

nsakharnykh · November 11, 2021, 7:35pm

Hello,

cudaMallocManaged allocations are never swapped out to disk. They are “pinned” by the GPU driver, so the OS cannot swap that memory to disk.

Nikolay.

nsakharnykh · November 11, 2021, 7:41pm

Yes, Pascal+ GPUs support HW page faulting and on-demand migration. The GPU supports multiple different page sizes, but we don’t publicly document what these sizes are.

user34605 · November 27, 2021, 8:13am

Hello,
My system has one CPU and one V100-32 GPU. My code uses two unified memory arrays, X and Y, which are allocated using cudaMallocManaged(). Elements of both arrays are accessed only in the GPU kernel and not on the CPU host side. I have attached a screenshot of my code for reference.

The whole program above, using unified memory, runs 54 times faster than just allocating arrays X and Y as pinned memory with cudaMallocHost(), as follows:

cudaMallocHost(&x, N*sizeof(float));

cudaMallocHost(&y, N*sizeof(float));

My question is: if cudaMallocHost() and cudaMallocManaged() both allocate pinned memory on the CPU host side, why is the pinned allocation of X and Y with cudaMallocHost() alone 54 times slower than the entire program using unified memory for the same arrays?

Side note: I used the clock() function to measure the execution times, and called cudaDeviceSynchronize() after the kernel to get a correct measurement.

nsakharnykh · November 30, 2021, 12:38am

The cudaMallocManaged call itself does not allocate any memory. It only reserves the VA, and the physical backing is allocated on first touch, depending on the accessing processor. In your example above, memory will be allocated during execution of the add kernel: the SMs try to access the VA and trigger faults, and the Unified Memory driver processes the faults, allocates GPU memory, and uses larger GPU page sizes. If you were to touch the memory first on the host, the physical backing would be allocated there and may use different page sizes.

cudaMallocHost actually allocates physical memory on the CPU, so it’s more expensive than the cudaMallocManaged call alone. Also, depending on your system, the CPU may use smaller page sizes, so allocating and pinning memory on the CPU may take longer than doing the same on the GPU.

@user34605 hope that clears things up!
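The difference described in this reply can be observed by timing the three steps separately: the cudaMallocManaged() call, the first touch from a kernel, and the cudaMallocHost() call. The following is a rough sketch under some assumptions: the touch kernel and the helper name compare_allocation_cost are illustrative, std::chrono is used in place of clock(), and cudaFree(0) is called first so that one-time context creation is not charged to the first timed call. The absolute numbers will vary with the system and driver, so none are quoted here.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

// Milliseconds elapsed since 'start'.
static double elapsed_ms(Clock::time_point start)
{
  return std::chrono::duration<double, std::milli>(Clock::now() - start).count();
}

// Illustrative kernel: writes every element so the pages are first touched on the GPU.
__global__ void touch(size_t n, float *x)
{
  size_t index = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  size_t stride = static_cast<size_t>(blockDim.x) * gridDim.x;
  for (size_t i = index; i < n; i += stride)
    x[i] = 1.0f;
}

// Hypothetical helper: prints the three timings; returns 0 on success, 1 on a CUDA error.
int compare_allocation_cost(size_t n)
{
  size_t bytes = n * sizeof(float);
  float *managed = nullptr;
  float *pinned = nullptr;

  // Create the CUDA context up front so it does not distort the first measurement.
  if (cudaFree(0) != cudaSuccess) return 1;

  // Managed allocation: reserves virtual address space only.
  Clock::time_point t0 = Clock::now();
  if (cudaMallocManaged(&managed, bytes) != cudaSuccess) return 1;
  double managedAllocMs = elapsed_ms(t0);

  // First touch on the GPU: this is where the physical backing gets allocated.
  // The measurement also includes kernel launch overhead.
  t0 = Clock::now();
  touch<<<256, 256>>>(n, managed);
  if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(managed); return 1; }
  double firstTouchMs = elapsed_ms(t0);

  // Pinned host allocation: physical CPU memory is allocated and page-locked here.
  t0 = Clock::now();
  if (cudaMallocHost(&pinned, bytes) != cudaSuccess) { cudaFree(managed); return 1; }
  double pinnedAllocMs = elapsed_ms(t0);

  printf("cudaMallocManaged call:     %.3f ms\n", managedAllocMs);
  printf("first touch in GPU kernel:  %.3f ms\n", firstTouchMs);
  printf("cudaMallocHost call:        %.3f ms\n", pinnedAllocMs);

  cudaFreeHost(pinned);
  cudaFree(managed);
  return 0;
}
```

A fair comparison with cudaMallocHost() should add the first two numbers together, since the managed allocation only pays for physical memory when it is first touched.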

user34605 · December 1, 2021, 7:12pm

Hello,
Thanks for the reply. That was quite helpful. I just had one more small doubt regarding unified memory.

I completely understand the point that cudaMallocManaged() only reserves the VA, and physical backing will be allocated on first touch, depending on the accessing processor.

Now, let's say that I allocate one array X, whose size is equal to 10 CPU pages, using cudaMallocManaged(), and in my whole program I access only the first element of X, i.e., X[0]. As I understand it, CPU memory for the whole array X will be allocated when I first access X[0] on the CPU (that is, on the first touch by the CPU), and not when I call cudaMallocManaged(X).

Which of the following two conditions will hold when I first access X[0] on the CPU?

1. Only the first CPU page of array X will be page-locked on the first access to X[0], as I am only accessing X[0].
2. All 10 CPU pages of array X will be page-locked on the first access to X[0].
