Unified Memory for CUDA Beginners

Thanks a lot of taking time for this :)
Just to be clear, I have one more question. So when a memory pointer is allocated using cudaMallocManaged(), and I want to initialize its value on the CPU using some function:
--> Will I still be needing to do the cudaMemcpy() and then before the kernel launch I would have to use the cudaMemPrefetchAsync() to make use of the unified memory ?
--> Or I do not need to use the cudaMemcpy() and just use prefecth before the kernel launch? If this is the case, how do I initialize the values of the pointer?

No, you don't need to cudaMemcpy, and the cudaMemPrefetchAsync is strictly optional. Even if you don't call it, page faults will cause it to migrate to the accessing processor/device automatically on demand. Prefetching is an optimization you can choose to make.

Hello Mark

I am willing to make the use of cudaMemPrefetchAsync()

Can you describe in brief the flow of the program in case I am going to use cudaMemPrefetchAsync() ?

I am trying to work with a float vector, for example. This vector is going to be used by a cuDNN function and I do not want the overhead added by cudaMemcpy() because this operation is very often in my program.

If I use cudaMallocManaged() for a vector and then use cudaMemPrefetchAsync(), where is the data being initialized (Note that I do not want to initialize the data to constants) ?

If you can point out the steps in brief, it would be really helpful.

Thanks in advance :)

Lakshay

Hello Mark Harris,

I am trying to run these examples on my surface book 2, and I have some issue : none of the 3 methods are improving my results.
When I run nvprof I have the following warning :
"Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda..."

My laptop has two gpu indeed (Intel UHD Graphics 620, and GeForce GTX 1050) but only one is CUDA-compatible. I tried using CUDA_VISIBLE_DEVICES as I read on some forums, but since CUDA only detects one GPU it does not produce any result.
You mentioned on a previous reply that this error might be due to a bug in CUDA 8, however, I am using CUDA v9.2.

Is the bug still present, or should I keep digging to solve this problem ?
Best

I have the same problem.
I have windows 10
GTX 1060.

Can you try the new driver (410.xx) that was just released? This may help.

Hello,
I just tested the Prefetching method and i got an avg time of 27.682 us on a RTX 2080. I didn't expect such a jump from Pascal to Turing.
Thank you for the tutorial.

Hi Mark,

I tried to test this code with double precision. it worked with the original code as well as the prefetching code. However, the GPU init function does not work with double precision. Even if I simply put the __global__ void init(int n, double *x, double *y) definition in the code without even calling it in main(), the code would have Segmentation Fault when running. if I change it back to __global__ void init(int n, float *x, float *y) (again without calling it in main()), but keep all other parts of the code with double precision, it will run without a problem. Any idea what was the problem? Thanks!

Can you provide a link to your modified code so I can take a quick look?

The udacity link to CS344 is broken. Is the class still available? Thanks for the many nice write-ups.

Robert, unfortunately that course is no longer offered from Udacity. However, you might be interested in the self-paced Fundamentals of Accelerated Computing with CUDAC/C++ course, which covers Unified Memory. Hope that helps!

1 Like

For future readers, the videos are still on Youtube (Not sure how much they reflect the actual course) at https://www.youtube.com/watch?v=hyKA5fb5ZJI

Hi Mark,
It was a wonderful article about unified memory.
I would just like to have one clarification. Let’s say a kernel running on the GPU accesses a page, which is not resident in the GPU memory and so it faults. Now it tries to get the corresponding page from CPU. Lets assume that the corresponding CPU page is not available in CPU RAM ,i.e., it was swapped out to disk previously on CPU. Now, there are following two possibilities in this scenario.

1. CPU swaps in the corresponding CPU page, which will be then migrated to GPU.
2. Since, the corresponding CPU page was not present in CPU RAM, the GPU request will not be satisfied.

My doubt is which of these 2 possibilities occur in the above described scenario?

Continuing with the previous doubt, I would also like to ask whether the CPU memory part in the unified memory will be page-locked (or pinned) or it will not be pinned memory.

Pascal and post-Pascal GPUs have the capability for hardware page-faulting. I would like to ask here whether size of a page of GPU is same as the size of a page of CPU.

Hello,

cudaMallocManaged allocations are never swapped out to disk. They are “pinned” by the GPU driver, so OS cannot swap that memory to disk.

Nikolay.

1 Like

Yes, Pascal+ GPUs support HW page faulting and on-demand migration. GPU supports multiple different page sizes, but we don’t document publicly what are these sizes.

1 Like

Hello,
My system has one CPU and one V100-32 GPU. My code uses two unified memory arrays, X and Y, which are allocated using cudaMallocManaged(). Elements of both arrays, X and Y, are only accessed in GPU kernel and not in the CPU host side. I have provided the screenshot of my code for your reference herewith.

The total execution time of the above program using unified memory is 54 times faster than the time required to just allocate arrays X and Y as pinned memory using cudaMallocHost() as follows:

cudaMallocHost(&x, N*sizeof(float));

cudaMallocHost(&y, N*sizeof(float));

My doubt is that, if cudaMallocHost() as well as cudaMallocManaged(), both allocate pinned memory on the CPU host side, then why just pinned memory allocation for arrays, X and Y, using cudaMallocHost() is 54 times slower than the whole program execution using unified memory for the same arrays X and Y.

SIDE NOTE- I have used clock() function to measure the execution times. Also cudaDeviceSynchronize() was used after the kernel to get correct time measurement.

cudaMallocManaged call itself does not allocate any memory. It only reserves the VA, and physical backing will be allocated on first touch, depending on the accessing processor. In your example above, memory will be allocated during add kernel execution: SMs will try to access the VA, trigger faults, Unified Memory driver will process the faults, and allocate GPU memory and will use larger GPU page sizes. It you were to touch the memory first on the host, the physical backing will be allocated there and may use different page sizes.

cudaMallocHost actually allocates physical memory on the CPU, so it’s more expensive than cudaMallocManaged call alone. Also, depending on your system, CPU may use smaller page sizes, so allocating and pinning memory on the CPU may take longer than doing the same on the GPU.

@user34605 hope that clears things up!

Hello,
Thanks for the reply. That was quite helpful. I just had one more small doubt regarding unified memory.

I completely understand the point that cudaMallocManaged() only reserves the VA, and physical backing will be allocated on first touch, depending on the accessing processor.

Now, lets say that I allocate one array X, whose size is equal to 10 CPU pages, using cudaMallocManaged(). And in my whole program, I access only first element of X ,i.e., X[0]. As per my understanding, CPU memory for whole array X will be allocated, when I access X[0], and not when I call cudaMallocManaged(X) (More specifically, on the first touch by CPU).

Now, which of the following two conditions will hold, when I first access X[0] on CPU?

  1. Only first CPU page for array X will be page-locked on accessing X[0] for the first time, as I am only accessing X[0].
  2. All 10 CPU pages for array X will be page-locked on accessing X[0] for the first time.