Unified Memory for CUDA Beginners

I am asking this question because of the following observations:

The total execution time of the program below, which allocates array X as unified memory and accesses only element X[0], is 0.10 seconds.

[Screenshot: pinned.cu open in the Expanse Portal file editor]
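Since the screenshot is not reproduced here, below is only a rough sketch of the unified memory version (N, the float type, and the printf are assumptions, not the exact code in the file):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t N = 1UL << 30;            // assumed size (~4 GB of floats)
    float *x = nullptr;

    // Allocate X as unified memory; no physical pages are populated yet.
    cudaMallocManaged(&x, N * sizeof(float));

    // First touch from the CPU: only the page containing x[0] is populated.
    x[0] = 3.0f;

    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
    return 0;
}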

Whereas, allocating array X as pinned memory using cudaMallocHost(), as shown below, takes 11.26 seconds.

cudaMallocHost(&x, N*sizeof(float));
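For comparison, a sketch of the pinned-memory variant (again, the size is an assumption; only the allocation call differs):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t N = 1UL << 30;            // same assumed size as above

    // cudaMallocHost page-locks the entire N * sizeof(float) range up front,
    // so the allocation itself dominates the run time.
    float *x = nullptr;
    cudaMallocHost(&x, N * sizeof(float));

    x[0] = 3.0f;
    printf("x[0] = %f\n", x[0]);
    cudaFreeHost(x);
    return 0;
}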

If the first touch of X via the statement X[0] = 3.0; page-locked (pinned) the whole CPU memory for X, then the unified memory program above should have taken at least 11.26 seconds.

The following is not correct:

As per my understanding, CPU memory for whole array X will be allocated, when I access X[0]

Only the first page will be allocated when you access X[0]. The pages are allocated and migrated “on-demand”, so the Unified Memory driver will allocate and page-lock only the pages that you actually access.

Therefore, in your example, cudaMallocManaged plus accessing x[0] performs much faster than cudaMallocHost of the whole array.
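You can observe this on-demand behavior directly by timing the first touches yourself. The sketch below is only illustrative (the 4 KiB page size, the array size, and the std::chrono timing are assumptions):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t N = 1UL << 28;                      // assumed: 256M floats (~1 GB)
    const size_t kPageElems = 4096 / sizeof(float);  // elements per assumed 4 KiB page

    float *x = nullptr;
    cudaMallocManaged(&x, N * sizeof(float));

    auto t0 = std::chrono::steady_clock::now();
    x[0] = 3.0f;                                     // populates a single page
    auto t1 = std::chrono::steady_clock::now();

    for (size_t i = 0; i < N; i += kPageElems)
        x[i] = 3.0f;                                 // populates every page on demand
    auto t2 = std::chrono::steady_clock::now();

    printf("first touch of x[0]      : %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());
    printf("first touch of every page: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t2 - t1).count());

    cudaFree(x);
    return 0;
}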

Consider a hypothetical scenario where my GPU global memory is 4 bytes and all four bytes are usable. Assume that the GPU page size is a constant 1 byte, i.e., we have 4 usable GPU pages. Also assume that the GPU uses an LRU (least recently used) page replacement mechanism.

I allocate two char arrays X and Y as unified memory using cudaMallocManaged(). X and Y are each of size 30 bytes.

cudaMallocManaged((void **)&X, 30);
cudaMallocManaged((void **)&Y, 30);

Now 4 elements are initialized on the GPU in the following order:

X[0], Y[0], X[1], Y[1]

After 4 GPU pages are allocated for these elements, GPU global memory will look as shown below.
[Figure I1: GPU pages 1-4 holding X[0], Y[0], X[1], Y[1]]

Now, the GPU global memory is fully occupied. If we now try to initialize X[2] on the GPU, the first page, which contains X[0], should be replaced (as it was least recently used), and X[2] should occupy the first page. After this operation, GPU global memory should look as shown below.
[Figure I2: GPU pages 1-4 holding X[2], Y[0], X[1], Y[1]]
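In code, the access order I am describing would be something like the sketch below (the 4-byte GPU memory and 1-byte pages are of course hypothetical; on real hardware these touches all fit in memory and no eviction happens):

#include <cuda_runtime.h>

__global__ void touch(char *p, size_t i, char v) { p[i] = v; }

int main() {
    char *X = nullptr, *Y = nullptr;
    cudaMallocManaged(&X, 30);
    cudaMallocManaged(&Y, 30);

    touch<<<1, 1>>>(X, 0, 1);   // page holding X[0] is populated on the GPU
    touch<<<1, 1>>>(Y, 0, 1);   // page holding Y[0]
    touch<<<1, 1>>>(X, 1, 2);   // page holding X[1]
    touch<<<1, 1>>>(Y, 1, 2);   // page holding Y[1]: the hypothetical GPU is now full

    // On the hypothetical 4-page GPU, this touch would evict the LRU page (X[0])
    // and place X[2] in that physical slot.
    touch<<<1, 1>>>(X, 2, 3);

    cudaDeviceSynchronize();
    cudaFree(X);
    cudaFree(Y);
    return 0;
}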

I would just like to confirm whether my understanding of the GPU page replacement mechanism for unified memory arrays in the above example is correct.

I would also like to ask whether, when GPU pages are allocated for unified memory array elements on the GPU, the free GPU page with the lowest index is the one that gets used.

For example, again consider a scenario where my GPU global memory is 4 bytes and all four bytes are usable. Assume that the GPU page size is a constant 1 byte, i.e., we have 4 usable GPU pages.

I allocate a char array X as unified memory using cudaMallocManaged(). X is of size 30 bytes.

cudaMallocManaged((void **)&X, 30);

Now, I launch a kernel K1, which initializes only X[0]. Since the free GPU page with the lowest index is the first GPU page, X[0] should occupy the first GPU page. Then my global memory should look like this.
[Figure P1: X[0] in the first GPU page; pages 2-4 free]

Then, I launch a kernel K2, which initializes only X[4]. Since the free GPU page with the lowest index is now the second GPU page, X[4] should occupy the second GPU page. Then, my global memory should look like this.
[Figure P2: X[0] in the first GPU page, X[4] in the second; pages 3-4 free]
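The two kernels I mean would look roughly like this sketch (in reality GPU pages are much larger than 1 byte, so X[0] and X[4] would actually share a page):

#include <cuda_runtime.h>

__global__ void K1(char *x) { x[0] = 1; }   // initializes only X[0]
__global__ void K2(char *x) { x[4] = 1; }   // initializes only X[4]

int main() {
    char *X = nullptr;
    cudaMallocManaged((void **)&X, 30);

    K1<<<1, 1>>>(X);   // first GPU page fault: a physical page is chosen for X[0]
    cudaDeviceSynchronize();

    K2<<<1, 1>>>(X);   // second fault: which free page does X[4] get?
    cudaDeviceSynchronize();

    cudaFree(X);
    return 0;
}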

Through this example, I would just like to confirm whether my understanding of the GPU page allocation policy for a unified memory array is correct.

Hello @user34605,

Your understanding is generally correct for the first case with the X and Y arrays and how eviction works. I would clarify, though, that the “GPU global memory” view you’re showing is an abstract view of GPU physical memory. Regarding the second case with only the X array, the virtual address of x[4] can be mapped to any free region of GPU physical memory, i.e. it will not necessarily be placed right after x[0], but what you’re showing is one possibility.

If you’re interested in learning more about Unified Memory and having a deeper discussion with experts, I would also recommend attending the GTC 2022 virtual Connect with Experts session on “CUDA Memory Management” that we’ll be holding this March (similar to last year’s CUDA Memory Management | NVIDIA On-Demand).

Nikolay.

Hello,
Thanks for the clarification. I would just like to ask one more thing regarding the second case.

As you said, when we access a virtual address on the GPU that is not currently mapped there, a new page can be allocated for that virtual address from anywhere in the free GPU memory space. Is there any rule that is followed when selecting a page from free GPU memory for the corresponding virtual address, or is it selected arbitrarily?

The example program, while useful, does not explain system behavior when there are multiple GPU devices, concurrent accesses, and dynamic updates by the user to parameters such as PreferredLocation and Prefetch.

I would like to understand how these actions affect memory locality, mapping, and migration. I tried to go through the documentation on cudaMemAdvise() and cudaMemPrefetchAsync(), etc., but I would like this explained via examples. Is there such a document?
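For concreteness, these are the kinds of calls I am asking about (a sketch only; the device IDs, buffer size, and stream are placeholders):

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    float *buf = nullptr;
    cudaMallocManaged(&buf, bytes);

    // Hint: keep the physical pages of buf on device 3 when possible.
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, 3);

    // Explicitly migrate buf to device 2 on the default stream.
    cudaMemPrefetchAsync(buf, bytes, 2, 0);
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}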

Below is a simple scenario that I would like to understand.

Assume the system has 4 GPUs (latest generation) that support cudaMallocManaged memory. Let’s call them D1, D2, D3, and D4.

1. The user allocates a buffer and accesses it on D3.
2. The user sets the PreferredLocation to D3.
3. The user issues a Prefetch to D2. At this point, is the PreferredLocation of the buffer still D3? I would like to think so.
4. The user now runs a kernel on D4. This is a case where the buffer is resident on D2 and is being accessed by D4. My reading is that the buffer will migrate from D2 to D4.
5. Assume the kernel on D4 is very long running. The user now launches another kernel on D1. What happens? Will the buffer migrate to D1 or remain on D4? (The sketch below shows the call sequence I mean.)
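This is a sketch only; the buffer size and kernel are placeholders, and D1-D4 are mapped to devices 0-3:

#include <cuda_runtime.h>

__global__ void work(float *p, size_t n) {
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        p[i] += 1.0f;
}

int main() {
    const int D1 = 0, D2 = 1, D3 = 2, D4 = 3;
    const size_t n = 1 << 20, bytes = n * sizeof(float);

    float *buf = nullptr;
    cudaMallocManaged(&buf, bytes);

    // 1-2. Access the buffer on D3 and set its preferred location to D3.
    cudaSetDevice(D3);
    work<<<64, 256>>>(buf, n);
    cudaDeviceSynchronize();
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, D3);

    // 3. Prefetch the buffer to D2. Is the preferred location still D3?
    cudaMemPrefetchAsync(buf, bytes, D2, 0);
    cudaDeviceSynchronize();

    // 4. Run a kernel on D4 while the buffer is resident on D2.
    cudaSetDevice(D4);
    work<<<64, 256>>>(buf, n);

    // 5. While the D4 kernel is still running, launch another kernel on D1.
    //    (The concurrent writes are only there to mirror the scenario above.)
    cudaSetDevice(D1);
    work<<<64, 256>>>(buf, n);

    cudaSetDevice(D4);
    cudaDeviceSynchronize();
    cudaSetDevice(D1);
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}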

I can think of use cases for which I can’t derive the system’s behavior in a very clear fashion.

Regards,
Ramesh