How intelligent is the Unified Memory runtime?

Currently the main program I work with requires datasets that are potentially larger than GPU device memory, so I have written a small paging system to ensure the required data is available on the device. With the new Unified Memory model introduced in CUDA 6.0, is this case handled, or is handling it on the roadmap?

For example, suppose we had a device with 1 GB of device memory and a dataset consisting of 1500 objects of 1 MB each. Currently my method is to allocate as much as possible on the device, allocate the remainder on the host, and copy a host object onto the device whenever it is required.
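A minimal sketch of that paging scheme (details simplified for illustration; the `Object` struct, slot counts, and `acquire` helper are placeholders, and error handling is omitted):

```cuda
#include <cuda_runtime.h>

// Objects that fit stay resident on the device; the rest live on the
// host and are staged into a scratch device buffer on demand.
const size_t OBJ_BYTES = 1 << 20;   // 1 MB per object

struct Object {
    void* dev_ptr;   // non-null if resident on the device
    void* host_ptr;  // non-null if spilled to host memory
};

// Return a device pointer for obj, copying it from the host if needed.
void* acquire(const Object& obj, void* staging_slot) {
    if (obj.dev_ptr) return obj.dev_ptr;          // already resident
    cudaMemcpy(staging_slot, obj.host_ptr, OBJ_BYTES,
               cudaMemcpyHostToDevice);           // page it in
    return staging_slot;
}
```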

My question is: which of the following scenarios describes how Unified Memory handles this?

Scenario 1) cudaMallocManaged returns a failure when attempting to allocate block 1001 (when device memory is full).

Scenario 2) cudaMallocManaged works as expected (objects 1-1000 on the device, 1001-1500 on the host); however, attempting to use object 1001 produces an error.

Scenario 3) cudaMallocManaged works as in scenario 2, and accessing object 1001 also works, but performance is poor because data is repeatedly copied from the host whenever it is accessed.

Scenario 4) cudaMallocManaged works as in scenario 2, and access is fast because the runtime copies the whole memory block into some spare space on the device. This may be beyond what the runtime is capable of, as it would require knowledge of how the data is going to be used.

It sounds like Unified Memory would be great for tidying up code, but my gut feeling is that performance may suffer because the runtime cannot magically know how things are going to be used in the future.

When the CUDA 6 release candidate was posted, I did some tests of the current behavior of Unified Memory here:

https://devtalk.nvidia.com/default/topic/695408/first-impressions-of-cuda-6-managed-memory/

I think NVIDIA ultimately wants Unified Memory to move 4096-byte pages between memory spaces at the point of use. The main obstacle right now seems to be that Kepler and first-generation Maxwell devices lack the hardware support needed to handle page faults on the device individually.
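If you want to confirm the current behavior on your own system, a quick (hypothetical) probe is to keep allocating 1 MB managed blocks until the call fails, which distinguishes scenario 1 from the others:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Probe: allocate 1 MB managed blocks until cudaMallocManaged fails.
// If the failure point roughly matches device memory size, allocation
// is capped by physical GPU memory (scenario 1 in the question).
int main() {
    const size_t OBJ_BYTES = 1 << 20;
    int count = 0;
    for (;;) {
        void* p = nullptr;
        cudaError_t err = cudaMallocManaged(&p, OBJ_BYTES);
        if (err != cudaSuccess) {
            printf("cudaMallocManaged failed after %d blocks: %s\n",
                   count, cudaGetErrorString(err));
            break;
        }
        ++count;
    }
    return 0;
}
```

(The blocks are deliberately leaked so the failure point reflects the total managed footprint.)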

Thanks for the response. That was a nice and detailed post, which unfortunately makes it look like scenario 1. I am not surprised that the runtime cannot magically determine the perfect strategy, but I was somewhat hoping for scenario 3 as an intermediate step. I guess now is not the time for me to migrate my existing codebase over to Unified Memory.

I think once CUDA devices have more complete virtual memory capabilities, your scenario 4 is completely doable. The current rules governing access to managed memory allocations make it safe to migrate memory pages to the device as they are needed, which would permit allocations to be larger than the GPU’s physical memory.
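In that hypothetical future, oversubscription would look like an ordinary managed allocation (this does not work on CUDA 6 / Kepler, where the allocation itself is capped by device memory; it is a sketch of what demand paging would permit):

```cuda
#include <cuda_runtime.h>

__global__ void touch(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] = 1.0f;   // first touch would fault the page onto the GPU
}

int main() {
    // Hypothetical: oversubscribe a 1 GB GPU with a 1.5 GB managed
    // allocation; with per-page device faults the runtime could
    // migrate 4 KB pages on demand as the kernel touches them.
    size_t n = (1500UL << 20) / sizeof(float);
    float* data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess)
        return 1;   // expected failure today when this exceeds device memory
    touch<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```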