How intelligent is the Unified Memory runtime?

Currently the main program I work with requires datasets that are potentially larger than GPU device memory, so I have written a small paging system to ensure the required data is available on the device. With the new Unified Memory model introduced in CUDA 6.0, is this case handled, or is handling it on the roadmap?

For example, suppose we had a device with 1 GB of device memory and a dataset consisting of 1500 objects of 1 MB each. Currently my method is to allocate as much as possible on the device, allocate the remainder on the host, and copy a host object onto the device whenever it is required.
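A minimal sketch of that paging scheme (details simplified for illustration; the `Object` struct, slot counts, and `acquire` helper are placeholders, and error handling is omitted):

```cuda
#include <cuda_runtime.h>

// Objects that fit stay resident on the device; the rest live on the
// host and are staged into a scratch device buffer on demand.
const size_t OBJ_BYTES = 1 << 20;   // 1 MB per object

struct Object {
    void* dev_ptr;   // non-null if resident on the device
    void* host_ptr;  // non-null if spilled to host memory
};

// Return a device pointer for obj, copying it from the host if needed.
void* acquire(const Object& obj, void* staging_slot) {
    if (obj.dev_ptr) return obj.dev_ptr;          // already resident
    cudaMemcpy(staging_slot, obj.host_ptr, OBJ_BYTES,
               cudaMemcpyHostToDevice);           // page it in
    return staging_slot;
}
```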

My question is: which of the following scenarios describes how Unified Memory handles this?

Scenario 1) cudaMallocManaged returns a failure when attempting to allocate block 1001 (when device memory is full).

Scenario 2) cudaMallocManaged works as expected (objects 1-1000 on the device, 1001-1500 on the host); however, attempting to use object 1001 produces an error.

Scenario 3) cudaMallocManaged works as in scenario 2, and accessing object 1001 also works, but performance is poor because data is repeatedly copied from the host whenever it is accessed.

Scenario 4) cudaMallocManaged works as in scenario 2, and access is fast because the runtime copies the whole memory block into some spare space on the device. This may be beyond what the runtime is capable of, as it would require knowledge of how the data is going to be used.

It sounds like Unified Memory would be great for tidying up code, but my gut feeling is that performance may suffer because the runtime cannot magically know how things are going to be used in the future.

When the CUDA 6 release candidate was posted, I did some tests of the current behavior of Unified Memory here:

https://devtalk.nvidia.com/default/topic/695408/first-impressions-of-cuda-6-managed-memory/

I think NVIDIA ultimately wants Unified Memory to move 4096-byte pages between memory spaces at the point of use. The main obstacle right now seems to be that Kepler and first-generation Maxwell devices lack the hardware support needed to handle page faults on the device individually.
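If you want to confirm the current behavior on your own system, a quick (hypothetical) probe is to keep allocating 1 MB managed blocks until the call fails, which distinguishes scenario 1 from the others:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Probe: allocate 1 MB managed blocks until cudaMallocManaged fails.
// If the failure point roughly matches device memory size, allocation
// is capped by physical GPU memory (scenario 1 in the question).
int main() {
    const size_t OBJ_BYTES = 1 << 20;
    int count = 0;
    for (;;) {
        void* p = nullptr;
        cudaError_t err = cudaMallocManaged(&p, OBJ_BYTES);
        if (err != cudaSuccess) {
            printf("cudaMallocManaged failed after %d blocks: %s\n",
                   count, cudaGetErrorString(err));
            break;
        }
        ++count;
    }
    return 0;
}
```

(The blocks are deliberately leaked so the failure point reflects the total managed footprint.)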

Thanks for the response. That was a nice and detailed post, which unfortunately makes it look like scenario 1. I am not surprised that the runtime cannot magically determine the perfect strategy, but I was somewhat hoping for scenario 3 as an intermediate step. I guess now is not the time for me to migrate my existing codebase over to Unified Memory.

I think once CUDA devices have more complete virtual memory capabilities, your scenario 4 is completely doable. The current rules governing access to managed memory allocations make it safe to migrate memory pages to the device as they are needed, which would permit allocations to be larger than the GPU’s physical memory.
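In that hypothetical future, oversubscription would look like an ordinary managed allocation (this does not work on CUDA 6 / Kepler, where the allocation itself is capped by device memory; it is a sketch of what demand paging would permit):

```cuda
#include <cuda_runtime.h>

__global__ void touch(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] = 1.0f;   // first touch would fault the page onto the GPU
}

int main() {
    // Hypothetical: oversubscribe a 1 GB GPU with a 1.5 GB managed
    // allocation; with per-page device faults the runtime could
    // migrate 4 KB pages on demand as the kernel touches them.
    size_t n = (1500UL << 20) / sizeof(float);
    float* data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess)
        return 1;   // expected failure today when this exceeds device memory
    touch<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```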