Currently the main program I work with requires datasets that are potentially larger than GPU device memory, so I have written a small paging system to ensure the data required is available on the device. With the new Unified Memory model introduced in CUDA 6.0, is this case handled now, or is it on the roadmap?
For example, suppose we have a device with 1 GB of device memory and a dataset consisting of 1500 objects of 1 MB each. My current method is to allocate as much as possible on the device, allocate the remainder on the host, and copy each host object to the device when it is needed.
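To make the current scheme concrete, here is a minimal sketch of the kind of manual paging described above: fill device memory first, keep the overflow in pinned host memory, and copy a host-resident object into a device staging buffer on demand. All names here (`Slot`, `acquire`, `stage`, the constants) are illustrative, not from any real API, and error handling is mostly omitted.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

#define OBJ_BYTES (1 << 20)   // 1 MB per object
#define NUM_OBJECTS 1500

struct Slot {
    void *ptr;       // device or host pointer, depending on where it landed
    bool onDevice;
};

Slot slots[NUM_OBJECTS];
void *stage = NULL;  // device buffer used to page host objects in

void allocateAll(void)
{
    // Reserve the staging buffer first so it is guaranteed device space.
    cudaMalloc(&stage, OBJ_BYTES);

    for (int i = 0; i < NUM_OBJECTS; ++i) {
        // Try the device first; fall back to pinned host memory when full.
        if (cudaMalloc(&slots[i].ptr, OBJ_BYTES) == cudaSuccess) {
            slots[i].onDevice = true;
        } else {
            cudaGetLastError();  // clear the error left by the failed alloc
            cudaMallocHost(&slots[i].ptr, OBJ_BYTES);
            slots[i].onDevice = false;
        }
    }
}

// Returns a device pointer for object i, copying it to the device first
// if it currently lives on the host.
void *acquire(int i)
{
    if (slots[i].onDevice)
        return slots[i].ptr;
    cudaMemcpy(stage, slots[i].ptr, OBJ_BYTES, cudaMemcpyHostToDevice);
    return stage;
}
```

A single staging buffer like this forces a host-to-device copy on every access to an overflow object, which is exactly the bookkeeping I am hoping Unified Memory can take over.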
My question is: how does Unified Memory handle this situation?
Scenario 1) cudaMallocManaged returns a failure when attempting to allocate block 1001 (once device memory is full).
Scenario 2) cudaMallocManaged works as expected (blocks 1-1000 on the device, 1001-1500 on the host), but attempting to use object 1001 produces an error.
Scenario 3) cudaMallocManaged allocates as in scenario 2, and accessing object 1001 works, but performance is poor due to repeated host-to-device copies whenever the data is accessed.
Scenario 4) cudaMallocManaged works, and access is fast because the runtime copies the whole memory block into some spare space on the device. This may be beyond what the runtime is capable of, as it would require knowledge of how the data is going to be used.
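Scenario 1 at least is easy to probe experimentally. The sketch below (an assumed test harness, not production code) keeps calling cudaMallocManaged until it fails and reports how far it got; if allocation stops short of the full 1500 blocks on a 1 GB card, that would point to scenario 1.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main(void)
{
    const size_t objBytes = 1 << 20;  // 1 MB per object
    const int total = 1500;
    static void *blocks[1500];

    int i;
    for (i = 0; i < total; ++i) {
        cudaError_t err = cudaMallocManaged(&blocks[i], objBytes);
        if (err != cudaSuccess) {
            printf("cudaMallocManaged failed at block %d: %s\n",
                   i, cudaGetErrorString(err));
            break;
        }
    }
    printf("successfully allocated %d of %d managed blocks\n", i, total);

    for (int j = 0; j < i; ++j)
        cudaFree(blocks[j]);
    return 0;
}
```

Distinguishing scenarios 2-4 would additionally need a kernel that touches the later blocks, plus timing, but the allocation loop alone settles whether oversubscription is even possible.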
It sounds like it would be great for tidying up code, but my current gut feeling is that performance may suffer because the runtime cannot magically know how the data is going to be used in the future.