I am essentially trying to emulate the standard CUDA programming model with explicit memory copies while using unified memory. I do this because I want the same code to run on desktop GPUs with physically separate device and host memory, and on Jetson modules with shared physical memory. Unified memory with prefetching makes this very fast, and the code is easy to write for both kinds of hardware.
My processing workflow goes like this (a minimal code sketch follows the list):
- Device compute and write to array. (kernel1<<<m,n>>>(a);)
- Memory copy to host. (cudaMemcpy(b, a, DtoD); cudaMemPrefetchAsync(b, …); kernel2<<<m,n>>>(c);)
- Read-only processing on CPU.
- Repeat.
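Here is a minimal, self-contained sketch of that loop. kernel1, kernel2, hostProcess(), and all the sizes are placeholders I made up for illustration; the two streams are there so the prefetch can actually overlap kernel2:

```cpp
#include <cuda_runtime.h>

__global__ void kernel1(float *a) { a[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f; }
__global__ void kernel2(float *c) { c[blockIdx.x * blockDim.x + threadIdx.x] = 2.0f; }

void hostProcess(const float *b, size_t count) { /* read-only CPU work */ }

int main() {
    const int m = 4096, n = 256;                 // placeholder grid/block size
    const size_t bytes = size_t(m) * n * sizeof(float);

    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&c, bytes);
    cudaMallocManaged(&b, bytes);                // the unified-memory staging array

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    for (int iter = 0; iter < 10; ++iter) {
        // Step 1: device compute writes a.
        kernel1<<<m, n, 0, compute>>>(a);

        // Step 2: stage into the managed array, then migrate it toward the
        // CPU on the copy stream while kernel2 keeps the GPU busy.
        cudaMemcpyAsync(b, a, bytes, cudaMemcpyDeviceToDevice, compute);
        cudaStreamSynchronize(compute);          // b is now fully written
        cudaMemPrefetchAsync(b, bytes, cudaCpuDeviceId, copy);  // no-op on Jetson
        kernel2<<<m, n, 0, compute>>>(c);        // overlaps the prefetch

        // Step 3: read-only processing on the CPU.
        cudaStreamSynchronize(copy);
        hostProcess(b, size_t(m) * n);

        cudaStreamSynchronize(compute);
    }

    cudaFree(a); cudaFree(b); cudaFree(c);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    return 0;
}
```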
When I use unified memory, I can make step 2 a DtoD copy and then overlap the cudaMemPrefetchAsync with kernel execution, hiding the memcpy-to-host time entirely. On the Jetson this prefetch is simply ignored, and the data is already available to the CPU after the DtoD copy. If I used a cudaMemcpyAsync(…, DtoH) at this point I would get the desired result on desktop, but it would be very slow on the Jetson.
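For comparison, the explicit-copy alternative would look roughly like this (continuing the sketch above, with a hypothetical pinned host buffer b_host in place of the managed array):

```cpp
float *b_host;
cudaMallocHost(&b_host, bytes);  // pinned, so the async copy can overlap

// Fine on desktop; on Jetson this is a needless physical copy, since
// host and device already share the same DRAM.
cudaMemcpyAsync(b_host, a, bytes, cudaMemcpyDeviceToHost, copy);
kernel2<<<m, n, 0, compute>>>(c);            // overlaps the DtoH copy on desktop
cudaStreamSynchronize(copy);
hostProcess(b_host, size_t(m) * n);
```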
My issue is that in the next iteration, the DtoD copy into the UM array triggers lots of page faults, which cut throughput to about half of what I would expect for a DtoH copy. This is not surprising, since the array was last touched on the host. I can avoid the faults with another cudaMemPrefetchAsync after the CPU processing is done, to move the array back to the device, but that means migrating old data that is no longer relevant. With this prefetch in place I get full DtoD throughput on the copy, at the cost of an extra HtoD transfer before the next processing iteration can start. What I really want is to simply overwrite the data on the device, without synchronizing with the host until I explicitly do so.
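The workaround looks like this (again continuing the sketch above): after the CPU read, prefetch the now-stale pages back to the device so the next iteration's DtoD copy runs fault-free, at the price of an HtoD migration of data that is about to be overwritten anyway:

```cpp
hostProcess(b, size_t(m) * n);

int dev = 0;
cudaGetDevice(&dev);
cudaMemPrefetchAsync(b, bytes, dev, copy);   // migrates stale data back HtoD
cudaStreamSynchronize(copy);                 // next DtoD copy into b is fault-free
```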
I know this goes against the whole concept of unified memory, but I am using UM anyway for its other benefits in this use case. I have tried various cudaMemAdvise settings; with some of them I can avoid the page faults, but the throughput is still only what I would get from a DtoH copy.
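For reference, one combination along the lines of what I tried (illustrative only; the exact settings I tested varied). With advice like this the faults go away, but the effective throughput still ends up at DtoH levels:

```cpp
int dev = 0;
cudaGetDevice(&dev);
// Keep the pages preferentially resident on the GPU...
cudaMemAdvise(b, bytes, cudaMemAdviseSetPreferredLocation, dev);
// ...and map them for the CPU so its accesses don't fault and migrate.
cudaMemAdvise(b, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
```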
Is there a way to somehow advise CUDA to stop handling page faults while I overwrite the data on the device? Or any other way to achieve optimal performance on both architectures with the same code?