I am essentially trying to emulate the standard CUDA programming model with explicit memory copies while using unified memory. I do this because I want the same code to run on desktop GPUs with physically separate device and host memory, and on Jetson modules with shared physical memory. Unified memory with prefetching makes this very fast, and the code is easy to write for both kinds of hardware.
My processing workflow goes like this (a minimal code sketch follows the list):
- Device compute and write to array. (kernel1<<<m,n>>>(a);)
- Memory copy to host. (cudaMemcpy(b, a, DtoD); cudaMemPrefetchAsync(b, …); kernel2<<<m,n>>>(c);)
- Read-only processing on CPU.
- Repeat.
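Here is a minimal, self-contained sketch of that loop. kernel1, kernel2, hostProcess(), and all the sizes are placeholders I made up for illustration; the two streams are there so the prefetch can actually overlap kernel2:

```cpp
#include <cuda_runtime.h>

__global__ void kernel1(float *a) { a[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f; }
__global__ void kernel2(float *c) { c[blockIdx.x * blockDim.x + threadIdx.x] = 2.0f; }

void hostProcess(const float *b, size_t count) { /* read-only CPU work */ }

int main() {
    const int m = 4096, n = 256;                 // placeholder grid/block size
    const size_t bytes = size_t(m) * n * sizeof(float);

    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&c, bytes);
    cudaMallocManaged(&b, bytes);                // the unified-memory staging array

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    for (int iter = 0; iter < 10; ++iter) {
        // Step 1: device compute writes a.
        kernel1<<<m, n, 0, compute>>>(a);

        // Step 2: stage into the managed array, then migrate it toward the
        // CPU on the copy stream while kernel2 keeps the GPU busy.
        cudaMemcpyAsync(b, a, bytes, cudaMemcpyDeviceToDevice, compute);
        cudaStreamSynchronize(compute);          // b is now fully written
        cudaMemPrefetchAsync(b, bytes, cudaCpuDeviceId, copy);  // no-op on Jetson
        kernel2<<<m, n, 0, compute>>>(c);        // overlaps the prefetch

        // Step 3: read-only processing on the CPU.
        cudaStreamSynchronize(copy);
        hostProcess(b, size_t(m) * n);

        cudaStreamSynchronize(compute);
    }

    cudaFree(a); cudaFree(b); cudaFree(c);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    return 0;
}
```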
When I use unified memory, I can make step 2 a DtoD copy and then overlap the cudaMemPrefetchAsync with kernel execution, hiding the memcpy-to-host time entirely. On the Jetson this prefetch is simply ignored, and the data is already available to the CPU after the DtoD copy. If I used a cudaMemcpyAsync(…, DtoH) at this point I would get the desired result on desktop, but it would be very slow on the Jetson.
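For comparison, the explicit-copy alternative would look roughly like this (continuing the sketch above, with a hypothetical pinned host buffer b_host in place of the managed array):

```cpp
float *b_host;
cudaMallocHost(&b_host, bytes);  // pinned, so the async copy can overlap

// Fine on desktop; on Jetson this is a needless physical copy, since
// host and device already share the same DRAM.
cudaMemcpyAsync(b_host, a, bytes, cudaMemcpyDeviceToHost, copy);
kernel2<<<m, n, 0, compute>>>(c);            // overlaps the DtoH copy on desktop
cudaStreamSynchronize(copy);
hostProcess(b_host, size_t(m) * n);
```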
My issue is that in the next iteration, the DtoD copy into the UM array triggers lots of page faults, which cut throughput to about half of what I would expect for a DtoH copy. This is not surprising, since the array was last touched on the host. I can avoid the faults with another cudaMemPrefetchAsync after the CPU processing is done, to move the array back to the device, but that means migrating old data that is no longer relevant. With this prefetch in place I get full DtoD throughput on the copy, at the cost of an extra HtoD transfer before the next processing iteration can start. What I really want is to simply overwrite the data on the device, without synchronizing with the host until I explicitly do so.
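The workaround looks like this (again continuing the sketch above): after the CPU read, prefetch the now-stale pages back to the device so the next iteration's DtoD copy runs fault-free, at the price of an HtoD migration of data that is about to be overwritten anyway:

```cpp
hostProcess(b, size_t(m) * n);

int dev = 0;
cudaGetDevice(&dev);
cudaMemPrefetchAsync(b, bytes, dev, copy);   // migrates stale data back HtoD
cudaStreamSynchronize(copy);                 // next DtoD copy into b is fault-free
```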
I know this goes against the whole concept of unified memory, but I am using UM anyway for its other benefits in this use case. I have tried various cudaMemAdvise settings; with some of them I can avoid the page faults, but the throughput is still only what I would get from a DtoH copy.
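For reference, one combination along the lines of what I tried (illustrative only; the exact settings I tested varied). With advice like this the faults go away, but the effective throughput still ends up at DtoH levels:

```cpp
int dev = 0;
cudaGetDevice(&dev);
// Keep the pages preferentially resident on the GPU...
cudaMemAdvise(b, bytes, cudaMemAdviseSetPreferredLocation, dev);
// ...and map them for the CPU so its accesses don't fault and migrate.
cudaMemAdvise(b, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
```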
Is there a way to somehow advise CUDA to stop handling page faults while I overwrite the data on the device? Or any other way to achieve optimal performance on both architectures with the same code?