cudaStreamAttachMemAsync without the cost of prefetching on Tegra?

wsmlby · November 1, 2020, 10:55am

Since allocating and freeing CUDA memory is pretty expensive, I am trying to reuse the memory.

So for two rounds of processing (as an example), instead of:
cudaMallocManaged(size, cudaMemAttachHost) => write data from host => sync to device => use on device => free => cudaMallocManaged(size, cudaMemAttachHost) => write data from host => sync to device => use on device => free
I want to do:
cudaMallocManaged(size, cudaMemAttachHost) => write data from host => sync to device => use on device => reattach to host => write data from host => sync to device => use on device => free

However, for the "reattach to host" step, i.e. changing the managed memory from cudaMemAttachGlobal back to cudaMemAttachHost, I have to call cudaStreamAttachMemAsync followed by cudaStreamSynchronize, which causes the data to be prefetched from the device into the CPU cache.
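
For reference, here is a minimal sketch of the reuse flow described above, so the costly step is easy to point at. The kernel `scale`, the helper `run_rounds`, the `CHECK` macro, and the buffer size are hypothetical names and values chosen only for illustration, and "sync to device" is assumed to mean attaching the buffer with cudaMemAttachGlobal on the working stream before the launch. It only reproduces the pattern; it does not avoid the prefetch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a CUDA error and return from the calling function.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,              \
                         cudaGetErrorString(err_));                     \
            return 1;                                                   \
        }                                                               \
    } while (0)

// Hypothetical kernel standing in for "use on device".
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Runs `rounds` rounds on one managed buffer that is allocated once and reused.
// Stores the first element after the last round in *last_value.
int run_rounds(int rounds, float *last_value) {
    const int n = 1 << 20;  // illustrative size
    float *buf = nullptr;
    cudaStream_t stream;

    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMallocManaged(&buf, n * sizeof(float), cudaMemAttachHost));

    for (int round = 0; round < rounds; ++round) {
        // Write data from the host while the buffer is host-attached.
        for (int i = 0; i < n; ++i) buf[i] = static_cast<float>(round + 1);

        // Make the buffer accessible to the device, then use it there.
        CHECK(cudaStreamAttachMemAsync(stream, buf, 0, cudaMemAttachGlobal));
        scale<<<(n + 255) / 256, 256, 0, stream>>>(buf, n, 2.0f);
        CHECK(cudaGetLastError());

        // Reattach to the host: this attach + synchronize pair is the step
        // whose cost the question is about.
        CHECK(cudaStreamAttachMemAsync(stream, buf, 0, cudaMemAttachHost));
        CHECK(cudaStreamSynchronize(stream));
    }

    *last_value = buf[0];
    CHECK(cudaFree(buf));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}

int main() {
    float v = 0.0f;
    int rc = run_rounds(2, &v);
    std::printf("rc=%d first element=%f\n", rc, v);
    return rc;
}
```

The second cudaStreamAttachMemAsync call in each round, together with the cudaStreamSynchronize that follows it, is the "reattach to host" step; everything else is the allocate-once, reuse-many pattern.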

But since I am only reusing the memory, that prefetch is not needed (the contents of that memory are no longer needed at all), so it is a waste of time. Is there any way to just attach the memory to the host without the cost of prefetching?

Thanks.

njuffa · November 1, 2020, 5:34pm

Questions regarding NVIDIA’s embedded platforms usually receive better/faster answers in the sub-forums dedicated to them:
