How does cudaMemcpyAsync work with non-page-locked memory?

shawda.ssik600015 · August 16, 2023, 9:30am

I am testing a program that looks like this:

kernel_func<<<1, 32, 0, streams[0]>>>(data, 0);
cudaMemcpyAsync(data, host_ptr, nbytes, cudaMemcpyHostToDevice, streams[2]);  // nbytes: size of the copy in bytes
kernel_func<<<1, 32, 0, streams[1]>>>(data, 1);

After launching a kernel on streams[0] and before launching one on streams[1], an async memcpy is issued to streams[2]; it modifies the input data used by kernel_func. If host_ptr is page-locked, the copy is asynchronous and may overlap with kernel_func on streams[1], so the result of kernel_func may not reflect the data changes from the async memcpy. But what if host_ptr is not page-locked?

I have tried using a host_ptr that points to a static host array, and the result shows that kernel_func on streams[1] has seen the data changes. That works well. But when I add another kernel_func launch before the copy, like:

kernel_func<<<1, 32, 0, streams[0]>>>(data, 0);
kernel_func<<<1, 32, 0, streams[2]>>>(data, 0);
cudaMemcpyAsync(data, host_ptr, nbytes, cudaMemcpyHostToDevice, streams[2]);  // nbytes: size of the copy in bytes
kernel_func<<<1, 32, 0, streams[1]>>>(data, 1);

The result is the same as when host_ptr points to page-locked memory allocated by cudaHostAlloc! But why? Why does launching a kernel on the same stream make the memcpy overlap with the kernel on the next stream?

njuffa · August 16, 2023, 10:54am

What you posted is a code snippet. Posting a minimal but complete reproducer code will significantly increase the likelihood of getting a good answer. In other words, a small program that others can cut & paste, compile, run, profile. This then hopefully reproduces the behavior you are reporting so a meaningful discussion can ensue.
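For illustration, a reproducer of that kind might look roughly like the sketch below. It is not the original poster's code: the kernel body, the `seen` array, the buffer size, and the `run_case` helper are all assumptions made up for this example. Each launch records the value of `data[0]` it observed, so the two launch sequences from the question can be compared. Because the sequences are unordered across streams, no particular output is implied.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel in the question: thread 0 records
// the value of data[0] that this launch observed.
__device__ int seen[2];
__global__ void kernel_func(const int *data, int tag) {
    if (threadIdx.x == 0) seen[tag] = data[0];
}

static int host_array[32];  // static storage: pageable, not page-locked

// Replays the launch sequence from the question. Returns 0 if the final
// synchronization and the readback of `seen` succeeded, 1 otherwise.
int run_case(bool extra_kernel, int h_seen[2]) {
    const size_t nbytes = sizeof(host_array);
    int *data = nullptr;
    cudaStream_t streams[3];

    for (int i = 0; i < 32; ++i) host_array[i] = 1;
    cudaMalloc(&data, nbytes);
    cudaMemset(data, 0, nbytes);  // device buffer starts as all zeros
    for (auto &s : streams) cudaStreamCreate(&s);
    cudaDeviceSynchronize();      // setup is finished before the test begins

    kernel_func<<<1, 32, 0, streams[0]>>>(data, 0);
    if (extra_kernel) kernel_func<<<1, 32, 0, streams[2]>>>(data, 0);
    cudaMemcpyAsync(data, host_array, nbytes, cudaMemcpyHostToDevice, streams[2]);
    kernel_func<<<1, 32, 0, streams[1]>>>(data, 1);

    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpyFromSymbol(h_seen, seen, 2 * sizeof(int));

    for (auto &s : streams) cudaStreamDestroy(s);
    cudaFree(data);
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    for (int extra = 0; extra <= 1; ++extra) {
        int h_seen[2] = {-1, -1};
        if (run_case(extra != 0, h_seen) != 0) {
            fprintf(stderr, "CUDA error\n");
            return 1;
        }
        printf("extra_kernel=%d: seen[0]=%d seen[1]=%d\n",
               extra, h_seen[0], h_seen[1]);
    }
    return 0;
}
```

Something like this can be compiled with nvcc and inspected in a profiler timeline to see where the copy actually lands relative to the kernels.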

It may be because I am very tired right now, but I read the OP twice and I am not sure what it is inquiring about.

Robert_Crovella · August 16, 2023, 12:42pm

shawda.ssik600015 · August 28, 2023, 5:32am

As explained in the link above, the result of performing an async memcpy on non-pinned memory is itself not deterministic. My previous observations depended heavily on the specific code and data used in my application. In conclusion, an async memcpy should only be performed on memory that is already pinned; my previous results were coincidental and cannot be relied on.
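A minimal sketch of what that could look like, assuming the goal is for the kernel on streams[1] to always see the copied data. Pinned memory by itself does not order the copy against kernels in other streams, so this sketch also adds events to express the dependencies. The kernel body, the `seen` array, the event names, and the `run_ordered` helper are hypothetical and only serve the illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 records the value of data[0] it observed.
__device__ int seen[2];
__global__ void kernel_func(const int *data, int tag) {
    if (threadIdx.x == 0) seen[tag] = data[0];
}

// Returns 0 if the final synchronization and readback succeeded, 1 otherwise.
int run_ordered(int h_seen[2]) {
    const size_t nbytes = 32 * sizeof(int);
    int *data = nullptr, *host_ptr = nullptr;
    cudaStream_t streams[3];
    cudaEvent_t kernel0_done, copy_done;

    cudaMalloc(&data, nbytes);
    cudaMemset(data, 0, nbytes);        // device buffer starts as all zeros
    cudaMallocHost(&host_ptr, nbytes);  // page-locked (pinned) host buffer
    for (int i = 0; i < 32; ++i) host_ptr[i] = 1;
    for (auto &s : streams) cudaStreamCreate(&s);
    cudaEventCreateWithFlags(&kernel0_done, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming);
    cudaDeviceSynchronize();            // setup is finished before the test begins

    kernel_func<<<1, 32, 0, streams[0]>>>(data, 0);
    cudaEventRecord(kernel0_done, streams[0]);
    cudaStreamWaitEvent(streams[2], kernel0_done, 0);  // copy waits for kernel 0

    cudaMemcpyAsync(data, host_ptr, nbytes, cudaMemcpyHostToDevice, streams[2]);
    cudaEventRecord(copy_done, streams[2]);
    cudaStreamWaitEvent(streams[1], copy_done, 0);     // kernel 1 waits for the copy

    kernel_func<<<1, 32, 0, streams[1]>>>(data, 1);

    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpyFromSymbol(h_seen, seen, 2 * sizeof(int));

    cudaEventDestroy(kernel0_done);
    cudaEventDestroy(copy_done);
    for (auto &s : streams) cudaStreamDestroy(s);
    cudaFreeHost(host_ptr);
    cudaFree(data);
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    int h_seen[2] = {-1, -1};
    if (run_ordered(h_seen) != 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    printf("seen[0]=%d seen[1]=%d\n", h_seen[0], h_seen[1]);
    return 0;
}
```

With the two event dependencies in place, the kernel on streams[0] should read the buffer before the copy starts and the kernel on streams[1] should read it after the copy finishes, regardless of how the hardware schedules the streams.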
