How does cudaMemcpyAsync work with non-page-locked memory?

I am testing a program that looks like this:

kernel_func<<<1, 32, 0, streams[0]>>>(data, 0);
cudaMemcpyAsync(data, host_ptr, size, cudaMemcpyHostToDevice, streams[2]);
kernel_func<<<1, 32, 0, streams[1]>>>(data, 1);

After launching the kernel on streams[0] and before launching it on streams[1], an async memcpy is issued to streams[2], which modifies the input data used by kernel_func. If host_ptr is page-locked, the copy is asynchronous and may overlap with kernel_func on streams[1], so the result of kernel_func may not reflect the changes made by the memcpy. But what if host_ptr is not page-locked?

I tried a host_ptr that points to a static host array, and the result shows that kernel_func on streams[1] sees the data changes. That works well. But when I add another kernel_func launch on streams[2] before the memcpy, like:

kernel_func<<<1, 32, 0, streams[0]>>>(data, 0);
kernel_func<<<1, 32, 0, streams[2]>>>(data, 0);
cudaMemcpyAsync(data, host_ptr, size, cudaMemcpyHostToDevice, streams[2]);
kernel_func<<<1, 32, 0, streams[1]>>>(data, 1);

The result is the same as when host_ptr points to page-locked memory allocated by cudaHostAlloc! But why? Why does launching a kernel on the same stream make the memcpy overlap with the kernel on the next stream?

What you posted is a code snippet. Posting a minimal but complete reproducer code will significantly increase the likelihood of getting a good answer. In other words, a small program that others can cut & paste, compile, run, profile. This then hopefully reproduces the behavior you are reporting so a meaningful discussion can ensue.

It may be because I am very tired right now, but I read the OP twice and I am not sure what it is inquiring about.

As explained in the link above, the result of performing an async memcpy on non-pinned memory is itself not well defined, and my earlier observations depended heavily on the specific code and data used in my application. In conclusion, an async memcpy should only be performed on memory that is already pinned; my previous results were coincidental and cannot be relied on.
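
For anyone landing here later, below is a minimal sketch of the safer pattern, not the code from my application. It assumes a hypothetical kernel_func that reads `data` and writes `data[i] + tag` to a separate output buffer, a 32-element int array, and a helper named run_demo; all of these are made up for illustration. The host buffer is pinned with cudaHostAlloc, and the ordering between streams is stated explicitly with events instead of being left to timing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the CUDA error and bail out of run_demo() on failure.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,              \
                         cudaGetErrorString(err_));                     \
            return false;                                               \
        }                                                               \
    } while (0)

constexpr int N = 32;

// Hypothetical stand-in for the kernel in the question.
__global__ void kernel_func(const int *data, int *out, int tag) {
    int i = threadIdx.x;
    out[i] = data[i] + tag;
}

// Returns true if the kernel on streams[1] saw the data copied on streams[2].
bool run_demo() {
    const size_t bytes = N * sizeof(int);
    cudaStream_t streams[3];
    for (int s = 0; s < 3; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Pinned (page-locked) host buffers, so the async copies are truly async.
    int *host_ptr = nullptr, *host_out = nullptr;
    CHECK(cudaHostAlloc((void **)&host_ptr, bytes, cudaHostAllocDefault));
    CHECK(cudaHostAlloc((void **)&host_out, bytes, cudaHostAllocDefault));
    for (int i = 0; i < N; ++i) host_ptr[i] = 100 + i;

    int *data = nullptr, *out0 = nullptr, *out1 = nullptr;
    CHECK(cudaMalloc((void **)&data, bytes));
    CHECK(cudaMalloc((void **)&out0, bytes));
    CHECK(cudaMalloc((void **)&out1, bytes));
    CHECK(cudaMemset(data, 0, bytes));
    CHECK(cudaDeviceSynchronize());  // make sure the memset has finished

    cudaEvent_t k0_done, copy_done;
    CHECK(cudaEventCreateWithFlags(&k0_done, cudaEventDisableTiming));
    CHECK(cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming));

    // First kernel reads the old contents of data.
    kernel_func<<<1, N, 0, streams[0]>>>(data, out0, 0);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(k0_done, streams[0]));

    // The copy must not overwrite data while the first kernel still reads it.
    CHECK(cudaStreamWaitEvent(streams[2], k0_done, 0));
    CHECK(cudaMemcpyAsync(data, host_ptr, bytes, cudaMemcpyHostToDevice,
                          streams[2]));
    CHECK(cudaEventRecord(copy_done, streams[2]));

    // The second kernel must not start until the copy has completed.
    CHECK(cudaStreamWaitEvent(streams[1], copy_done, 0));
    kernel_func<<<1, N, 0, streams[1]>>>(data, out1, 1);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(host_out, out1, bytes, cudaMemcpyDeviceToHost,
                          streams[1]));
    CHECK(cudaStreamSynchronize(streams[1]));

    bool ok = true;
    for (int i = 0; i < N; ++i) ok = ok && (host_out[i] == host_ptr[i] + 1);

    // Cleanup (return values ignored for brevity in this sketch).
    cudaEventDestroy(k0_done);
    cudaEventDestroy(copy_done);
    cudaFree(data); cudaFree(out0); cudaFree(out1);
    cudaFreeHost(host_ptr); cudaFreeHost(host_out);
    for (int s = 0; s < 3; ++s) cudaStreamDestroy(streams[s]);
    return ok;
}
```

Calling run_demo() from main and compiling with nvcc should be enough to try it. The point of the two cudaStreamWaitEvent calls is that the kernel on streams[1] sees the copied data because the dependency is declared, regardless of whether the host buffer happens to make the copy block.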
