I am testing a program that looks like this:
kernel_func<<<1, 32, 0, streams[0]>>>(data, 0);
cudaMemcpyAsync(data, host_ptr, size, cudaMemcpyHostToDevice, streams[2]);
kernel_func<<<1, 32, 0, streams[1]>>>(data, 1);
After launching the kernel on streams[0] and before launching it on streams[1], an asynchronous memcpy is issued to streams[2], which modifies the input data used by kernel_func. If host_ptr is page-locked, the copy is asynchronous and may overlap with kernel_func on streams[1], so the result of kernel_func may not reflect the data written by the memcpy. But what if host_ptr is not page-locked?
I tried a host_ptr that points to a static host array, and the result shows that kernel_func on streams[1] saw the updated data. That works well. But when I launch another kernel_func on streams[2] before the memcpy, like this:
kernel_func<<<1, 32, 0, streams[0]>>>(data, 0);
kernel_func<<<1, 32, 0, streams[2]>>>(data, 0);
cudaMemcpyAsync(data, host_ptr, size, cudaMemcpyHostToDevice, streams[2]);
kernel_func<<<1, 32, 0, streams[1]>>>(data, 1);
The result is the same as when host_ptr points to page-locked memory allocated by cudaHostAlloc: kernel_func on streams[1] no longer reliably sees the updated data. But why? Why does launching a kernel on the same stream before the memcpy let the copy overlap with the kernel on the next stream?
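For anyone who wants to reproduce this, here is a minimal sketch of the second case as a complete program. It is not my real code: the kernel body, the `seen` buffer, the `slot` parameter, the `bytes` size and the `copy_done` event are all stand-ins I made up for illustration. The sketch also makes the dependency explicit with `cudaEventRecord` and `cudaStreamWaitEvent`, on the assumption that what is wanted is for the kernel on streams[1] to always see the copied data, whether or not host_ptr is page-locked.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel_func: records the value it observes in data[0].
__global__ void kernel_func(const int *data, int *seen, int slot) {
    if (threadIdx.x == 0) seen[slot] = data[0];
}

int main() {
    const size_t bytes = 32 * sizeof(int);  // assumed buffer size

    cudaStream_t streams[3];
    for (int i = 0; i < 3; ++i) cudaStreamCreate(&streams[i]);

    int *data = nullptr, *seen = nullptr;
    cudaMalloc(&data, bytes);
    cudaMalloc(&seen, 3 * sizeof(int));
    cudaMemset(data, 0, bytes);             // device input starts as all zeros
    cudaMemset(seen, 0, 3 * sizeof(int));
    cudaDeviceSynchronize();                // make sure the setup has finished

    // Pageable host memory (a static array), filled with ones.
    static int host_array[32];
    for (int i = 0; i < 32; ++i) host_array[i] = 1;
    int *host_ptr = host_array;

    // Event used to order streams[1] after the copy in streams[2].
    cudaEvent_t copy_done;
    cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming);

    kernel_func<<<1, 32, 0, streams[0]>>>(data, seen, 0);
    kernel_func<<<1, 32, 0, streams[2]>>>(data, seen, 1);
    cudaMemcpyAsync(data, host_ptr, bytes, cudaMemcpyHostToDevice, streams[2]);
    cudaEventRecord(copy_done, streams[2]);         // marks the end of the copy
    cudaStreamWaitEvent(streams[1], copy_done, 0);  // streams[1] waits for it
    kernel_func<<<1, 32, 0, streams[1]>>>(data, seen, 2);

    cudaDeviceSynchronize();

    int result[3];
    cudaMemcpy(result, seen, sizeof(result), cudaMemcpyDeviceToHost);
    printf("streams[0] saw %d, streams[2] saw %d, streams[1] saw %d\n",
           result[0], result[1], result[2]);

    cudaEventDestroy(copy_done);
    cudaFree(seen);
    cudaFree(data);
    for (int i = 0; i < 3; ++i) cudaStreamDestroy(streams[i]);
    return 0;
}
```

With the event in place, the kernel on streams[1] should observe the copied value regardless of how host_ptr was allocated. The kernel on streams[0] is not ordered against the copy at all, so what it reads is not determined. Removing the `cudaEventRecord` / `cudaStreamWaitEvent` pair gives back the launch sequence from my question.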