Must overlapping computation and data transfers use pinned memory or UVA?

The official CUDA C Programming Guide, section 9.1.2, says: “In contrast with cudaMemcpy(), the asynchronous transfer version requires pinned host memory (see Pinned Memory), and it contains an additional argument, a stream ID.”

The official document also gives a concurrent copy and execute example:

cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream2>>>(otherData_d);

In the example above, must a_h be pinned memory, allocated with cudaHostAlloc or registered with cudaHostRegister? Can normal host memory allocated with malloc achieve concurrent copy and execute? I tried non-pinned memory and it worked functionally, but I don’t know whether the copy and the kernel actually ran concurrently.

Thanks.

Yes, to witness overlap/concurrency, you would normally want a_h to be allocated with a pinned allocator. If you don’t use a pinned allocator, the operation won’t fail or return an error, but it generally will not overlap with the subsequent kernel call.
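
For illustration, here is a minimal sketch of the same example with a_h allocated by a pinned allocator. The kernel body, the element count n, the launch configuration, and the CHECK macro are assumptions added to make it self-contained; they are not from the official document.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel in the example above.
__global__ void kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

int main() {
    const int n = 1 << 24;                  // assumed element count
    const size_t size = n * sizeof(float);

    float *a_h = nullptr, *a_d = nullptr, *otherData_d = nullptr;

    // Pinned (page-locked) host buffer instead of malloc().
    CHECK(cudaHostAlloc((void **)&a_h, size, cudaHostAllocDefault));
    CHECK(cudaMalloc((void **)&a_d, size));
    CHECK(cudaMalloc((void **)&otherData_d, size));

    for (int i = 0; i < n; ++i) a_h[i] = 1.0f;
    CHECK(cudaMemset(otherData_d, 0, size));

    cudaStream_t stream1, stream2;
    CHECK(cudaStreamCreate(&stream1));
    CHECK(cudaStreamCreate(&stream2));

    const int block = 256;
    const int grid = (n + block - 1) / block;

    // The copy (stream1) and the kernel (stream2) use different buffers
    // and different non-default streams, so they are eligible to overlap.
    CHECK(cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1));
    kernel<<<grid, block, 0, stream2>>>(otherData_d, n);
    CHECK(cudaGetLastError());

    // Wait for both streams to finish before releasing resources.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaStreamDestroy(stream1));
    CHECK(cudaStreamDestroy(stream2));
    CHECK(cudaFree(a_d));
    CHECK(cudaFree(otherData_d));
    CHECK(cudaFreeHost(a_h));               // pinned memory is freed with cudaFreeHost
    printf("done\n");
    return 0;
}
```

If the buffer already comes from malloc, calling cudaHostRegister(a_h, size, cudaHostRegisterDefault) before the copy (and cudaHostUnregister afterwards) should pin it in place instead. Whether overlap actually occurs also depends on the device having a copy engine, which cudaGetDeviceProperties reports as asyncEngineCount.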

You should be able to confirm the behavior with a profiler: its timeline shows whether the copy and the kernel overlap.