Section 9.1.2 of the official CUDA C Programming Guide says: “In contrast with cudaMemcpy(), the asynchronous transfer version requires pinned host memory (see Pinned Memory), and it contains an additional argument, a stream ID.”
The official documentation also gives a concurrent copy and execute example, shown below:
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream2>>>(otherData_d);
In the example above, must a_h be pinned memory, either allocated with cudaHostAlloc or registered with cudaHostRegister? Can ordinary host memory allocated with malloc also get the concurrent copy and execute effect? I tried non-pinned memory and it produced correct results, but I don’t know whether the copy and the kernel actually ran concurrently.
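For context, this is roughly how I would compare the two cases. It is only a sketch under my own assumptions: the kernel body, the buffer size, the CHECK macro, and the helper name timeCopyAndKernel are placeholders I made up, not anything from the guide. It runs the guide's copy and kernel pair once with a_h from cudaHostAlloc and once with a_h from malloc, and prints the wall-clock time of each.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Hypothetical stand-in for the guide's kernel(): busy work on otherData_d.
__global__ void kernel(float *otherData_d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = otherData_d[i];
        for (int k = 0; k < 2000; ++k) v = v * 1.0001f + 0.5f;
        otherData_d[i] = v;
    }
}

// Issues the copy in stream1 and the kernel in stream2, then returns the
// wall-clock milliseconds until both have finished.
// pinned == true  -> a_h comes from cudaHostAlloc (page-locked)
// pinned == false -> a_h comes from malloc (pageable)
double timeCopyAndKernel(bool pinned) {
    const int n = 1 << 24;                  // arbitrary size: 16M floats
    const size_t size = n * sizeof(float);

    float *a_h = nullptr, *a_d = nullptr, *otherData_d = nullptr;
    if (pinned) {
        CHECK(cudaHostAlloc((void **)&a_h, size, cudaHostAllocDefault));
    } else {
        a_h = (float *)malloc(size);
        if (a_h == nullptr) { fprintf(stderr, "malloc failed\n"); exit(EXIT_FAILURE); }
    }
    memset(a_h, 0, size);                   // touch the host pages before timing
    CHECK(cudaMalloc(&a_d, size));
    CHECK(cudaMalloc(&otherData_d, size));
    CHECK(cudaMemset(otherData_d, 0, size));

    cudaStream_t stream1, stream2;
    CHECK(cudaStreamCreate(&stream1));
    CHECK(cudaStreamCreate(&stream2));

    CHECK(cudaDeviceSynchronize());         // start timing from an idle device
    auto t0 = std::chrono::steady_clock::now();
    CHECK(cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1));
    kernel<<<(n + 255) / 256, 256, 0, stream2>>>(otherData_d, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());         // wait for both streams
    auto t1 = std::chrono::steady_clock::now();

    CHECK(cudaStreamDestroy(stream1));
    CHECK(cudaStreamDestroy(stream2));
    CHECK(cudaFree(a_d));
    CHECK(cudaFree(otherData_d));
    if (pinned) CHECK(cudaFreeHost(a_h)); else free(a_h);

    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    int asyncEngines = 0;
    CHECK(cudaDeviceGetAttribute(&asyncEngines, cudaDevAttrAsyncEngineCount, 0));
    printf("async copy engines: %d\n", asyncEngines);

    timeCopyAndKernel(true);                // warm-up run, result discarded
    printf("pinned   a_h: %.3f ms\n", timeCopyAndKernel(true));
    printf("pageable a_h: %.3f ms\n", timeCopyAndKernel(false));
    return 0;
}
```

I realize the total time alone cannot prove whether the copy and the kernel overlapped, so I assume a profiler timeline would be the more reliable way to check.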
Thanks.