I’m looking at the documentation for cudaMemcpy and cudaMemcpyAsync and I can’t figure out whether calling cudaMemcpyAsync allows the CPU to overlap with the memcpy. Is this the case? Is the CPU still involved in the memcpy after it calls cudaMemcpyAsync or does the GPU take over here?
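The GPU does all the actual copying via DMA. This is why cudaMemcpyAsync needs pinned (page-locked) host memory to run truly asynchronously, and why the call doesn't have to block the CPU. With ordinary pageable host memory the call still works, but the copy may be staged through a driver buffer and behave synchronously with respect to the host.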
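To make that concrete, here is a minimal sketch of what the overlap could look like. It assumes a single device and one non-default stream. The helper name `overlapped_copy_ok`, the `CHECK` macro and the buffer size are made up for illustration and are not from the docs. The host buffers come from cudaMallocHost so they are pinned, and the CPU loop runs while the copies are queued or in flight.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro: bail out of the function on any CUDA error.
// (A sketch only: resources are not released on the error path.)
#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Copies n floats host -> device -> host with cudaMemcpyAsync while the CPU
// does unrelated work. Returns true if the data round-trips intact.
bool overlapped_copy_ok(size_t n) {
    const size_t bytes = n * sizeof(float);
    float *h_src = nullptr, *h_dst = nullptr, *d_buf = nullptr;
    cudaStream_t stream;

    // Pinned (page-locked) host memory, so the DMA engine can access it directly.
    CHECK(cudaMallocHost((void**)&h_src, bytes));
    CHECK(cudaMallocHost((void**)&h_dst, bytes));
    CHECK(cudaMalloc((void**)&d_buf, bytes));
    CHECK(cudaStreamCreate(&stream));

    for (size_t i = 0; i < n; ++i) h_src[i] = (float)i;

    // Both calls only enqueue work on the stream and return to the CPU.
    CHECK(cudaMemcpyAsync(d_buf, h_src, bytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaMemcpyAsync(h_dst, d_buf, bytes, cudaMemcpyDeviceToHost, stream));

    // Unrelated CPU work while the copies proceed.
    // It must not touch h_src or h_dst until the stream is synchronized.
    unsigned long long cpu_work = 0;
    for (int i = 0; i < 1000000; ++i) cpu_work += i;

    // Wait for the copies before reading h_dst.
    CHECK(cudaStreamSynchronize(stream));

    bool ok = (cpu_work == 499999500000ULL);
    for (size_t i = 0; i < n && ok; ++i) ok = (h_dst[i] == h_src[i]);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_dst);
    cudaFreeHost(h_src);
    return ok;
}

int main() {
    printf("%s\n", overlapped_copy_ok(1 << 20) ? "copy ok" : "copy failed");
    return 0;
}
```

The thing to watch is the cudaStreamSynchronize call: until it returns, the host must treat both pinned buffers as owned by the transfer. Swapping cudaMallocHost for plain malloc should still give correct results, but the copies would likely no longer overlap with the CPU loop.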
Great. Thanks for the help.