Overhead of using non-default stream with cudaMemcpyAsync() too high?

I just ran a small benchmark to measure the time to transfer data between the host and the device.

In particular, I tested cudaMemcpy(), cudaMemcpyAsync() with the default stream, and with a stream other than the default one, for transferring memory chunk sizes ranging from 8 bytes to 1024 bytes.
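For reference, here is a minimal sketch of the kind of timing loop I used (the buffer sizes, iteration count, and event-based timing are illustrative, not my exact code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t size  = 1024;   // largest chunk size tested
    const int    iters = 1000;   // repetitions to average out launch noise

    // Pinned host memory, so cudaMemcpyAsync can actually run asynchronously
    char *h_buf, *d_buf;
    cudaMallocHost(&h_buf, size);
    cudaMalloc(&d_buf, size);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // 1) Synchronous copy
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMemcpy:                %f us/copy\n", 1000.0f * ms / iters);

    // 2) Async copy on the default stream (stream 0)
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMemcpyAsync (default): %f us/copy\n", 1000.0f * ms / iters);

    // 3) Async copy on a non-default stream
    cudaEventRecord(start, stream);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMemcpyAsync (stream):  %f us/copy\n", 1000.0f * ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```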

Here is what I observed:

  1. Using cudaMemcpyAsync() with the default stream is the fastest, then cudaMemcpy(), then cudaMemcpyAsync() with another stream.
  2. Using cudaMemcpyAsync() with a stream other than the default one incurs roughly 10 times the overhead of using the default stream.

I am very curious why using cudaMemcpyAsync() with a non-default stream incurs such a large overhead. Does anyone have a clue?


I would consider using data sizes larger than 1024 bytes; on my machine, the bandwidth hits 90% of the maximum at around 64 KB and peaks at about 1 MB.

Yeah, for larger sizes, using a non-default stream doesn't make much difference compared to the default one, since the actual copy dominates the overall time. What I want to know is why using a non-default stream introduces such a high (fixed?) overhead.