Overhead of using non-default stream with cudaMemcpyAsync() too high?

I just ran a small benchmark to measure the time it takes to transfer data between the host and the device.

In particular, I tested cudaMemcpy(), cudaMemcpyAsync() with the default stream, and cudaMemcpyAsync() with a stream other than the default one, for transfer sizes ranging from 8 bytes to 1024 bytes.
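For readers who want to reproduce this, here is a minimal sketch of such a benchmark. It is not the code that produced the numbers in this post. The pinned host buffer, the host-to-device direction, the iteration count, and the names CHECK, Mode, time_copy_us, and run_benchmark are all assumptions made for illustration. Each asynchronous copy is followed by a cudaStreamSynchronize() so that the measured time covers the completed transfer and not just the enqueue call; dropping that synchronization would measure something different.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,              \
                         cudaGetErrorString(err_));                     \
            std::exit(1);                                               \
        }                                                               \
    } while (0)

// The three variants compared in this thread (names are hypothetical).
enum Mode { SYNC, ASYNC_DEFAULT, ASYNC_STREAM };

// Average host-side wall time, in microseconds, of one completed
// host-to-device copy of `bytes` bytes.
double time_copy_us(Mode mode, void* dst, const void* src, size_t bytes,
                    cudaStream_t stream, int iters)
{
    CHECK(cudaDeviceSynchronize());  // start from an idle device
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        switch (mode) {
        case SYNC:
            CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
            break;
        case ASYNC_DEFAULT:
            CHECK(cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, 0));
            CHECK(cudaStreamSynchronize(0));
            break;
        case ASYNC_STREAM:
            CHECK(cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice,
                                  stream));
            CHECK(cudaStreamSynchronize(stream));
            break;
        }
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}

// Times all three variants for sizes 8, 16, ..., 1024 bytes and prints one
// row per size. Returns the number of rows printed.
int run_benchmark()
{
    const size_t max_bytes = 1024;
    const int iters = 10000;  // assumed; raise it for more stable averages

    void* h_buf = nullptr;
    void* d_buf = nullptr;
    cudaStream_t stream;
    CHECK(cudaMallocHost(&h_buf, max_bytes));  // pinned host memory (assumed)
    CHECK(cudaMalloc(&d_buf, max_bytes));
    CHECK(cudaStreamCreate(&stream));
    std::memset(h_buf, 0, max_bytes);

    // Warm-up so one-time initialization costs are not charged to a variant.
    time_copy_us(SYNC, d_buf, h_buf, max_bytes, stream, 10);
    time_copy_us(ASYNC_DEFAULT, d_buf, h_buf, max_bytes, stream, 10);
    time_copy_us(ASYNC_STREAM, d_buf, h_buf, max_bytes, stream, 10);

    int rows = 0;
    std::printf("%8s %12s %14s %14s\n", "bytes", "memcpy(us)",
                "async/def(us)", "async/strm(us)");
    for (size_t bytes = 8; bytes <= max_bytes; bytes *= 2) {
        double t_sync = time_copy_us(SYNC, d_buf, h_buf, bytes, stream, iters);
        double t_def  = time_copy_us(ASYNC_DEFAULT, d_buf, h_buf, bytes,
                                     stream, iters);
        double t_strm = time_copy_us(ASYNC_STREAM, d_buf, h_buf, bytes,
                                     stream, iters);
        std::printf("%8zu %12.2f %14.2f %14.2f\n", bytes, t_sync, t_def, t_strm);
        ++rows;
    }

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_buf));
    return rows;
}
```

To try it, call run_benchmark() from main() and build with nvcc. The absolute numbers depend on the GPU, driver, and operating system, so they may not match the ratios reported here.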

Here is what I observed:

  1. Using cudaMemcpyAsync() with the default stream is the fastest, then cudaMemcpy(), then cudaMemcpyAsync() with another stream.
  2. Using cudaMemcpyAsync() with a stream other than the default one incurs about 10 times the overhead of using the default stream.

I am very curious why using cudaMemcpyAsync() with another stream incurs such a large overhead. Does anyone have any idea why?

Thanks,
B

I would consider using data sizes larger than 1024 bytes. On my machine, the bandwidth reaches 90% of its maximum at around 64k bytes and peaks at about 1M bytes.

Yeah, for larger sizes, using a non-default stream doesn’t make much difference compared to using the default one, since the actual copy dominates the overall time. What I want to know is why using a non-default stream introduces such a high (fixed?) overhead.