CUDA stream question regarding transfer and kernel execution concurrency

I experimented with a two-stream example and made some observations. Based on them, the following appears to hold:

  • Two streams can execute concurrently without regard to each other, each with its own event handlers.
  • One stream can execute a kernel while the other executes a transfer. This seems to be by design and is the core idea of having streams according to the book's definition (although the book's CUDA version is very old and I am not sure what enhancements have been made since).
  • From the traces in NVVP (NVIDIA Visual Profiler), it seems that two transfers (either device to host or vice versa) cannot happen at the same time, even though each transfer is issued separately on its own thread. The trace shows that if two transfers are initiated at the same time, one waits for the other to complete. Is this the expected behavior, and if so, why?
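
For reference, a minimal sketch of the kind of two-stream test described above might look like the following. It is not the exact example from the book: the kernel `scale`, the helper `runTwoStreams`, the `CHECK` macro, and the buffer size are all hypothetical names and values chosen for illustration. It assumes device 0 and uses pinned host memory, since `cudaMemcpyAsync` only overlaps with other work when the host buffer is page-locked. It also prints `asyncEngineCount` from `cudaDeviceProp`, which reports how many asynchronous copy engines the device exposes and may be relevant to the third observation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: scales each element in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper macro: report a failed runtime call and return 1.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            return 1;                                              \
        }                                                          \
    } while (0)

// Issues copy-in, kernel, copy-out on each of two streams.
// Returns 0 if every call succeeded and the results are as expected.
int runTwoStreams() {
    const int n = 1 << 22;                  // illustrative size
    const size_t bytes = n * sizeof(float);

    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));   // assumes device 0
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    cudaStream_t stream[2];
    float *host[2], *dev[2];
    for (int s = 0; s < 2; ++s) {
        CHECK(cudaStreamCreate(&stream[s]));
        // Pinned host memory, needed for copies to be truly asynchronous.
        CHECK(cudaMallocHost((void **)&host[s], bytes));
        CHECK(cudaMalloc((void **)&dev[s], bytes));
        for (int i = 0; i < n; ++i) host[s][i] = 1.0f;
    }

    // Work within one stream runs in issue order; the two streams are
    // independent, so the profiler shows which operations overlap.
    for (int s = 0; s < 2; ++s) {
        CHECK(cudaMemcpyAsync(dev[s], host[s], bytes,
                              cudaMemcpyHostToDevice, stream[s]));
        scale<<<(n + 255) / 256, 256, 0, stream[s]>>>(dev[s], n, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(host[s], dev[s], bytes,
                              cudaMemcpyDeviceToHost, stream[s]));
    }
    CHECK(cudaDeviceSynchronize());

    // Every element started at 1.0f and was scaled by 2.0f.
    int ok = (host[0][0] == 2.0f) && (host[1][n - 1] == 2.0f);

    for (int s = 0; s < 2; ++s) {
        CHECK(cudaFree(dev[s]));
        CHECK(cudaFreeHost(host[s]));
        CHECK(cudaStreamDestroy(stream[s]));
    }
    return ok ? 0 : 1;
}

int main() {
    return runTwoStreams();
}
```

Running this under the profiler should show a timeline with the two host-to-device copies, the two kernels, and the two device-to-host copies, which makes it easy to see which pairs overlap on a given device.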

For two transfers that are in the same direction, see here.