Does the GK20A support executing multiple streams concurrently? Whenever I issue cudaMemcpyAsync on different streams (I have created 16 streams), the next cudaMemcpyAsync always starts only after the previous one has finished.
Isn’t it at least compute capability 3? Implying that it should…
You need to pass a stream to cudaMemcpyAsync so that it does not use the default stream.
Also, if you issue several cudaMemcpyAsync calls back to back, I am not entirely sure they would actually execute concurrently, even in different streams, since they are (global) memory copies, not kernels.
Run deviceQuery on that GK20A. I think you’ll see that it reports having 1 copy engine.
This means that all cudaMemcpy*** operations will be serialized.
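If building the deviceQuery sample is inconvenient, the copy engine count can also be read directly from the runtime. The following is a minimal sketch, assuming the GK20A is device 0 (that index is my assumption, not something stated in this thread); it prints the `asyncEngineCount` field of `cudaDeviceProp`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int device = 0;  // assumption: the GPU of interest is device 0

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // asyncEngineCount is the number of copy engines the runtime reports:
    // 0 = copies cannot overlap with kernels,
    // 1 = one copy at a time can overlap with kernel execution,
    // 2 = one H2D and one D2H copy can be in flight at the same time.
    std::printf("%s: compute capability %d.%d, asyncEngineCount = %d\n",
                prop.name, prop.major, prop.minor, prop.asyncEngineCount);
    return 0;
}
```

Compile with something like `nvcc query.cu -o query`. A value of 1 would be consistent with the serialization described above.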
“The next cudaMemcpyAsync always starts after the end of the first cudaMemcpyAsync” holds on any GPU when the copies are going in the same direction (H2D or D2H), and it holds on any GPU that has only one copy engine, whatever the directions are.
Streams allow at most 2 cudaMemcpyAsync operations to overlap, and only when:
- The device has 2 copy engines
- The copy operations are going in opposite directions (one is H2D and the other is D2H)
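The two conditions above can be tried out with a short test program. This is only a sketch under a few assumptions of mine: device 0 is used, the buffer size of 16 MiB is arbitrary, and the helper `copies_can_overlap` is a hypothetical name that simply encodes the rule as stated in this thread. It also uses pinned host memory (`cudaMallocHost`), because asynchronous copies from pageable host memory are generally not able to overlap. Whether the two copies really overlap is best confirmed in a profiler timeline.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: encodes the rule from this thread, nothing more.
static bool copies_can_overlap(int asyncEngineCount, bool oppositeDirections) {
    return asyncEngineCount >= 2 && oppositeDirections;
}

// Bail out of main() with a message on any CUDA runtime error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "%s failed: %s\n", #call,              \
                         cudaGetErrorString(e_));                       \
            return 1;                                                   \
        }                                                               \
    } while (0)

int main() {
    const size_t bytes = 16u << 20;  // 16 MiB per buffer (arbitrary size)

    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));  // assumption: device 0
    std::printf("asyncEngineCount = %d, overlap expected: %s\n",
                prop.asyncEngineCount,
                copies_can_overlap(prop.asyncEngineCount, true) ? "yes" : "no");

    // Pinned host buffers, plus one device buffer per direction.
    void *h_in = nullptr, *h_out = nullptr, *d_in = nullptr, *d_out = nullptr;
    CHECK(cudaMallocHost(&h_in, bytes));
    CHECK(cudaMallocHost(&h_out, bytes));
    CHECK(cudaMalloc(&d_in, bytes));
    CHECK(cudaMalloc(&d_out, bytes));
    std::memset(h_in, 0, bytes);
    CHECK(cudaMemset(d_out, 0, bytes));

    // Two non-default streams, one per copy direction.
    cudaStream_t up, down;
    CHECK(cudaStreamCreate(&up));
    CHECK(cudaStreamCreate(&down));

    // H2D in one stream, D2H in the other: the only pairing that can overlap.
    CHECK(cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, up));
    CHECK(cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, down));

    CHECK(cudaStreamSynchronize(up));
    CHECK(cudaStreamSynchronize(down));

    CHECK(cudaStreamDestroy(up));
    CHECK(cudaStreamDestroy(down));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return 0;
}
```

On a device that reports a single copy engine, the two copies should appear one after the other in the profiler even though they are in different streams.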