I understand that cudaMemcpyAsync can be used to issue memcpys on specific CUDA streams, and that CUDA streams can be created with different priorities.
I also understand that cudaMemcpyAsync calls issued to the same CUDA stream execute in FIFO order.
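For concreteness, here is a minimal sketch of how two streams with different priorities might be created. The variable names (lowPrio, highPrio) are illustrative, and error checking is omitted for brevity. Note that numerically lower values mean higher priority.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Query the valid priority range for the current device.
    // Numerically lower values mean higher priority, so greatest <= least.
    int leastPriority = 0, greatestPriority = 0;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    // Illustrative names: one stream at each end of the range.
    cudaStream_t lowPrio, highPrio;
    cudaStreamCreateWithPriority(&lowPrio, cudaStreamNonBlocking, leastPriority);
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatestPriority);

    // Read the priorities back to confirm what the runtime assigned.
    int lowValue = 0, highValue = 0;
    cudaStreamGetPriority(lowPrio, &lowValue);
    cudaStreamGetPriority(highPrio, &highValue);
    printf("range: least=%d greatest=%d\n", leastPriority, greatestPriority);
    printf("lowPrio=%d highPrio=%d\n", lowValue, highValue);

    cudaStreamDestroy(lowPrio);
    cudaStreamDestroy(highPrio);
    return 0;
}
```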
My question is: for memcpys in a single direction (i.e. host-to-device or device-to-host), how are they scheduled across different CUDA streams?
Does stream priority apply only to kernel execution, or does it also apply to the DMA (copy) engine?
In an illustrative microbenchmark that issues memcpys on two different streams from a single host thread, I noticed an interesting behavior in Nsight.
Stream 16 is assigned a higher priority than stream 15 (using the cudaStreamCreateWithPriority API). The first request on stream 16 goes through, after which execution falls back to stream 15.
- Is this expected behavior?
- Is there a way to control memcpy prioritization across CUDA streams?
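For reference, a minimal sketch of the kind of microbenchmark described above. It is not the exact code behind the trace: the transfer size, the number of copies, the host-to-device direction, the use of pinned host buffers, and all names (CHECK, lowPrio, highPrio, and so on) are assumptions made for illustration. The stream numbers shown in Nsight (15 and 16) are labels assigned by the tool and are not set in the code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB per copy (arbitrary choice)
    const int copiesPerStream = 4;   // arbitrary choice

    int least = 0, greatest = 0;
    CHECK(cudaDeviceGetStreamPriorityRange(&least, &greatest));

    cudaStream_t lowPrio, highPrio;
    CHECK(cudaStreamCreateWithPriority(&lowPrio, cudaStreamNonBlocking, least));
    CHECK(cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatest));

    // Pinned host buffers, so cudaMemcpyAsync does not fall back to a
    // staged copy through pageable memory.
    void *hostLow = nullptr, *hostHigh = nullptr;
    void *devLow = nullptr, *devHigh = nullptr;
    CHECK(cudaMallocHost(&hostLow, bytes));
    CHECK(cudaMallocHost(&hostHigh, bytes));
    CHECK(cudaMalloc(&devLow, bytes));
    CHECK(cudaMalloc(&devHigh, bytes));

    // A single host thread interleaves host-to-device copies on both streams.
    // Copies within one stream run in issue order; the ordering across the
    // two streams is what the profiler timeline is meant to reveal.
    for (int i = 0; i < copiesPerStream; ++i) {
        CHECK(cudaMemcpyAsync(devLow, hostLow, bytes,
                              cudaMemcpyHostToDevice, lowPrio));
        CHECK(cudaMemcpyAsync(devHigh, hostHigh, bytes,
                              cudaMemcpyHostToDevice, highPrio));
    }

    // Wait for all queued copies to finish before tearing down.
    CHECK(cudaStreamSynchronize(lowPrio));
    CHECK(cudaStreamSynchronize(highPrio));
    printf("issued %d copies of %zu bytes on each stream\n",
           copiesPerStream, bytes);

    CHECK(cudaFree(devLow));
    CHECK(cudaFree(devHigh));
    CHECK(cudaFreeHost(hostLow));
    CHECK(cudaFreeHost(hostHigh));
    CHECK(cudaStreamDestroy(lowPrio));
    CHECK(cudaStreamDestroy(highPrio));
    return 0;
}
```

Built with nvcc and run under Nsight Systems, a program like this should show the copies from both streams on the timeline, which makes it possible to see in what order the copy engine services them.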