I am trying to combine “Overlap of Data Transfer and Kernel Execution” with “Concurrent Kernel Execution”, as shown in the attached figure.
Data transfers from host to device overlap with kernel execution.
Kernels A, B and C are executed concurrently on each chunk of data.
However, I don’t know how to program this.
To overlap data transfer with kernel execution, a cudaMemcpyAsync() call and the kernel launch that uses its data need to be issued to the same stream, with the work for other data going to other streams.
On the other hand, kernel launches need to be issued to different streams for the kernels to execute concurrently.
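For reference, this is a minimal sketch of the first pattern as I understand it, with one stream per chunk of data. The kernel `scale`, the chunk count and the sizes are placeholders I made up for illustration, and error checking is omitted for brevity. The host buffer is pinned with cudaMallocHost(), since asynchronous copies from pageable memory generally do not overlap with kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real work on one chunk.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int kChunks = 4;          // assumed number of chunks
    const int kN = 1 << 20;         // assumed elements per chunk
    const size_t chunkBytes = kN * sizeof(float);

    float *h = nullptr, *d = nullptr;
    cudaMallocHost((void **)&h, kChunks * chunkBytes);  // pinned host memory
    cudaMalloc((void **)&d, kChunks * chunkBytes);
    for (int i = 0; i < kChunks * kN; ++i) h[i] = 1.0f;

    cudaStream_t stream[kChunks];
    for (int c = 0; c < kChunks; ++c) cudaStreamCreate(&stream[c]);

    for (int c = 0; c < kChunks; ++c) {
        const size_t off = (size_t)c * kN;
        // Copy and kernel for chunk c share a stream, so the kernel waits
        // for its own copy but not for the copies of other chunks.
        cudaMemcpyAsync(d + off, h + off, chunkBytes,
                        cudaMemcpyHostToDevice, stream[c]);
        scale<<<(kN + 255) / 256, 256, 0, stream[c]>>>(d + off, kN, 2.0f);
        cudaMemcpyAsync(h + off, d + off, chunkBytes,
                        cudaMemcpyDeviceToHost, stream[c]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);

    for (int c = 0; c < kChunks; ++c) cudaStreamDestroy(stream[c]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```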
So kernels A, B and C all need to be issued to the same stream as the cudaMemcpyAsync() call for their data.
But operations in the same stream execute in issue order, so the kernels would not run concurrently if they share a stream.
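One approach that might resolve this, as far as I can tell, is to stop requiring that the copy and its kernels share a stream and to express the dependency with an event instead: the copy goes to its own stream and records an event with cudaEventRecord(), and each kernel stream waits on that event with cudaStreamWaitEvent() before its kernel is launched. The sketch below is only an illustration under my own assumptions: `kernelA`, `kernelB` and `kernelC` are made-up stand-ins that read the same input chunk and write to separate outputs, the sizes are arbitrary, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for kernels A, B and C: each reads the same input
// chunk and writes to its own output buffer.
__global__ void kernelA(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}
__global__ void kernelB(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}
__global__ void kernelC(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

int main()
{
    const int kChunks = 4;          // assumed number of chunks
    const int kN = 1 << 18;         // assumed elements per chunk
    const size_t chunkBytes = kN * sizeof(float);

    float *hIn = nullptr, *dIn = nullptr;
    float *dOut[3] = {nullptr, nullptr, nullptr};
    cudaMallocHost((void **)&hIn, kChunks * chunkBytes);  // pinned host memory
    cudaMalloc((void **)&dIn, kChunks * chunkBytes);
    for (int k = 0; k < 3; ++k) cudaMalloc((void **)&dOut[k], kChunks * chunkBytes);
    for (int i = 0; i < kChunks * kN; ++i) hIn[i] = 3.0f;

    cudaStream_t copyStream, kernelStream[3];
    cudaStreamCreate(&copyStream);
    for (int k = 0; k < 3; ++k) cudaStreamCreate(&kernelStream[k]);

    cudaEvent_t copied[kChunks];    // one "copy finished" event per chunk
    for (int c = 0; c < kChunks; ++c)
        cudaEventCreateWithFlags(&copied[c], cudaEventDisableTiming);

    const dim3 block(256), grid((kN + 255) / 256);
    for (int c = 0; c < kChunks; ++c) {
        const size_t off = (size_t)c * kN;
        // All copies go to one stream; the event marks the end of chunk c's copy.
        cudaMemcpyAsync(dIn + off, hIn + off, chunkBytes,
                        cudaMemcpyHostToDevice, copyStream);
        cudaEventRecord(copied[c], copyStream);
        // Each kernel stream waits for the copy of chunk c, not for the
        // other kernel streams.
        for (int k = 0; k < 3; ++k)
            cudaStreamWaitEvent(kernelStream[k], copied[c], 0);
        kernelA<<<grid, block, 0, kernelStream[0]>>>(dIn + off, dOut[0] + off, kN);
        kernelB<<<grid, block, 0, kernelStream[1]>>>(dIn + off, dOut[1] + off, kN);
        kernelC<<<grid, block, 0, kernelStream[2]>>>(dIn + off, dOut[2] + off, kN);
    }
    cudaDeviceSynchronize();

    float a = 0.0f, b = 0.0f, c3 = 0.0f;
    cudaMemcpy(&a, dOut[0], sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&b, dOut[1], sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&c3, dOut[2], sizeof(float), cudaMemcpyDeviceToHost);
    printf("A: %f  B: %f  C: %f\n", a, b, c3);

    for (int c = 0; c < kChunks; ++c) cudaEventDestroy(copied[c]);
    for (int k = 0; k < 3; ++k) { cudaStreamDestroy(kernelStream[k]); cudaFree(dOut[k]); }
    cudaStreamDestroy(copyStream);
    cudaFree(dIn);
    cudaFreeHost(hIn);
    return 0;
}
```

With this arrangement the copy of chunk c+1 can proceed while the kernels for chunk c are running, and A, B and C are allowed to run at the same time. Whether they actually overlap depends on the device: if one kernel's grid already fills the GPU, the others will probably run after it even though they are in different streams.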