Combination of "Overlap of Data Transfer" and "Concurrent Kernel Execution"

Hi.

I am trying to combine “Overlap of Data Transfer and Kernel Execution” with “Concurrent Kernel Execution”, as shown in the attached figure.

  • Data transfers from host to device overlap with kernel execution.
  • Kernels A, B, and C are executed concurrently on each piece of data.

But I don’t know how to program it.

To overlap data transfer with kernel execution, each cudaMemcpyAsync() and its associated kernel launch need to be issued to the same stream.
On the other hand, kernel launches need to be issued to different streams for the kernels to execute concurrently.

So kernels A, B, and C would need to be issued to the same stream as the cudaMemcpyAsync() of their associated data.
But if they all share one stream, they will be serialized and will not execute concurrently.

How can I program what I want?
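
For reference, here is a minimal sketch of one approach that might resolve the conflict described above: issue the copies to a dedicated copy stream, record an event after each copy, and make three separate kernel streams wait on that event with cudaStreamWaitEvent() before launching. The kernel bodies, the function name runOverlapSketch, the chunk count, and the buffer layout are all assumptions made up for illustration, not something taken from the figure. Whether the kernels and copies really overlap depends on the GPU and on how many resources each kernel uses.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-ins for kernels A, B, and C. Each reads the same input
// chunk and writes to its own output buffer.
__global__ void kernelA(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}
__global__ void kernelB(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}
__global__ void kernelC(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] - 1.0f;
}

// Returns true if the CUDA calls succeed and all three outputs are correct.
bool runOverlapSketch(int numChunks = 4, int n = 1 << 16) {
    const size_t total = size_t(numChunks) * n;
    const size_t chunkBytes = size_t(n) * sizeof(float);

    float* hIn = nullptr;                       // pinned host memory, needed
    cudaMallocHost(&hIn, total * sizeof(float)); // for truly asynchronous copies
    for (size_t i = 0; i < total; ++i) hIn[i] = float(i % 100);

    float* dIn = nullptr;
    float* dOut[3] = {nullptr, nullptr, nullptr};
    cudaMalloc(&dIn, total * sizeof(float));
    for (int k = 0; k < 3; ++k) cudaMalloc(&dOut[k], total * sizeof(float));

    cudaStream_t copyStream, kStream[3];
    cudaStreamCreate(&copyStream);
    for (int k = 0; k < 3; ++k) cudaStreamCreate(&kStream[k]);

    // One event per chunk marks "this chunk has arrived on the device".
    std::vector<cudaEvent_t> copied(numChunks);
    for (auto& e : copied) cudaEventCreateWithFlags(&e, cudaEventDisableTiming);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int c = 0; c < numChunks; ++c) {
        const size_t off = size_t(c) * n;
        // The copy of chunk c goes to the copy stream only.
        cudaMemcpyAsync(dIn + off, hIn + off, chunkBytes,
                        cudaMemcpyHostToDevice, copyStream);
        cudaEventRecord(copied[c], copyStream);
        // Each kernel stream waits for chunk c, then runs its own kernel.
        for (int k = 0; k < 3; ++k) cudaStreamWaitEvent(kStream[k], copied[c], 0);
        kernelA<<<blocks, threads, 0, kStream[0]>>>(dIn + off, dOut[0] + off, n);
        kernelB<<<blocks, threads, 0, kStream[1]>>>(dIn + off, dOut[1] + off, n);
        kernelC<<<blocks, threads, 0, kStream[2]>>>(dIn + off, dOut[2] + off, n);
    }
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    // Copy the results back and compare against the expected values.
    std::vector<float> hOut(total);
    for (int k = 0; k < 3 && ok; ++k) {
        ok = (cudaMemcpy(hOut.data(), dOut[k], total * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess);
        for (size_t i = 0; i < total && ok; ++i) {
            const float x = hIn[i];
            const float want = (k == 0) ? x + 1.0f : (k == 1) ? x * 2.0f : x - 1.0f;
            ok = (hOut[i] == want);
        }
    }

    for (auto& e : copied) cudaEventDestroy(e);
    for (int k = 0; k < 3; ++k) { cudaStreamDestroy(kStream[k]); cudaFree(dOut[k]); }
    cudaStreamDestroy(copyStream);
    cudaFree(dIn);
    cudaFreeHost(hIn);
    return ok;
}
```

In this sketch each chunk has its own region of the device buffers, so the copy of chunk c+1 can be in flight while the kernels for chunk c are running. If the buffers were reused between chunks, the copy stream would additionally have to wait on events recorded in the kernel streams.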

Sorry, I forgot to attach the figure.

figure.bmp (581 KB)