Using streams... how to?

Here is what I want to do:

  • send data to the device

  • compute kernel A

  • compute kernel B

  • get data from the device

I need to do that on successive blocks of data.

Since I have a compute capability 1.1 device, I want to use streams to overlap the host/device transfers with GPU computation.
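For reference, here is roughly the setup I have: one stream per block of data, and pinned host buffers, since cudaMemcpyAsync only runs asynchronously when the host memory is page-locked. (NUM_BLOCKS, N and the buffer names below are placeholders, not my actual code.)

#define NUM_BLOCKS 4        /* placeholder: number of data blocks */
#define N (1 << 20)         /* placeholder: elements per block    */

cudaStream_t stream[NUM_BLOCKS];
float *h_data[NUM_BLOCKS], *d_data[NUM_BLOCKS];

for (int b = 0; b < NUM_BLOCKS; ++b) {
    cudaStreamCreate(&stream[b]);
    /* page-locked host memory: with plain malloc'd buffers the
       "async" copies silently become synchronous */
    cudaMallocHost((void **)&h_data[b], N * sizeof(float));
    cudaMalloc((void **)&d_data[b], N * sizeof(float));
}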

From the examples, I would assume I need to do something like:

for (int b = 0; b < NUM_BLOCKS; ++b)
    cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[b]);

for (int b = 0; b < NUM_BLOCKS; ++b)
    kernel_A<<<..., stream[b]>>>(...);

for (int b = 0; b < NUM_BLOCKS; ++b)
    kernel_B<<<..., stream[b]>>>(...);

for (int b = 0; b < NUM_BLOCKS; ++b)
    cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[b]);

This should overlap the computation of block “b” with the transfers for blocks “b+1” and “b-1”.
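(To check whether the overlap actually happens, I time the whole batch with events and compare against the no-stream version; this is just my instrumentation, a rough sketch with start/stop as made-up names:)

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);     /* stream 0 synchronizes with all streams */
/* ... the four loops above ... */
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);    /* block until everything has drained */

float ms;
cudaEventElapsedTime(&ms, start, stop);
printf("batch: %.3f ms\n", ms);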

Now the problem is that before computing kernel_A and kernel_B for block “b”, I need to make sure that both kernels have already finished for block “b-1”.
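To make the dependency explicit, what I would need is, in effect, the following; I don't know whether this is even an option for me, since cudaStreamWaitEvent only exists in newer toolkits, and the `done` array and launch parameters below are made-up names:

cudaEvent_t done[NUM_BLOCKS];  /* hypothetical: one "kernels finished" event per block */
for (int b = 0; b < NUM_BLOCKS; ++b)
    cudaEventCreate(&done[b]);

for (int b = 0; b < NUM_BLOCKS; ++b) {
    if (b > 0)  /* block b's kernels must wait for block b-1's kernels */
        cudaStreamWaitEvent(stream[b], done[b - 1], 0);
    kernel_A<<<grid, threads, 0, stream[b]>>>(...);
    kernel_B<<<grid, threads, 0, stream[b]>>>(...);
    cudaEventRecord(done[b], stream[b]);  /* fires once both kernels are done */
}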

Since CUDA won't run several kernels at the same time on the GPU, I was thinking of something like this:

for (int b = 0; b < NUM_BLOCKS; ++b)
    cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[b]);

for (int b = 0; b < NUM_BLOCKS; ++b) {
    kernel_A<<<..., stream[b]>>>(...);
    kernel_B<<<..., stream[b]>>>(...);
}

for (int b = 0; b < NUM_BLOCKS; ++b)
    cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[b]);

In theory, since kernel_A(b) and kernel_B(b) are launched before their “b+1” counterparts, and since CUDA won't run two kernels concurrently, I assume this should do the trick.

But it does not.

Any ideas on how to proceed in this case? And more generally, how should this be done with multiple kernels instead of just send / compute one kernel / receive?
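In case it helps, here is a minimal, self-contained version of what I am trying; the kernels are dummies and NUM_BLOCKS / N are placeholders for my real sizes:

#include <cuda_runtime.h>
#include <stdio.h>

#define NUM_BLOCKS 4      /* placeholder: number of data blocks / streams */
#define N (1 << 20)       /* placeholder: elements per block              */

/* dummy stand-ins for my real kernels */
__global__ void kernel_A(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

__global__ void kernel_B(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    float *h_data[NUM_BLOCKS], *d_data[NUM_BLOCKS];
    cudaStream_t stream[NUM_BLOCKS];

    for (int b = 0; b < NUM_BLOCKS; ++b) {
        cudaMallocHost((void **)&h_data[b], N * sizeof(float)); /* pinned */
        cudaMalloc((void **)&d_data[b], N * sizeof(float));
        cudaStreamCreate(&stream[b]);
        for (int i = 0; i < N; ++i)
            h_data[b][i] = 0.0f;
    }

    dim3 threads(256);
    dim3 grid((N + threads.x - 1) / threads.x);

    /* send data to the device, one block per stream */
    for (int b = 0; b < NUM_BLOCKS; ++b)
        cudaMemcpyAsync(d_data[b], h_data[b], N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);

    /* both kernels for block b go into the same stream, so B(b)
       always runs after A(b) */
    for (int b = 0; b < NUM_BLOCKS; ++b) {
        kernel_A<<<grid, threads, 0, stream[b]>>>(d_data[b], N);
        kernel_B<<<grid, threads, 0, stream[b]>>>(d_data[b], N);
    }

    /* get data back from the device */
    for (int b = 0; b < NUM_BLOCKS; ++b)
        cudaMemcpyAsync(h_data[b], d_data[b], N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);

    cudaThreadSynchronize();  /* wait for all streams before checking */

    printf("h_data[0][0] = %f (expected 2.0)\n", h_data[0][0]);

    for (int b = 0; b < NUM_BLOCKS; ++b) {
        cudaFreeHost(h_data[b]);
        cudaFree(d_data[b]);
        cudaStreamDestroy(stream[b]);
    }
    return 0;
}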