Here is what I want to do:
- send data to the device
- compute kernel A
- compute kernel B
- get data from the device
I need to do that on successive blocks of data.
Since I have a compute capability 1.1 device, I want to use streams to overlap the host/device transfers with GPU computation.
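For reference, here is roughly how I set things up (all the names and sizes below are just placeholders, not my real code); as I understand it, cudaMemcpyAsync needs page-locked host memory, so the host buffers are allocated with cudaMallocHost:

#define NBLOCKS 8
#define N       (1 << 20)   /* elements per block, placeholder */

cudaStream_t stream[NBLOCKS];
float *h_in, *h_out, *d_in, *d_out;

/* page-locked host memory, required for async copies */
cudaMallocHost((void**)&h_in,  NBLOCKS * N * sizeof(float));
cudaMallocHost((void**)&h_out, NBLOCKS * N * sizeof(float));
cudaMalloc((void**)&d_in,  NBLOCKS * N * sizeof(float));
cudaMalloc((void**)&d_out, NBLOCKS * N * sizeof(float));

for (int b = 0; b < NBLOCKS; ++b)
    cudaStreamCreate(&stream[b]);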
From the examples, I would assume I need to do something like:
for (blocks)
    cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[block]);
for (blocks)
    kernel_A<<<..., stream[block]>>>();
for (blocks)
    kernel_B<<<..., stream[block]>>>();
for (blocks)
    cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[block]);
This should overlap the computation for block “b” with the transfers for blocks “b+1” and “b-1”.
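Spelled out with the placeholder buffers and streams from above (grid, threads, and the kernel arguments are assumptions, not my real code), the batched version would look like:

for (int b = 0; b < NBLOCKS; ++b)
    cudaMemcpyAsync(d_in + b * N, h_in + b * N, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream[b]);
for (int b = 0; b < NBLOCKS; ++b)
    kernel_A<<<grid, threads, 0, stream[b]>>>(d_in + b * N, N);
for (int b = 0; b < NBLOCKS; ++b)
    kernel_B<<<grid, threads, 0, stream[b]>>>(d_in + b * N, d_out + b * N, N);
for (int b = 0; b < NBLOCKS; ++b)
    cudaMemcpyAsync(h_out + b * N, d_out + b * N, N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[b]);
cudaThreadSynchronize();  /* wait for all streams to drain (CUDA 1.x/2.x API) */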
Now the problem is that to compute kernel_A and kernel_B for block “b”, I need both kernels to have finished for block “b-1” first.
Since CUDA won’t run several kernels on the GPU at the same time, I was thinking of something like this:
for (blocks)
    cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[block]);
for (blocks) {
    kernel_A<<<..., stream[block]>>>();
    kernel_B<<<..., stream[block]>>>();
}
for (blocks)
    cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[block]);
In theory, since kernel_A(b) and kernel_B(b) are launched before their b+1 counterparts, and since CUDA won’t run two kernels concurrently, I assumed this would do the trick.
But it does not.
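One thing I considered is making the cross-stream dependency explicit with events, along these lines (a rough sketch reusing the placeholder names from above; cudaStreamWaitEvent comes from a newer toolkit than the 1.1-era one, so I’m not sure it’s available to me):

cudaEvent_t done[NBLOCKS];
for (int b = 0; b < NBLOCKS; ++b)
    cudaEventCreate(&done[b]);

for (int b = 0; b < NBLOCKS; ++b) {
    cudaMemcpyAsync(d_in + b * N, h_in + b * N, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream[b]);
    if (b > 0)  /* block b's kernels must wait until block b-1's kernels are done */
        cudaStreamWaitEvent(stream[b], done[b - 1], 0);
    kernel_A<<<grid, threads, 0, stream[b]>>>(d_in + b * N, N);
    kernel_B<<<grid, threads, 0, stream[b]>>>(d_in + b * N, d_out + b * N, N);
    cudaEventRecord(done[b], stream[b]);  /* marks block b's kernels as finished */
    cudaMemcpyAsync(h_out + b * N, d_out + b * N, N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[b]);
}

Since only the kernels wait on the event, the transfers in each stream should still be free to overlap.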
Any ideas on how to proceed in this case? And more generally, with multiple kernels instead of just the send / compute one kernel / receive pattern?