Using streams... how to?

Here is what I want to do:

  • send data to the device

  • compute kernel A

  • compute kernel B

  • get data from the device

I need to do that on successive blocks of data.

Since I have a compute capability 1.1 device, I want to use streams to overlap the host/device transfers with GPU computation.
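For reference, here is roughly the setup I have: one stream per block of data, and pinned host buffers, since cudaMemcpyAsync only runs asynchronously when the host memory is page-locked. (NUM_BLOCKS, N and the buffer names below are placeholders, not my actual code.)

#define NUM_BLOCKS 4        /* placeholder: number of data blocks */
#define N (1 << 20)         /* placeholder: elements per block    */

cudaStream_t stream[NUM_BLOCKS];
float *h_data[NUM_BLOCKS], *d_data[NUM_BLOCKS];

for (int b = 0; b < NUM_BLOCKS; ++b) {
    cudaStreamCreate(&stream[b]);
    /* page-locked host memory: with plain malloc'd buffers the
       "async" copies silently become synchronous */
    cudaMallocHost((void **)&h_data[b], N * sizeof(float));
    cudaMalloc((void **)&d_data[b], N * sizeof(float));
}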

From the examples, I would assume I need to do something like:

for (int b = 0; b < NUM_BLOCKS; ++b)
    cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[b]);

for (int b = 0; b < NUM_BLOCKS; ++b)
    kernel_A<<<..., stream[b]>>>(...);

for (int b = 0; b < NUM_BLOCKS; ++b)
    kernel_B<<<..., stream[b]>>>(...);

for (int b = 0; b < NUM_BLOCKS; ++b)
    cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[b]);

This should overlap the computation of block “b” with the transfers for blocks “b+1” and “b-1”.
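(To check whether the overlap actually happens, I time the whole batch with events and compare against the no-stream version; this is just my instrumentation, a rough sketch with start/stop as made-up names:)

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);     /* stream 0 synchronizes with all streams */
/* ... the four loops above ... */
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);    /* block until everything has drained */

float ms;
cudaEventElapsedTime(&ms, start, stop);
printf("batch: %.3f ms\n", ms);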

Now the problem is that before computing kernel_A and kernel_B for block “b”, I need to make sure that both kernels have already finished for block “b-1”.
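To make the dependency explicit, what I would need is, in effect, the following; I don't know whether this is even an option for me, since cudaStreamWaitEvent only exists in newer toolkits, and the `done` array and launch parameters below are made-up names:

cudaEvent_t done[NUM_BLOCKS];  /* hypothetical: one "kernels finished" event per block */
for (int b = 0; b < NUM_BLOCKS; ++b)
    cudaEventCreate(&done[b]);

for (int b = 0; b < NUM_BLOCKS; ++b) {
    if (b > 0)  /* block b's kernels must wait for block b-1's kernels */
        cudaStreamWaitEvent(stream[b], done[b - 1], 0);
    kernel_A<<<grid, threads, 0, stream[b]>>>(...);
    kernel_B<<<grid, threads, 0, stream[b]>>>(...);
    cudaEventRecord(done[b], stream[b]);  /* fires once both kernels are done */
}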

Since CUDA won't run several kernels at the same time on the GPU, I was thinking of something like this:

for (int b = 0; b < NUM_BLOCKS; ++b)
    cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[b]);

for (int b = 0; b < NUM_BLOCKS; ++b) {
    kernel_A<<<..., stream[b]>>>(...);
    kernel_B<<<..., stream[b]>>>(...);
}

for (int b = 0; b < NUM_BLOCKS; ++b)
    cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[b]);

In theory, since kernel_A(b) and kernel_B(b) are launched before their “b+1” counterparts, and since CUDA won't run two kernels concurrently, I assume this should do the trick.

But it does not.

Any ideas on how to proceed in this case? And more generally, how should this be done with multiple kernels instead of just send / compute one kernel / receive?
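In case it helps, here is a minimal, self-contained version of what I am trying; the kernels are dummies and NUM_BLOCKS / N are placeholders for my real sizes:

#include <cuda_runtime.h>
#include <stdio.h>

#define NUM_BLOCKS 4      /* placeholder: number of data blocks / streams */
#define N (1 << 20)       /* placeholder: elements per block              */

/* dummy stand-ins for my real kernels */
__global__ void kernel_A(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

__global__ void kernel_B(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    float *h_data[NUM_BLOCKS], *d_data[NUM_BLOCKS];
    cudaStream_t stream[NUM_BLOCKS];

    for (int b = 0; b < NUM_BLOCKS; ++b) {
        cudaMallocHost((void **)&h_data[b], N * sizeof(float)); /* pinned */
        cudaMalloc((void **)&d_data[b], N * sizeof(float));
        cudaStreamCreate(&stream[b]);
        for (int i = 0; i < N; ++i)
            h_data[b][i] = 0.0f;
    }

    dim3 threads(256);
    dim3 grid((N + threads.x - 1) / threads.x);

    /* send data to the device, one block per stream */
    for (int b = 0; b < NUM_BLOCKS; ++b)
        cudaMemcpyAsync(d_data[b], h_data[b], N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);

    /* both kernels for block b go into the same stream, so B(b)
       always runs after A(b) */
    for (int b = 0; b < NUM_BLOCKS; ++b) {
        kernel_A<<<grid, threads, 0, stream[b]>>>(d_data[b], N);
        kernel_B<<<grid, threads, 0, stream[b]>>>(d_data[b], N);
    }

    /* get data back from the device */
    for (int b = 0; b < NUM_BLOCKS; ++b)
        cudaMemcpyAsync(h_data[b], d_data[b], N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);

    cudaThreadSynchronize();  /* wait for all streams before checking */

    printf("h_data[0][0] = %f (expected 2.0)\n", h_data[0][0]);

    for (int b = 0; b < NUM_BLOCKS; ++b) {
        cudaFreeHost(h_data[b]);
        cudaFree(d_data[b]);
        cudaStreamDestroy(stream[b]);
    }
    return 0;
}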