You don’t need streams for that. Kernel launches in the CUDA runtime API are asynchronous by default: when you launch a kernel, control returns immediately to the host thread that executed the launch, and that thread is free to do whatever it wants while the kernel runs.
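A minimal sketch of that pattern (the kernel and the host-side work here are hypothetical placeholders, not anything from your code):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel, standing in for any long-running device work.
__global__ void scaleKernel(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;
}

// Hypothetical host-side work to overlap with the kernel.
static double hostWork(int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += i * 0.5;
    return sum;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // The launch returns immediately; the kernel runs on the device
    // while the host thread continues past this line.
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // This host work runs concurrently with the kernel's execution.
    double result = hostWork(n);

    // Block until all previously issued device work has finished.
    cudaDeviceSynchronize();

    printf("host result: %f\n", result);
    cudaFree(d_data);
    return 0;
}
```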
The barrier function in the runtime API is cudaDeviceSynchronize() (the older cudaThreadSynchronize() is deprecated), and you are correct that the plain cudaMemcpy() calls are blocking. If you need a non-blocking copy, you will need the async versions of the calls (cudaMemcpyAsync()), and that requires streams. But for overlapping host and device execution, nothing extra is needed.
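For completeness, a sketch of the async copy path, assuming page-locked (pinned) host memory, which cudaMemcpyAsync() needs in order to be truly non-blocking with respect to the host:

```cpp
#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;
    float *h_buf, *d_buf;

    // Pinned host allocation so the async copy does not fall back
    // to synchronous behavior.
    cudaMallocHost(&h_buf, n * sizeof(float));
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Returns immediately; the copy is queued on the stream.
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... host work here overlaps with the transfer ...

    // Wait for everything queued on this stream to complete.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```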