Parallel execution of GPU and CPU functions using streams


I’m trying to use all the computing power of my GTX 460 graphics card and all 8 cores of my Intel i7-920 processor.

All the CUDA stream tutorials cover situations where two or more GPU actions (kernels and memory copies) are executed simultaneously.

What I need is a simple example of using CUDA streams to execute a GPU kernel and a CPU function in parallel.

Let’s assume I have GPU kernels gpu1(), gpu2(), and gpu3(), and a CPU function cpu1().

  • First, gpu1() must be executed.

  • After gpu1() completes, gpu2() and cpu1() can run in parallel.

  • When both gpu2() and cpu1() are done, gpu3() can be executed.


        gpu1()
        |    |
   ------    ------
   |              |
   v              v
 gpu2()        cpu1()
   |              |
   ------    ------
        |    |
        v    v
        gpu3()


Can someone please write an example of CUDA code using streams for this…


You don’t need streams for that. The CUDA runtime API is naturally asynchronous. When you launch a kernel, control immediately returns to the host thread which executed the launch, and the host thread is free to do whatever it wants while the kernel runs.

Hmmm …

But if I have code like this:

gpu_k<<<dimGrid, dimBlock>>>(a, b);
cpu_f();


cpu_f() will run only after gpu_k is done.

How do I make both gpu_k and cpu_f run in parallel?

That happens automatically. You don’t need to do anything.

So … there will be a barrier only if I call some function like cudaMemcpy?

The barrier function in the runtime API is cudaThreadSynchronize() (in newer CUDA versions, cudaDeviceSynchronize()), but you are correct that the standard cudaMemcpy() calls are also blocking. If you need a non-blocking copy, then you will need to use the async versions of the calls (cudaMemcpyAsync(), with page-locked host memory), and that requires streams. But for overlapping host and device execution, nothing is needed.
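Putting the whole thread together, here is a minimal sketch of the gpu1() → (gpu2() ∥ cpu1()) → gpu3() pattern without any streams. The kernel bodies and the host function are trivial placeholders I made up for illustration; only the launch/synchronize structure matters:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernels: each just adds a constant to a device counter.
__global__ void gpu1(int *d) { d[0] += 1; }
__global__ void gpu2(int *d) { d[0] += 2; }
__global__ void gpu3(int *d) { d[0] += 3; }

// Placeholder for the host-side work that should overlap gpu2.
void cpu1() {
    volatile double x = 0;
    for (int i = 0; i < 10000000; ++i) x += 1.0;
}

int main() {
    int *d;
    cudaMalloc(&d, sizeof(int));
    cudaMemset(d, 0, sizeof(int));

    gpu1<<<1, 1>>>(d);
    cudaDeviceSynchronize();   // barrier: gpu1 must finish first

    gpu2<<<1, 1>>>(d);         // launch returns immediately to the host...
    cpu1();                    // ...so cpu1 runs on the CPU while gpu2 runs on the GPU
    cudaDeviceSynchronize();   // wait for gpu2; cpu1 has already returned

    gpu3<<<1, 1>>>(d);
    cudaDeviceSynchronize();

    int h;
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("result = %d\n", h);   // 1 + 2 + 3 = 6
    cudaFree(d);
    return 0;
}
```

The only synchronization primitive needed is cudaDeviceSynchronize() at the two join points; the fork comes for free from the asynchronous kernel launch.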

Thanks a lot, avidday :)

There is one caveat here that should be fairly obvious, but if you try this:


gpu_k<<<dimGrid, dimBlock>>>(a, b);
cpu_f();
gpu_k<<<dimGrid, dimBlock>>>(a, b);

the CPU code will run before the second GPU kernel is started, because kernel launches are issued by the host thread, which is busy inside cpu_f().
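To sketch that caveat concretely (the kernel and host-function bodies below are invented placeholders; the thread only names gpu_k and cpu_f): if the host work should overlap both kernels, queue both launches before calling the host function.

```cpp
#include <cuda_runtime.h>

// Placeholder kernel: adds inc to a device counter.
__global__ void gpu_k(int *d, int inc) { atomicAdd(d, inc); }

// Placeholder for a long-running host computation.
void cpu_f() {
    volatile double x = 0;
    for (int i = 0; i < 10000000; ++i) x += 1.0;
}

int main() {
    int *d;
    cudaMalloc(&d, sizeof(int));
    cudaMemset(d, 0, sizeof(int));

    // Serialized: the second launch is not issued until cpu_f()
    // returns, because launches come from this host thread.
    gpu_k<<<1, 1>>>(d, 1);
    cpu_f();
    gpu_k<<<1, 1>>>(d, 2);   // only reaches the GPU after cpu_f()
    cudaDeviceSynchronize();

    // Overlapped: queue both kernels first, then do the host work.
    gpu_k<<<1, 1>>>(d, 4);
    gpu_k<<<1, 1>>>(d, 8);   // queued behind the previous launch
    cpu_f();                 // runs while both kernels execute
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```

In the second half, the two launches go into the default stream back-to-back and execute in order on the GPU while cpu_f() runs on the host.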