How to make GPU and CPU work at the same time

Hi everyone,
I am doing some image processing on images captured by a webcam. What I want to achieve is:

CPU: capture frames t0~t4 -> capture frames t5~t9 -> display frames t0~t4, capture frames t10~t14 -> display frames t5~t9, capture frames t15~t19 ...

GPU: idle -> compute frames t0~t4 -> compute frames t5~t9 -> compute frames t10~t14 ...

I am able to do this sequentially:
CPU capture frames t0~t4 -> GPU compute frames t0~t4 -> CPU display t0~t4 -> CPU capture frames t5~t9 -> GPU compute frames t5~t9 -> CPU display t5~t9 -> ...

Done this way, there is a relatively large time gap between t4 and t5, while frames t0~t4 are nearly identical to each other. I think the CPU and GPU should technically be able to work at the same time, but I can't figure out how. Any tips?

They should. Just launch the kernel and use the async functions. You also need to check which OS you are on.

Can you give me a more detailed explanation or an example?

Let's say I have main.cpp, and main.cpp looks like this:

loop {
    Cam capture(t(i)~t(i+4));
    compute();
    Cam display(t(i)~t(i+4));
}

How do you make the loop go on after compute() is called?

Kernel launches are asynchronous, so the loop can be ordered like this:

launch kernels on the current frames
capture next frames
display current frames

The kernels and the capture then run in parallel.
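
The loop ordering described in this reply can be sketched roughly as below. This is only a minimal illustration, not anyone's actual code: process, capture_batch, display_batch and run_pipeline are hypothetical names, the capture and display functions are stand-ins for the OpenCV calls, and the batch size of five 640x480 grayscale frames is an assumption. Two pinned host buffers are used so the CPU can fill one while the GPU reads the other.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder for the real image-processing kernels.
__global__ void process(const unsigned char* in, unsigned char* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];  // invert each pixel
}

// Hypothetical stand-ins for the OpenCV capture and display calls.
void capture_batch(unsigned char* dst, int n, int batch) {
    for (int i = 0; i < n; ++i) dst[i] = static_cast<unsigned char>((i + batch) % 256);
}
void display_batch(const unsigned char* src, int batch) {
    std::printf("batch %d ready, first pixel = %d\n", batch, static_cast<int>(src[0]));
}

// Runs num_batches batches through the pipeline; returns how many were displayed.
int run_pipeline(int num_batches) {
    const int n = 5 * 640 * 480;  // five 640x480 grayscale frames (assumed size)
    unsigned char* h_in[2] = {nullptr, nullptr};
    unsigned char *h_out = nullptr, *d_in = nullptr, *d_out = nullptr;
    cudaStream_t stream = nullptr;

    // Pinned host buffers, so cudaMemcpyAsync does not block the CPU.
    bool ok = cudaMallocHost(&h_in[0], n) == cudaSuccess &&
              cudaMallocHost(&h_in[1], n) == cudaSuccess &&
              cudaMallocHost(&h_out, n) == cudaSuccess &&
              cudaMalloc(&d_in, n) == cudaSuccess &&
              cudaMalloc(&d_out, n) == cudaSuccess &&
              cudaStreamCreate(&stream) == cudaSuccess;

    int displayed = 0;
    if (ok && num_batches > 0) capture_batch(h_in[0], n, 0);  // first batch, nothing to overlap yet
    for (int b = 0; ok && b < num_batches; ++b) {
        unsigned char* cur = h_in[b % 2];
        unsigned char* next = h_in[(b + 1) % 2];

        // Queue copy-in, kernel and copy-out; each call returns without waiting.
        cudaMemcpyAsync(d_in, cur, n, cudaMemcpyHostToDevice, stream);
        process<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
        cudaMemcpyAsync(h_out, d_out, n, cudaMemcpyDeviceToHost, stream);

        // The CPU captures the next batch while the GPU is busy with this one.
        if (b + 1 < num_batches) capture_batch(next, n, b + 1);

        // Wait for the GPU only when the result is needed for display.
        ok = cudaStreamSynchronize(stream) == cudaSuccess;
        if (ok) {
            display_batch(h_out, b);
            ++displayed;
        }
    }

    if (stream) cudaStreamDestroy(stream);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in[0]);
    cudaFreeHost(h_in[1]);
    cudaFreeHost(h_out);
    return displayed;
}
```

Calling run_pipeline(3) from main would push three batches through. The only point where the CPU waits is cudaStreamSynchronize, right before the display step.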

Thank you for your response.

I have a few kernel calls in my code, and they have to be launched in order. Does that matter?

Do you mean the frame capture code also has to be written in a .cu file? I am using OpenCV functions to capture images from the webcam, and they won't compile with nvcc. If so, how do you solve this problem?


A kernel launch returns control to the CPU code right after the call, so the CPU and GPU work in parallel. It is somewhat OS dependent, though. What is your OS?
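
On the question of kernels that have to run in order: kernels launched into the same stream (here the default stream) execute in the order they were issued, so the host does not have to wait between launches. A minimal sketch, where add_one, times_two and run_ordered are hypothetical names chosen for illustration:

```cuda
#include <cuda_runtime.h>

// Two hypothetical processing steps that must run one after the other.
__global__ void add_one(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}
__global__ void times_two(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Fills a buffer with x, applies both steps on the GPU and returns the
// first element of the result, or -1 if a CUDA call failed.
int run_ordered(int x) {
    const int n = 1024;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = x;

    int* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return -1;
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    add_one<<<n / 256, 256>>>(d, n);    // returns to the CPU immediately
    times_two<<<n / 256, 256>>>(d, n);  // queued behind add_one in the same stream

    // ... the CPU is free to do other work here (e.g. capture frames) ...

    // A blocking cudaMemcpy on the default stream waits for both kernels.
    int result = -1;
    if (cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess)
        result = h[0];
    cudaFree(d);
    return result;
}
```

Because both kernels go into the same stream, times_two always sees the output of add_one. A host function like run_ordered can live in a .cu file and be called from a plain .cpp file through an ordinary declaration, so OpenCV code can stay in the file built by the regular C++ compiler.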

My OS is Windows 7 64-bit. When you say CPU code, do you mean the code in the .cpp file or the code in the .cu file?

It does not matter. A kernel<<<>>> launch returns right after the call while the GPU is still working. However, on Windows 7 GPU calls are batched, so you need additional tricks.

What are the tricks?

Run a lot of dummy kernels or other async calls to ensure that the actual call to the GPU has been made.


Does it work with the runtime API?

cudaStreamQuery(0) :P

“Returns cudaSuccess if all operations in stream have completed, or cudaErrorNotReady if not.”

It is hard to figure out from that description that this function also sends the batch to the GPU.
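
The cudaStreamQuery(0) trick mentioned in this thread might look roughly like the sketch below. The flushing effect on Windows is taken from the posts here, not from the documented contract of the function; busy_kernel and launch_and_flush are hypothetical names, and the kernel just burns some GPU time.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that keeps the GPU busy for a little while.
__global__ void busy_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k) v = v * 0.999f + 0.001f;
        data[i] = v;
    }
}

// Launches the kernel, queries the default stream right away, and returns
// true if every CUDA call behaved as expected.
bool launch_and_flush(int n) {
    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemset(d, 0, n * sizeof(float));

    busy_kernel<<<(n + 255) / 256, 256>>>(d, n);  // returns immediately

    // Non-blocking query of the default stream. According to this thread it
    // also makes the Windows driver submit the batched launch. The result is
    // cudaSuccess if the work already finished, cudaErrorNotReady otherwise.
    cudaError_t q = cudaStreamQuery(0);
    bool ok = (q == cudaSuccess || q == cudaErrorNotReady);

    // ... CPU work that should overlap with the kernel goes here ...

    // Block only when the result is actually needed.
    ok = ok && (cudaDeviceSynchronize() == cudaSuccess);
    cudaFree(d);
    return ok;
}
```

Neither return value of the query is treated as an error here, since the call is used only for its side effect and not to decide anything.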

Which is intentional. Most of the time, you shouldn’t try to manage batching yourself.

This came up in another thread. If the documentation is going to claim that kernel calls are asynchronous when they aren’t in Windows because of batching, that should be explained and the workaround given.