unable to get the cpu and gpu to run in parallel

Hi all… quite frustrated by this one.

I do something like:

kernel<<<blocks, threads>>>();   // asynchronous launch; should return immediately
useCpu();                        // CPU work, intended to overlap the kernel
cudaThreadSynchronize();         // block until the kernel has finished

I time each of these three statements.

When useCpu() is empty and does nothing, I get 0.16 ms, 0.00 ms, and 3.81 ms, respectively. Unfortunately, when useCpu() does something useful, I get 0.16 ms, 5.37 ms, and 3.81 ms. Since useCpu() takes longer than the kernel execution, I would expect cudaThreadSynchronize() to return immediately, but it doesn’t.

I’m puzzled: the kernel launch returns before the kernel finishes, as advertised, but execution seems to stall immediately and not resume until cudaThreadSynchronize() is called. What am I missing? Do I need to do something to actually enable asynchronous execution?

This is on Win7 using the 3.0 toolkit, if it matters.

Any suggestions would be greatly appreciated!

Tyler

As far as I understand how things work, there is some sort of command buffer where all your calls end up. That command buffer gets flushed to the GPU occasionally by the driver, and whenever you call functions like cudaThreadSynchronize(). That means your kernel launch doesn’t necessarily start executing on the device right away.
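
If this batching is what’s biting you, one thing worth trying (just a sketch; whether cudaStreamQuery() actually flushes the buffer is driver-dependent behavior, not something the documentation promises) is to issue a cheap non-blocking call right after the launch:

kernel<<<blocks, threads>>>();   // queued, but possibly not yet submitted
cudaStreamQuery(0);              // non-blocking; returns cudaErrorNotReady while
                                 // the kernel runs, and may nudge the driver
                                 // into flushing its command buffer to the GPU
useCpu();                        // CPU work that should now overlap the kernel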

If you have a small amount of work that you need done as fast as possible, then you’d better pipeline your code manually and avoid calling cudaThreadSynchronize() at all. You could also play with the scheduling-type parameter when creating a context to see if it gives you any benefit.

Most people run very big and heavy kernels and synchronize with the CPU at the end, so a 4 ms stall is not that important.

Sergey.

Hi, thanks for the reply!

It wouldn’t surprise me if this were the case… can anyone confirm or deny? I would have expected the CUDA runtime to manage another thread to service asynchronous operations. I only wish I were surprised that the documentation is completely unhelpful.

I’m not sure what you mean by “pipeline my code manually.” The cudaThreadSynchronize() call is there to wait for completion of the kernel so the results can be used. As for the scheduling type, it seems to me that it only determines how synchronous functions (like cudaThreadSynchronize()) wait, not whether asynchronous functions block.
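
For reference, this is the flag I’m talking about (the runtime-API version; the driver-API equivalent is the flags argument to cuCtxCreate()):

// Must be called before the context is created, i.e. before any other
// CUDA runtime call on this thread. It controls HOW the CPU waits inside
// blocking calls like cudaThreadSynchronize(), not whether launches block.
cudaSetDeviceFlags(cudaDeviceScheduleYield);
// alternatives: cudaDeviceScheduleSpin, cudaDeviceBlockingSync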

Oh, don’t be silly! Whether a certain amount of time is important or not depends entirely on the application. CUDA isn’t only for people porting tasks that take hours on a CPU… I like to think that high-speed, real-time processing is important, too.

Well, you are right… Just don’t use cudaThreadSynchronize in this case.

What I mean by manual pipelining is that you schedule your operations for the next iteration first, and only then wait for the previous iteration to finish, using, for example, cuEventSynchronize(). This way the GPU is always working on your next iteration in parallel with whatever the CPU is doing at the moment.
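
Roughly like this (a sketch using the runtime event API instead of cuEventSynchronize; kernel, buf, and useResultsOnCpu are stand-ins for your real code):

cudaEvent_t prevDone;
cudaEventCreate(&prevDone);

kernel<<<blocks, threads>>>(buf[0]);          // prime the pipeline with iteration 0
cudaEventRecord(prevDone, 0);

for (int i = 1; i < N; ++i) {
    kernel<<<blocks, threads>>>(buf[i % 2]);  // queue iteration i first...
    cudaEventSynchronize(prevDone);           // ...then wait for iteration i-1
    useResultsOnCpu(buf[(i - 1) % 2]);        // CPU consumes i-1 while the GPU
                                              // works on iteration i
    cudaEventRecord(prevDone, 0);             // marks the end of iteration i
}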

Sergey.

I see.

In my case, I need the results of the current iteration before I can schedule the next.

But either way, my problem is that the GPU seems not to progress until I call cudaThreadSynchronize(). I expected that would be the case with eventSync as well, but I guess anything is worth a try.

At this point it seems to me that the only way to get the CPU and GPU going in parallel is to create a thread responsible for launching and managing GPU operations (and, of course, setting synchronous operations to wait by yielding periodically). That’s not unreasonable, but I don’t see what use the “asynchronous” CUDA APIs are if the GPU stalls waiting on the CPU at some unknown point.

(somehow managed to reply without putting any text)

I haven’t yet seen a situation where one couldn’t arrange the algorithm in such a way that commands could be submitted in advance. I also doubt you will gain anything by moving all your API calls to a separate thread (I might be wrong, of course).

The application is real-time processing of video from a camera. Part of the processing computes white-balance information from one frame, which is applied to the next. So I get frame n, process it, and get the WB coefficients; then at some time in the future I get frame n+1, which uses those coefficients. But the WB computations are secondary, of course; once I have processed frame n, there is no more work for the GPU to do until frame n+1 is captured.
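
Concretely, the per-frame loop is something like this (a sketch; grabFrame, processFrame, and readBackWB are placeholders for my real code):

WBCoeffs wb = defaultWB();
while (running) {
    Frame f = grabFrame();                         // blocks until frame n arrives
    processFrame<<<blocks, threads>>>(f.dev, wb);  // applies WB from frame n-1
    // ...this is where I'd like the CPU to be doing useful work...
    cudaThreadSynchronize();                       // need frame n's results now
    wb = readBackWB();                             // coefficients for frame n+1
}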

And which steps do you perform on the CPU?

I think we’re getting off-topic.

You should be able to use both the CPU and the GPU at the same time. I thought the CUDA “asynchronous” calls were the way to do that, but it seems not. My best guess is that the CPU code and the code that launches / synchronizes with GPU kernels need to be on separate threads.

That is incorrect. It is very possible to overlap CPU and GPU computation within a single thread - all my linear algebra codes do this as a basic design tenet. CUDA has been fully asynchronous since 1.0 was released three years ago. I am not sure what you are doing wrong (or whether this is actually an instrumentation/measurement problem), but rest assured that you are doing something wrong.
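
For example, here is a minimal sketch of instrumentation that cannot be fooled by the asynchronous launch: time the kernel with GPU-side events and the CPU work with a host timer.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<<blocks, threads>>>();
cudaEventRecord(stop, 0);        // queued after the kernel; does not block

useCpu();                        // runs on the host while the kernel executes

cudaEventSynchronize(stop);      // now wait for the GPU to finish
float kernelMs = 0.0f;
cudaEventElapsedTime(&kernelMs, start, stop);  // pure GPU execution time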