unable to get the cpu and gpu to run in parallel

Hi all… quite frustrated by this one.

I do something like:

kernel<<<blocks, threads>>>();   // asynchronous launch; should return immediately
useCpu();                        // CPU work, intended to overlap the kernel
cudaThreadSynchronize();         // block until the kernel has finished

I time each of these three statements.

When useCpu() is empty and does nothing, I get 0.16 ms, 0.00 ms, and 3.81 ms, respectively. Unfortunately, when useCpu() does something useful, I get 0.16 ms, 5.37 ms, and 3.81 ms. Since useCpu() takes longer than the kernel execution, I would expect cudaThreadSynchronize() to return immediately, but it doesn’t.

I’m puzzled: the kernel launch returns before the kernel finishes, as advertised, but execution seems to stall immediately and not resume until cudaThreadSynchronize() is called. What am I missing? Do I need to do something to actually enable asynchronous execution?

This is on Win7 using the 3.0 toolkit, if it matters.

Any suggestions would be greatly appreciated!

Tyler

As far as I understand how things work, there is some sort of command buffer where all your calls end up. That command buffer gets flushed to the GPU occasionally by the driver, and whenever you call functions like cudaThreadSynchronize(). That means your kernel launch doesn’t necessarily start executing on the device right away.
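
If this batching is what’s biting you, one thing worth trying (just a sketch; whether cudaStreamQuery() actually flushes the buffer is driver-dependent behavior, not something the documentation promises) is to issue a cheap non-blocking call right after the launch:

kernel<<<blocks, threads>>>();   // queued, but possibly not yet submitted
cudaStreamQuery(0);              // non-blocking; returns cudaErrorNotReady while
                                 // the kernel runs, and may nudge the driver
                                 // into flushing its command buffer to the GPU
useCpu();                        // CPU work that should now overlap the kernel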

If you have a small amount of work that you need done as fast as possible, then you’d better pipeline your code manually and avoid calling cudaThreadSynchronize() at all. You could also play with the scheduling-type parameter when creating a context to see if it gives you any benefit.

Most people run very big and heavy kernels and synchronize with the CPU at the end, so a 4 ms stall is not that important.

Sergey.

Hi, thanks for the reply!

It wouldn’t surprise me if this were the case… can anyone confirm or deny? I would have expected the CUDA runtime to manage another thread to service asynchronous operations. I only wish I were surprised that the documentation is completely unhelpful.

I’m not sure what you mean by “pipeline my code manually.” The cudaThreadSynchronize() call is there to wait for completion of the kernel so the results can be used. As for the scheduling type, it seems to me that it only determines how synchronous functions (like cudaThreadSynchronize()) wait, not whether asynchronous functions block.
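
For reference, this is the flag I’m talking about (the runtime-API version; the driver-API equivalent is the flags argument to cuCtxCreate()):

// Must be called before the context is created, i.e. before any other
// CUDA runtime call on this thread. It controls HOW the CPU waits inside
// blocking calls like cudaThreadSynchronize(), not whether launches block.
cudaSetDeviceFlags(cudaDeviceScheduleYield);
// alternatives: cudaDeviceScheduleSpin, cudaDeviceBlockingSync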

Oh, don’t be silly! Whether a certain amount of time is important or not depends entirely on the application. CUDA isn’t only for people porting tasks that take hours on a CPU… I like to think that high-speed, real-time processing is important, too.

Well, you are right… Just don’t use cudaThreadSynchronize in this case.

What I mean by manual pipelining is that you schedule your operations for the next iteration first, and only then wait for the previous iteration to finish, using, for example, cuEventSynchronize(). This way the GPU is always working on your next iteration in parallel with whatever the CPU is doing at the moment.
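
Roughly like this (a sketch using the runtime event API instead of cuEventSynchronize; kernel, buf, and useResultsOnCpu are stand-ins for your real code):

cudaEvent_t prevDone;
cudaEventCreate(&prevDone);

kernel<<<blocks, threads>>>(buf[0]);          // prime the pipeline with iteration 0
cudaEventRecord(prevDone, 0);

for (int i = 1; i < N; ++i) {
    kernel<<<blocks, threads>>>(buf[i % 2]);  // queue iteration i first...
    cudaEventSynchronize(prevDone);           // ...then wait for iteration i-1
    useResultsOnCpu(buf[(i - 1) % 2]);        // CPU consumes i-1 while the GPU
                                              // works on iteration i
    cudaEventRecord(prevDone, 0);             // marks the end of iteration i
}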

Sergey.

I see.

In my case, I need the results of the current iteration before I can schedule the next.

But either way, my problem is that the GPU seems not to progress until I call cudaThreadSynchronize(). I expected that would be the case with eventSync as well, but I guess anything is worth a try.

At this point it seems to me that the only way to get the CPU and GPU going in parallel is to create a thread responsible for launching and managing GPU operations (and, of course, setting synchronous operations to wait by yielding periodically). That’s not unreasonable, but I don’t see what use the “asynchronous” CUDA APIs are if the GPU stalls waiting on the CPU at some unknown point.

(somehow managed to reply without putting any text)

I haven’t yet seen a situation where one couldn’t arrange the algorithm in such a way that commands could be submitted in advance. I also doubt you will gain anything by moving all your API calls to a separate thread (I might be wrong, of course).

The application is real-time processing of video from a camera. Part of the processing computes white-balance information from one frame, which is applied to the next. So I get frame n, process it, and get the WB coefficients; then at some time in the future I get frame n+1, which uses those coefficients. But the WB computations are secondary, of course; once I have processed frame n, there is no more work for the GPU to do until frame n+1 is captured.
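
Concretely, the per-frame loop is something like this (a sketch; grabFrame, processFrame, and readBackWB are placeholders for my real code):

WBCoeffs wb = defaultWB();
while (running) {
    Frame f = grabFrame();                         // blocks until frame n arrives
    processFrame<<<blocks, threads>>>(f.dev, wb);  // applies WB from frame n-1
    // ...this is where I'd like the CPU to be doing useful work...
    cudaThreadSynchronize();                       // need frame n's results now
    wb = readBackWB();                             // coefficients for frame n+1
}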

And which steps do you perform on the CPU?

I think we’re getting off-topic.

You should be able to use both the CPU and the GPU at the same time. I thought the CUDA “asynchronous” calls were the way to do that, but it seems not. My best guess is that the CPU code and the code that launches / synchronizes with GPU kernels need to be on separate threads.

That is incorrect. It is very possible to overlap CPU and GPU computation within a single thread - all my linear algebra codes do this as a basic design tenet. CUDA has been fully asynchronous since 1.0 was released three years ago. I am not sure what you are doing wrong (or whether this is actually an instrumentation/measurement problem), but rest assured that you are doing something wrong.
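
For example, here is a minimal sketch of instrumentation that cannot be fooled by the asynchronous launch: time the kernel with GPU-side events and the CPU work with a host timer.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<<blocks, threads>>>();
cudaEventRecord(stop, 0);        // queued after the kernel; does not block

useCpu();                        // runs on the host while the kernel executes

cudaEventSynchronize(stop);      // now wait for the GPU to finish
float kernelMs = 0.0f;
cudaEventElapsedTime(&kernelMs, start, stop);  // pure GPU execution time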