Do the non-async calls sleep or burn CPU?

Does cudaMemcpy burn CPU, or does it allow other threads to run while it is busy? Same question if I call a kernel synchronously: the current thread is blocked of course, but is it playing nice and sleeping, or is it in a hard loop burning CPU?

Just wondering as my CPU load seems to go UP when I apply the GPU to an algorithm over just using the CPU…

I’m not sure whether the various synchronous memcpys “burn” CPU, but I think they do.
As for kernels, I’m not aware of any way to call a kernel in a synchronous manner. You can launch the kernel (which is always asynchronous) and then tell the thread to wait for it with cudaThreadSynchronize(). This is indeed a busy-waiting loop and therefore wastes CPU time.
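
A minimal sketch of that pattern (the kernel `addOne` and the launch configuration are just placeholders, not anything from this thread):

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1.0f to every element of a device array.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void runAndWait(float *d_data, int n)
{
    addOne<<<(n + 255) / 256, 256>>>(d_data, n);  // launch returns immediately (asynchronous)
    cudaThreadSynchronize();                      // blocks the calling thread, spin-waiting until the kernel is done
}
```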

AFAIK cudaMemcpy() (between host and device) is a synchronous function, i.e. it doesn’t return until all data has been copied to the device.

There is some platform-dependent variation here.

In CUDA 1.1, the Windows version yields the thread if the hardware is not yet idle and there are other active threads.

On Linux, CUDA does not yield the thread.

The implicit waits in synchronous memcpy calls are the same as the busy waits exposed by *Synchronize() calls such as cuCtxSynchronize() and cuEventSynchronize().

Why not? Busy looping is a very bad idea for a library that is meant to take load off the CPU.

If I understand the Linux kernel mailing-list sentiment correctly, yielding is not a very smart API anyway (the behaviour is quite undefined); I forget, however, which API was preferred.

Blocking the thread on some file descriptor and waking it up when the device is finished is the usual way to wait for hardware events. I think it would be best in this case as well.

Agreed. The point is to offload work from the CPU, letting the CPU get on with other tasks (for example, generating the next batch of data for the GPU to process).

You absolutely don’t want the CPU burning valuable Hertz while the GPU does its thing.

Launch the kernel, generate the next batch of data or process the results of the previous invocation, and only then synchronize with the GPU.
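
A rough sketch of that ordering, where `processKernel`, `prepareNextBatch`, and `processResults` are hypothetical stand-ins for the application’s own work:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real GPU work.
__global__ void processKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Hypothetical CPU-side helpers (placeholders for the real host work).
void prepareNextBatch(float *h_nextIn)      { (void)h_nextIn;  /* generate the next input batch */ }
void processResults(const float *h_prevOut) { (void)h_prevOut; /* consume the previous results  */ }

void step(const float *d_in, float *d_out, float *h_nextIn,
          float *h_prevOut, float *h_out, int n)
{
    processKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);  // launch returns immediately

    prepareNextBatch(h_nextIn);     // useful CPU work overlaps with the running kernel
    processResults(h_prevOut);

    // Only now pay for the wait: this synchronous copy implicitly waits for the kernel.
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
}
```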

I have other threads that can do work. If the thread that is handling the GPU is spinning inside cudaThreadSynchronize(), those other threads don’t get to do as much work as they could. Ideally, cudaThreadSynchronize() would sleep, waiting on some system event to wake it; that would let the other threads get all the CPU.

GetProcessTimes() (on Windows) returns the CPU usage in user and kernel spaces.

If I run a GPU kernel that does very little, I get a low CPU time.
If I run a GPU kernel that does a lot, I get a high CPU time.

The CPU times should not be affected, should they? GetProcessTimes seems to be counting the time spent on the GPU. That’s why I asked my original question: it looks like running the GPU also uses CPU time :-(
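
Roughly, the measurement looks like this (a minimal Windows-only sketch; `myKernel` stands in for the real kernel):

```cpp
#include <windows.h>
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real GPU work.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Convert a FILETIME (100 ns units) to milliseconds.
static double filetimeToMs(const FILETIME *ft)
{
    ULARGE_INTEGER u;
    u.LowPart  = ft->dwLowDateTime;
    u.HighPart = ft->dwHighDateTime;
    return u.QuadPart / 10000.0;
}

void measureCpuTime(float *d_data, int n)
{
    FILETIME creation, exited, kBefore, uBefore, kAfter, uAfter;

    GetProcessTimes(GetCurrentProcess(), &creation, &exited, &kBefore, &uBefore);

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaThreadSynchronize();   // the spin-wait here shows up as user CPU time

    GetProcessTimes(GetCurrentProcess(), &creation, &exited, &kAfter, &uAfter);

    printf("user CPU time: %.1f ms\n", filetimeToMs(&uAfter) - filetimeToMs(&uBefore));
}
```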

If in your code you launch a kernel followed by a memcpy from device to host, the CPU will spin while waiting for the kernel to end before doing the memcpy.

That is where you see the CPU time coming from.

Denis, thank you for the reply. That would explain that… so I don’t need to call cudaThreadSynchronize before cudaMemcpy device->host then? The cudaMemcpy already contains the sync call?

Yes, that’s right.

Any chance that we will see an update where the CPU thread sleeps rather than spinning? Is there anywhere I can usefully submit a change request?

Finally… now the 8800s are available for Mac towers (and the 8600 on the notebooks) any word of CUDA for Mac?

You can add a sleep yourself: just use the cudaEventQuery function. Be aware that it will add latency to the kernel execution, of course.
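
Something along these lines (a minimal sketch; `myKernel` and the 1 ms sleep interval are arbitrary placeholders):

```cpp
#include <cuda_runtime.h>
#include <unistd.h>   // usleep (POSIX); on Windows, Sleep() from windows.h would be used instead

// Hypothetical kernel standing in for the real GPU work.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void launchAndSleepWait(float *d_data, int n)
{
    cudaEvent_t done;
    cudaEventCreate(&done);

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(done, 0);                    // marks the point after the kernel in stream 0

    // Poll instead of calling cudaThreadSynchronize(); sleeping between polls
    // frees the CPU but can add up to ~1 ms of latency to noticing completion.
    while (cudaEventQuery(done) == cudaErrorNotReady)
        usleep(1000);

    cudaEventDestroy(done);
}
```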

You mean by regularly polling from another thread whether the event has occurred? Isn’t that really ugly…

I didn’t say it wasn’t ugly :) Just that it can be done.
I’ve never tried it except as a simple test, as my targeted applications for CUDA are HPC clusters where it doesn’t matter if the CPUs are running full bore all the time since the entire machine is completely dedicated to that one task. The fact that a spin-wait in the CUDA driver gives the absolute minimum latency after a kernel launch is nice.

What makes the EventQuery method even uglier (if I were to use it) is that the queue depth for kernel calls is only 16 deep, after which an implicit thread-synchronize is performed. My application calls thousands of kernels a second, possibly more than the queue depth in a single step of the algorithm, so I would somehow have to insert many event-query sleep loops in the middle of the calculation: very ugly.

Of course you can sleep yourself, but the point is that the API doesn’t do this automatically, and doing it yourself wastes GPU capacity when you don’t know how long to sleep before the GPU is finished (as with cudaEventQuery). The thread should really wake up only when the GPU is finished, without burning CPU time. It might not be a problem in your case, but in a lot of (realistic) cases you want both the CPU and the GPU to be as busy as possible.

It’s not necessarily so bad to do polling in many cases. In message-passing applications, one often has to kick the underlying message-passing system regularly so that asynchronous processing of incoming and outgoing messages makes forward progress. In many cases the GPU just gets added to such an event loop and becomes one more thing to keep track of.

Apps that do this sort of thing can implement a dynamic load-balancing scheme. By keeping track of timestamps on kernel completions and the like, one can predict how long (either in wall-clock time or in trips through the processing loop) it will be until the GPU needs to be checked on again.

The queue-depth limitation may be the biggest problem to overcome for apps with very fast kernel launch rates. I suspect we’ll see these things evolve and improve as time goes on, however.
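
A rough sketch of folding the GPU into such a polling loop (where `pumpMessages`, `haveMoreWork`, and `launchNextKernel` are hypothetical stand-ins for the application’s own routines):

```cpp
#include <cuda_runtime.h>

// Hypothetical application hooks; real code would call into its own
// message-passing / work-queue machinery here.
static void pumpMessages(void)     { /* e.g. progress MPI or socket traffic */ }
static int  haveMoreWork(void)     { return 0; /* placeholder */ }
static void launchNextKernel(void) { /* queue the next kernel on stream 0 */ }

void eventLoop(void)
{
    cudaEvent_t done;
    cudaEventCreate(&done);

    launchNextKernel();
    cudaEventRecord(done, 0);

    while (haveMoreWork()) {
        pumpMessages();                              // keep the rest of the app making progress

        if (cudaEventQuery(done) == cudaSuccess) {   // non-blocking check: has the GPU finished?
            launchNextKernel();                      // GPU is idle, hand it the next batch
            cudaEventRecord(done, 0);
        }
    }

    cudaEventDestroy(done);
}
```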

Cheers,

John