Do the non-async calls sleep or burn CPU?

Does cudaMemcpy burn CPU, or does it allow other threads to run while it is busy? Same question if I call a kernel synchronously: the current thread is blocked of course, but is it playing nice and sleeping, or is it in a hard loop burning CPU?

Just wondering as my CPU load seems to go UP when I apply the GPU to an algorithm over just using the CPU…

I’m not sure whether the various synchronous memcpys “burn” CPU, but I think they do.
As for kernels, I’m not aware of any way to call a kernel in a synchronous manner. You can launch the kernel (which is always asynchronous) and then tell the thread to wait for it with cudaThreadSynchronize(). This is indeed a busy-waiting loop and therefore wastes CPU time.
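
A minimal sketch of that pattern (the kernel `addOne` and the launch configuration are just placeholders, not anything from this thread):

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1.0f to every element of a device array.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void runAndWait(float *d_data, int n)
{
    addOne<<<(n + 255) / 256, 256>>>(d_data, n);  // launch returns immediately (asynchronous)
    cudaThreadSynchronize();                      // blocks the calling thread, spin-waiting until the kernel is done
}
```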

AFAIK cudaMemcpy() (between host and device) is a synchronous function, i.e. it doesn’t return until all data has been copied to the device.

There is some platform-dependent variation here.

In CUDA 1.1, the Windows version yields the thread if the hardware is not yet idle and there are other active threads.

On Linux, CUDA does not yield the thread.

The implicit waits in synchronous memcpy calls are the same as the busy waits exposed by *Synchronize() calls such as cuCtxSynchronize() and cuEventSynchronize().

Why not? Busy looping is a very bad idea for a library that is meant to take load off the CPU.

If I understand the Linux kernel mailing-list sentiment correctly, yielding is not a very smart API anyway (the behaviour is quite undefined); I forget, however, which API was preferred.

Blocking the thread on some file descriptor and waking it up when the device is finished is the usual way to wait for hardware events. I think it would be best in this case as well.

Agreed. The point is to offload work from the CPU, letting the CPU get on with other tasks (for example, generating the next batch of data for the GPU to process).

You absolutely don’t want the CPU burning valuable Hertz while the GPU does its thing.

Launch the kernel, generate the next batch of data or process the results of the previous invocation, and only then synchronize with the GPU.
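
A rough sketch of that ordering, where `processKernel`, `prepareNextBatch`, and `processResults` are hypothetical stand-ins for the application’s own work:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real GPU work.
__global__ void processKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Hypothetical CPU-side helpers (placeholders for the real host work).
void prepareNextBatch(float *h_nextIn)      { (void)h_nextIn;  /* generate the next input batch */ }
void processResults(const float *h_prevOut) { (void)h_prevOut; /* consume the previous results  */ }

void step(const float *d_in, float *d_out, float *h_nextIn,
          float *h_prevOut, float *h_out, int n)
{
    processKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);  // launch returns immediately

    prepareNextBatch(h_nextIn);     // useful CPU work overlaps with the running kernel
    processResults(h_prevOut);

    // Only now pay for the wait: this synchronous copy implicitly waits for the kernel.
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
}
```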

I have other threads that can do work. If the thread that is handling the GPU is spinning inside cudaThreadSynchronize(), those other threads don’t get to do as much work as they could. Ideally, cudaThreadSynchronize() would sleep, waiting on some system event to wake it; that would let the other threads get all the CPU.

GetProcessTimes() (on Windows) returns the CPU usage in user and kernel spaces.

If I run a GPU kernel that does very little, I get a low CPU time.
If I run a GPU kernel that does a lot, I get a high CPU time.

The CPU times should not be affected, should they? GetProcessTimes seems to be counting the time spent on the GPU. That’s why I asked my original question: it looks like running the GPU also uses CPU time :-(
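
Roughly, the measurement looks like this (a minimal Windows-only sketch; `myKernel` stands in for the real kernel):

```cpp
#include <windows.h>
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real GPU work.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Convert a FILETIME (100 ns units) to milliseconds.
static double filetimeToMs(const FILETIME *ft)
{
    ULARGE_INTEGER u;
    u.LowPart  = ft->dwLowDateTime;
    u.HighPart = ft->dwHighDateTime;
    return u.QuadPart / 10000.0;
}

void measureCpuTime(float *d_data, int n)
{
    FILETIME creation, exited, kBefore, uBefore, kAfter, uAfter;

    GetProcessTimes(GetCurrentProcess(), &creation, &exited, &kBefore, &uBefore);

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaThreadSynchronize();   // the spin-wait here shows up as user CPU time

    GetProcessTimes(GetCurrentProcess(), &creation, &exited, &kAfter, &uAfter);

    printf("user CPU time: %.1f ms\n", filetimeToMs(&uAfter) - filetimeToMs(&uBefore));
}
```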

If in your code you launch a kernel followed by a memcpy from device to host, the CPU will spin while waiting for the kernel to end before doing the memcpy.

That is where you see the CPU time coming from.

Denis, thank you for the reply. That would explain that… so I don’t need to call cudaThreadSynchronize before cudaMemcpy device->host then? The cudaMemcpy already contains the sync call?

Yes, that’s right.

Any chance that we will see an update where the CPU thread sleeps rather than spinning? Is there anywhere I can usefully submit a change request?

Finally… now the 8800s are available for Mac towers (and the 8600 on the notebooks) any word of CUDA for Mac?

You can add a sleep yourself: just use the cudaEventQuery function. Be aware that it will add latency to the kernel execution, of course.
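
Something along these lines (a minimal sketch; `myKernel` and the 1 ms sleep interval are arbitrary placeholders):

```cpp
#include <cuda_runtime.h>
#include <unistd.h>   // usleep (POSIX); on Windows, Sleep() from windows.h would be used instead

// Hypothetical kernel standing in for the real GPU work.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void launchAndSleepWait(float *d_data, int n)
{
    cudaEvent_t done;
    cudaEventCreate(&done);

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(done, 0);                    // marks the point after the kernel in stream 0

    // Poll instead of calling cudaThreadSynchronize(); sleeping between polls
    // frees the CPU but can add up to ~1 ms of latency to noticing completion.
    while (cudaEventQuery(done) == cudaErrorNotReady)
        usleep(1000);

    cudaEventDestroy(done);
}
```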

You mean by regularly polling from another thread whether the event has occurred? Isn’t that really ugly…

I didn’t say it wasn’t ugly :) Just that it can be done.
I’ve never tried it except as a simple test, as my targeted applications for CUDA are HPC clusters where it doesn’t matter if the CPUs are running full bore all the time since the entire machine is completely dedicated to that one task. The fact that a spin-wait in the CUDA driver gives the absolute minimum latency after a kernel launch is nice.

What makes the EventQuery method even uglier (if I were to use it) is that the queue depth for kernel calls is only 16 deep, after which an implicit thread-synchronize is performed. My application calls thousands of kernels a second, possibly more than the queue depth in a single step of the algorithm, so I would somehow have to insert many event-query sleep loops in the middle of the calculation: very ugly.

Of course you can sleep yourself, but the point is that the API doesn’t do this automatically, and doing it yourself wastes GPU capacity when you don’t know how long to sleep before the GPU is finished (as with cudaEventQuery). The thread should really wake up only when the GPU is finished, without burning CPU time. It might not be a problem in your case, but in a lot of (realistic) cases you want both the CPU and the GPU to be as busy as possible.

It’s not necessarily so bad to do polling in many cases. In message-passing applications, one often has to kick the underlying message-passing system regularly so that asynchronous processing of incoming and outgoing messages makes forward progress. In many cases the GPU just gets added to such an event loop and becomes one more thing to keep track of.

Apps that do this sort of thing can implement a dynamic load-balancing scheme. By keeping track of timestamps on kernel completions and the like, one can predict how long (either in wall-clock time or in trips through the processing loop) it will be until the GPU needs to be checked on again.

The queue-depth limitation may be the biggest problem to overcome for apps with very fast kernel launch rates. I suspect we’ll see these things evolve and improve as time goes on, however.
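
A rough sketch of folding the GPU into such a polling loop (where `pumpMessages`, `haveMoreWork`, and `launchNextKernel` are hypothetical stand-ins for the application’s own routines):

```cpp
#include <cuda_runtime.h>

// Hypothetical application hooks; real code would call into its own
// message-passing / work-queue machinery here.
static void pumpMessages(void)     { /* e.g. progress MPI or socket traffic */ }
static int  haveMoreWork(void)     { return 0; /* placeholder */ }
static void launchNextKernel(void) { /* queue the next kernel on stream 0 */ }

void eventLoop(void)
{
    cudaEvent_t done;
    cudaEventCreate(&done);

    launchNextKernel();
    cudaEventRecord(done, 0);

    while (haveMoreWork()) {
        pumpMessages();                              // keep the rest of the app making progress

        if (cudaEventQuery(done) == cudaSuccess) {   // non-blocking check: has the GPU finished?
            launchNextKernel();                      // GPU is idle, hand it the next batch
            cudaEventRecord(done, 0);
        }
    }

    cudaEventDestroy(done);
}
```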

Cheers,

John