Using GPU<->CPU polling to reduce overhead

Hi all,

I read in this forum that typical overheads for CUDA drivers are:

  • call of GPU function from CPU: ~30 us.
  • Memory read or write: ~15 us

I'm trying to use CUDA in a servo loop system. The servo loop
runs at 1 kHz, and the goal is to process each sample within 100 us.
(OS jitter is not an issue.)

So if I do:

  • Transfer inputs from CPU to GPU
  • Launch kernel on GPU
  • Read back the outputs from GPU to CPU

I will get something like 70 us of overhead (eating most of my 100 us budget :( )

So my question is: can I have a GPU kernel polling for new inputs,
and a CPU thread polling for new outputs, to reduce the overhead?

The GPU kernel would look like:

  • while (1)
    – read a flag in device memory
    – if flag changed
      — process inputs from device memory
      — write outputs to device memory

The CPU thread would look like:

  • while (1)
    – read inputs from some I/O
    – write inputs + flag to device memory
    – while (1)
      — read outputs from device memory
      — if new outputs are ready, break
    – write outputs to some I/O
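
In rough CUDA terms, the scheme would look something like this (a hypothetical sketch only; all names are made up, and as the replies below explain, a kernel cannot actually be made to wait like this):

```cuda
// Hypothetical sketch of the polling idea (flag, in, out, ready are made-up
// names). As the replies point out, nothing guarantees that a kernel spinning
// like this will ever observe host writes, so this does NOT actually work.
__global__ void servoKernel(volatile int *flag, volatile float *in,
                            volatile float *out, volatile int *ready)
{
    int last = 0;
    while (1) {
        while (*flag == last)        // wait for the CPU to publish new inputs
            ;
        last = *flag;
        out[threadIdx.x] = 2.0f * in[threadIdx.x];  // "process" the sample
        __syncthreads();
        if (threadIdx.x == 0)
            *ready = last;           // tell the CPU the outputs are ready
    }
}
```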

This would at least remove 30us of overhead :)

The GPU cannot poll for new inputs. The only way to give the GPU work is to launch a kernel.

CUDA 1.1 adds functionality allowing the CPU to query the status of GPU operations. In short, you can insert events into a stream of CUDA calls (memcopies, kernels). Your CPU code can then query whether a particular event has been recorded (meaning all preceding CUDA operations have completed). The query is non-blocking. You can also call a blocking function that returns only after an event has been recorded. And, of course, you still have the API from CUDA 1.0.
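
A minimal use of the new event API might look like this (a sketch with made-up names, not the exact SDK sample; error checking omitted):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d) { d[threadIdx.x] *= 2.0f; }

// Sketch: queue a copy, a kernel, and a copy back, mark the end with an
// event, then poll (or block on) that event from the CPU.
void runOnce(float *h, float *d, int n)
{
    cudaEvent_t done;
    cudaEventCreate(&done);

    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, 0);
    process<<<1, n>>>(d);
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, 0);
    cudaEventRecord(done, 0);   // recorded once everything above completes

    while (cudaEventQuery(done) == cudaErrorNotReady)
        ;                       // non-blocking query; useful CPU work goes here
    // alternatively: cudaEventSynchronize(done);  // the blocking variant

    cudaEventDestroy(done);
}
```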


Thank you for your answer.

Why exactly can't the GPU poll for inputs?

I thought that, as both the CPU and the GPU can access the device memory, it
was possible to set a synchronization flag there.

Yes, this sounds quite impossible: you'd need to have the GPU busy-looping in an infinite loop waiting for the value to change, during which you cannot do transfers to/from it.

Sounds like an awesome application! Any chance you would be able/willing to describe what your project is?

The new CUDA 1.1 has improved this some. Obviously timings will depend on the amount of memory you need to copy, but I’ve made measurements of single float copies and the time to execute an empty kernel with a single float parameter.

Memcpy device to host: 15us
Memcpy host to device: 15us
Kernel call (1000 blocks, 128 threads each): 10 us
Kernel call (100 blocks, 128 threads each): 5 us
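
A launch-overhead measurement along those lines can be reproduced with something like the following (a sketch; the numbers will vary with hardware and driver):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>

__global__ void empty(float x) { }   // empty kernel with one float parameter

int main(void)
{
    empty<<<100, 128>>>(0.0f);       // warm up and create the context
    cudaThreadSynchronize();

    const int N = 1000;
    struct timeval t0, t1;
    gettimeofday(&t0, 0);
    for (int i = 0; i < N; ++i) {
        empty<<<100, 128>>>(0.0f);
        cudaThreadSynchronize();     // count the full launch latency each time
    }
    gettimeofday(&t1, 0);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("kernel call: %.1f us\n", us / N);
    return 0;
}
```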

Even larger kernel launches require more overhead still.

You can also do asynchronous memory copies in CUDA 1.1. So if there is anything that needs to be done on the CPU, and can be done while waiting for your result to come back, it can be overlapped.
Edit: one additional point. If the input data you need to copy to your kernel is only a float (or a couple of floats), pass the values as arguments to the kernel instead of using memcpys. The overhead of transferring them to the device is then included in the overhead of the kernel call.
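
For example, instead of paying ~15 us for a memcpy of one float, the value can ride along with the launch itself (sketch; `gain` and `servo` are made-up names):

```cuda
// Instead of:
//   cudaMemcpy(d_gain, &gain, sizeof(float), cudaMemcpyHostToDevice);
//   servo<<<grid, block>>>(d_gain, d_data);
// pass the scalar directly; kernel arguments travel with the launch:
__global__ void servo(float gain, float *data)
{
    data[threadIdx.x] *= gain;
}

// host side:
//   servo<<<grid, block>>>(gain, d_data);
```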

Did you also time larger kernels? For example, do you know if the 100% CPU usage problem while the CPU is waiting for the GPU still exists?

The CPU does not wait for the kernels (even when several are issued) when the streaming API is used.

There’s an asyncAPI sample in the SDK that launches a kernel and memcopies asynchronously, then has the CPU go into a spin loop, waiting for the GPU operations to complete. It also displays the time spent by the GPU as well as the time spent by the CPU in the CUDA calls.


The normal CUDA 1.0 API memcpy calls and the like still block with 100% CPU usage as before. So do cudaThreadSynchronize() and the new cudaEventSynchronize(), which lets you sync to a particular recorded point in the stream.

I’ve played around with the new streaming API and it is possible to write a while(1) loop with a usleep(1) (perhaps a yield would work too, I didn’t try) that checks cudaEventQuery to wait until the stream of async calls has caught up. There are two gotchas that may or may not be a problem depending on your application.

  1. While the sleeping wait gets CPU usage down to 0, it does introduce an added latency in determining that end point.
  2. If I run 16 kernel calls in a row, they all run asynchronously. The call to the 17th (and 18th and 19th …) blocks with 100% CPU usage. I’m guessing this is because of some “queue depth” for async kernel calls. I may have missed it, but I didn’t see this limit mentioned in the documentation.
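
The sleeping wait described above might look like this (sketch, Linux `usleep`):

```cuda
#include <unistd.h>
#include <cuda_runtime.h>

// Poll the event with a short sleep so the CPU is not pegged at 100%.
// The price is some extra latency in noticing that the GPU has finished.
void waitQuietly(cudaEvent_t done)
{
    while (cudaEventQuery(done) == cudaErrorNotReady)
        usleep(1);
}
```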

If you are running small numbers of long kernels, I see the new streaming API as very useful. For my application (which runs 2000+ short kernel calls per second), its usefulness seems limited. At least I can now overlap CPU calculations with memory transfers; there are a few places I can make use of that.

Yes, the asyncAPI makes it a lot less common to have to wait for the GPU, but sometimes you do need the result as soon as possible, and thus dedicate a CPU thread to waiting.

In earlier CUDA versions (up to 1.0?) this means the CUDA library goes into a busy loop polling a memory location, effectively requiring 100% CPU, which is not good in a multithreaded program if you are using the GPU to offload the CPU in the first place.

As MisterAnderson states, you can make the polling less CPU intensive by putting a nanosleep in it. But still, this is a case in which using a GPU->CPU interrupt and selective wakeup would be much better.

I know there is a TRAP instruction in PTX according to the PTX ISA doc, which should generate a host interrupt; I just don't know how to use it :)

Reading all your interesting answers, my understanding is that it should be possible
to spin-loop on the GPU to wait for inputs, and spin-loop on the CPU to wait for outputs.
Using 100% CPU while the GPU is processing one packet is not an issue for me.
There’s plenty of time between packets for other processes.

But as MisterAnderson42 noted, the overhead for a kernel call may be small (5 us) compared to the memory transfer latency (15 us).
So fighting to reduce the kernel-call overhead is probably pointless.

The greatest solution would be to use remote DMA to feed the GPU directly through
an InfiniBand link. But it would require the GPU memory to be visible from other PCIe slots.

Does anybody know if this is feasible?

It is not possible to spin loop on the GPU and wait for inputs. The block execution architecture just doesn’t mesh with the concept of waiting.

About the InfiniBand DMA: it has been mentioned in the past on these forums that a method for such a transfer is under consideration, but it doesn't appear on any of the planned-feature lists, so don't expect it soon.

Interesting idea.

Have you determined that this is not the case? Note that I have no idea how InfiniBand or inter-card memory mapping works, but the first 256 MB of card memory is always mapped into CPU physical memory. Doesn't this make it possible for other cards to write into it, or is another step needed for that?