DMA and pinned memory


I am interested in trying to perform real-time image processing using a PCIe framegrabber and a GPU. I would like to transfer data from the framegrabber into the GPU for processing. I understand from previous posts on this forum that it is not possible to perform a pcie endpoint-to-endpoint transfer into GPU memory. Instead, I should declare some pinned memory via cudaHostAlloc(), arrange to have the framegrabber DMA into this memory, and at the end of the transfer I should have the GPU perform its own DMA via cudaMemcpyAsync().

My question to the forum is to ask what the best way of signalling the GPU that the framegrabber DMA has completed. The framegrabber itself issues an interrupt at the end of frame transfer, which can be caught by the host via the API provided by the framegrabbers driver. Is there a preferred mechanism for host - gpu signalling? Latency is an issue for me, so I would like to minimize overhead in this operation.

Many thanks

The most reliable and best supported host<->GPU signaling method is a kernel launch. :)

Have you timed how long a no-op kernel (use a realistic block and grid configuration, since that affect this) launch takes? Does it meet your latency requirement?

I’m trying to signal the GPU that the framegrabber DMA into host memory has completed, so that the GPU can access this memory. Can you tell me how a kernel launch on the GPU accomplishes this?

Many thanks

When the frame grabber signals to the CPU that the DMA is complete, the CPU thread launches the CUDA kernel to process it.

Really, the thrust of my question was whether the latency in waiting and launching the kernel to process your frame when it is ready (rather than having it already running on the GPU waiting for a signal from the host to start) is acceptable.

Ah, I understand your question now - I’m just beginning to look at CUDA, so I’m a little slow on the uptake.

I have read on this forum that the latency associated with launching the kernel is of order ten microseconds. (Different posts have stated various numbers). For my app, this would be a substantial overhead. It would be much better to have the kernel running on the GPU and then signal from the host to start reading memory. Is there a way to do so?

Many thanks

You should assume that there isn’t.

I should assume that there isn’t, but there really is?

siebert states: “The most reliable and best supported host<->GPU signaling method is a kernel launch”

Both of you seem to be implying that less reliable and more poorly supported methods are available for performing host to gpu signalling with lower latency.

Ten microseconds is quite a long time for me - 1/10th of my latency budget, which needs to include memory transfers and processing. To have to relaunch the kernel and incur this delay for each and every image frame transferred from the framegrabber is not a solution that I’m inclined to take to my employer. This signalling doesn’t seem like it should be difficult.

Many thanks

No there isn’t. The last couple of generations of GPUs support “zero copy” memory, which allows the GPU to directly read and write host memory, but the model doesn’t guarantee read after write coherence without a synchronization event between the two (and that has to be the kernel on the GPU exiting).

If you are really bound by a total latency budget of a 100 microseconds, then I doubt CUDA is really what you need. FPGAs driven by a hard RTOS sounds more like what you are after.

Thank you for responding, though your response was not very encouraging. I would like to try one last avenue of inquiry.

From your note, it seems that the issue lies in ensuring read/write coherence for the gpu. From this, I take it to mean that a gpu write to host memory followed by a gpu read from host memory is not guaranteed to return what was written.

I’m actually interested in the case where the gpu does not write to the host memory at all, but just receives a signal from the host. That is, the host is guaranteeing memory coherence at the time of the signal. This seems like a weaker requirement than the one you described.
Is it possible that cuda could support this lesser case?

If launch overhead is important, you should definitely take a look at the new Fermi GPUs. Just wrote another microbenchmark to see what the execution time is on a no-op kernel:

Device name: GeForce GTX 295:

Launch config: 1000 blocks, 256 threads per block

No op, synchronized: 18.287 us per launch

No op, unsynchronized: 8.993 us per launch
Device name: GeForce GTX 470:

Launch config: 1000 blocks, 256 threads per block

No op, synchronized: 11.391 us per launch

No op, unsynchronized: 4.613 us per launch

This is an Intel Core i7, ASUS P6T7 Supercomputer Motherboard and the CUDA 3.1 beta. The synchronized benchmark includes a cudaThreadSynchronized() after each kernel launch, simulating the case of having to wait on the CPU side for the kernel to finish before continuing. The unsynchronized case just launches kernels as fast as possible. (The driver queue will still run kernels in order, of course.)

(Edit: OS is Ubuntu 9.10, 64-bit.)

These numbers depend pretty strongly on the number of blocks. If I do 512 threads per block and 42 blocks, then the GTX 470 gets 8.085 us for synchronized no ops, and unsynchronized no ops are 1.688 us.

the latency of a kernel launch is more like 2.2us, not 10.

<-- the guy who does more latency measurements than everyone else

The issues are the same the other way also. There is a white paper in the SDK that discusses this API functionality and it says this:

(my emphasis added)

With every new release, I keep wishing that this is the makings of a semaphore mechanism for a persistent kernel, and I keep being told (and finding out) that it isn’t.

It might be more fun if you guys started mentioning OSes that you run those benchmarks on.

Ubuntu 9.10, 64-bit. (also added to post above)

Guys, please also state if you’re using driver or runtime API with those microbenchmarks, it does make a difference.