Latency host <-> device memory for small blocks


I am currently evaluating CUDA for a low-latency real-time signal processing application. As a first step, I measured the time to transfer small blocks of memory from the host to the device and vice versa. The system has two Xeon E5440 processors and a Quadro FX 3700 (G92) connected via PCIe 2.0 x16. Interestingly, the transfer time for blocks up to 1 KB is constant (about 10 us), no matter how large the block actually is. Does anybody know whether this is a hardware issue, e.g. because the PCIe bus always transfers at least 1 KB? If it's a software issue, is there a way to configure the minimum size of a block to be transferred, e.g. by using the Driver API instead of C for CUDA? Thank you very much in advance for your help.

Kind regards

The overhead of the queuing framework for both kernel launches and memory copies is roughly 5-15 us, so I suspect this is what you’re hitting.
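For anyone who wants to reproduce this kind of measurement, here is a minimal sketch (not the original poster's code). It assumes pinned host memory and a reasonably recent CUDA runtime; the helper name avg_h2d_copy_us and the size range are made up for illustration. If the fixed queuing overhead dominates, the printed times should stay roughly flat for the small sizes.

```cuda
#include <chrono>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Average wall-clock time in microseconds for one blocking host-to-device
// copy of `bytes` bytes, taken over `reps` repetitions.
// (Hypothetical helper, not part of the CUDA API.)
static double avg_h2d_copy_us(void* d_dst, const void* h_src, size_t bytes, int reps) {
    cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);  // warm-up, not timed
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / reps;
}

int main() {
    const size_t max_bytes = 64 * 1024;
    void* h_buf = nullptr;
    void* d_buf = nullptr;
    // Pinned host memory avoids an extra staging copy inside the driver.
    if (cudaMallocHost(&h_buf, max_bytes) != cudaSuccess ||
        cudaMalloc(&d_buf, max_bytes) != cudaSuccess) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }
    std::memset(h_buf, 0, max_bytes);

    // Double the block size each step and print the average copy time.
    for (size_t bytes = 4; bytes <= max_bytes; bytes *= 2)
        std::printf("%6zu bytes: %.2f us\n", bytes,
                    avg_h2d_copy_us(d_buf, h_buf, bytes, 1000));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

Swapping the direction to cudaMemcpyDeviceToHost (and the pointer order) gives the device-to-host numbers.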

If you want lower latency than that (!), it will be tricky, but it may be possible with some cleverness: skip the queue and run a persistent kernel that polls system memory via zero-copy memory mapping (a feature added in CUDA 2.2). That’s a pretty advanced strategy, though, and I am not aware of anyone having tried it yet.
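To make the idea concrete, here is a rough sketch of a persistent kernel polling a mailbox in mapped pinned host memory. Treat it as an illustration under assumptions, not a tested recipe: the Mailbox layout, the function names and the "double the value" workload are all made up, the fences reflect my reading of the memory model, and on a GPU that also drives a display the watchdog may kill a long-running kernel.

```cuda
#include <atomic>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical mailbox in mapped pinned host memory, visible to both sides.
struct Mailbox {
    volatile int value;   // payload, written by the host
    volatile int seq;     // host bumps this to post a request; negative means "quit"
    volatile int result;  // reply, written by the device
    volatile int ack;     // device sets this to seq once result is valid
};

// Persistent kernel: a single thread spins on the mailbox until told to quit.
__global__ void persistent_worker(Mailbox* mb) {
    int last = 0;
    for (;;) {
        int s = mb->seq;             // every read crosses the bus to host memory
        if (s < 0) break;
        if (s == last) continue;     // nothing new yet
        __threadfence_system();      // order the payload read after the seq read
        mb->result = 2 * mb->value;  // stand-in for real processing
        __threadfence_system();      // publish the result before the ack
        mb->ack = s;
        last = s;
    }
}

// Sends n requests; returns how many replies were correct, or -1 on setup failure.
int round_trips_ok(int n) {
    cudaSetDeviceFlags(cudaDeviceMapHost);  // required on older setups
    cudaGetLastError();                     // clear a possible "already set" error
    Mailbox* h_mb = nullptr;
    Mailbox* d_mb = nullptr;
    if (cudaHostAlloc(reinterpret_cast<void**>(&h_mb), sizeof(Mailbox),
                      cudaHostAllocMapped) != cudaSuccess)
        return -1;
    h_mb->value = 0; h_mb->seq = 0; h_mb->result = 0; h_mb->ack = 0;
    if (cudaHostGetDevicePointer(reinterpret_cast<void**>(&d_mb), h_mb, 0) != cudaSuccess) {
        cudaFreeHost(h_mb);
        return -1;
    }

    persistent_worker<<<1, 1>>>(d_mb);
    if (cudaGetLastError() != cudaSuccess) {
        cudaFreeHost(h_mb);
        return -1;
    }
    cudaStreamQuery(0);  // nudge drivers that batch launches to submit the kernel now

    int ok = 0;
    for (int i = 1; i <= n; ++i) {
        h_mb->value = i;
        std::atomic_thread_fence(std::memory_order_release);  // payload before seq
        h_mb->seq = i;
        while (h_mb->ack != i) { /* spin: no CUDA call on the critical path */ }
        std::atomic_thread_fence(std::memory_order_acquire);  // ack before result
        if (h_mb->result == 2 * i) ++ok;
    }

    h_mb->seq = -1;           // ask the kernel to exit
    cudaDeviceSynchronize();  // wait for it to finish
    cudaFreeHost(h_mb);
    return ok;
}

int main() {
    const int n = 5;
    std::printf("%d of %d round trips returned the expected value\n", round_trips_ok(n), n);
    return 0;
}
```

Timing the spin loop on the host side would give the round-trip latency without any kernel launch or cudaMemcpy in the path, which is the whole point of the approach.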

The lowest-latency CUDA devices are the integrated GPUs, since they hang right on the system memory bus and you don’t need the extra memory hops. That would only give you the relatively weak 9400M for processing power, but it’d likely have the smallest latency of any GPU.

Incidentally, I’m hoping someone (like tmurray) gets ahold of one of these new Lynnfield Intel processors that go into the LGA-1156 socket motherboards. They integrate the PCI Express 2.0 x16 controller onto the processor itself, so I’m curious if that changes the CUDA latency any…


I doubt that it will have a significant impact on latency. I’ll ask around.

I’ve always wondered that too. From what I see, it seems like many motherboards are using those 16 lanes to split into two x8 slots. (sigh…)

But if there aren’t big latency improvements from adding it to the CPU die, then why bother? CPU area is precious, so they must have some reason.

That reason may not be performance, though. It may be some kind of design savings, allowing simpler bridge chips with no PCIe support of their own at all.

This would also remove some extra traces on the motherboard and thus help motherboard manufacturing costs.

That was probably part of the reasoning. These new processors drop the high-speed QPI interface (25 GB/sec) present on the LGA-1366 chips in favor of something called DMI (2 GB/sec). The slower interface cannot support a PCIe 2.0 x16 link, so they pushed the PCIe controller onto the chip. I don’t know if this was just platform cost-saving (though it did make the die size go up a little) or if it is also designed to improve latency, much like the shift of the memory controller from the Northbridge to the processor itself back when the Opteron came out.

I suspect most PCIe-limited kernels are still being held back by bandwidth rather than latency, but as the PCIe standard grows in clock rate, latency might emerge as the performance killer for mid-sized problems.
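A back-of-the-envelope model shows where the split between latency-bound and bandwidth-bound copies falls. This is only a sketch: it assumes copy time = fixed overhead + bytes / bandwidth, takes the roughly 10 us overhead from the first post, and uses the theoretical PCIe 2.0 x16 peak of 8 GB/s per direction (500 MB/s per lane times 16 lanes), which real transfers will not reach. The function names are made up.

```cuda
#include <cstdio>

// Modeled copy time in microseconds: fixed overhead plus time on the wire.
// bytes_per_us is the link bandwidth (8 GB/s equals 8000 bytes per microsecond).
double copy_time_us(double bytes, double overhead_us, double bytes_per_us) {
    return overhead_us + bytes / bytes_per_us;
}

// Transfer size at which wire time equals the fixed overhead.
// Below this size the copy is dominated by latency, above it by bandwidth.
double crossover_bytes(double overhead_us, double bytes_per_us) {
    return overhead_us * bytes_per_us;
}

int main() {
    const double overhead_us = 10.0;     // assumption: figure measured in the first post
    const double bytes_per_us = 8000.0;  // assumption: theoretical PCIe 2.0 x16 peak

    for (double bytes = 64.0; bytes <= 1048576.0; bytes *= 4.0)
        std::printf("%8.0f bytes: %8.3f us\n", bytes,
                    copy_time_us(bytes, overhead_us, bytes_per_us));

    std::printf("crossover at about %.0f bytes\n",
                crossover_bytes(overhead_us, bytes_per_us));
    return 0;
}
```

Under these assumptions a 1 KB block spends only about 0.13 us on the wire, which would explain why everything up to 1 KB looks like a constant 10 us, and the crossover lands around 80 KB. A faster link pushes the crossover higher, so more problem sizes end up latency-bound.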

So, thinking ahead to the future, does the PCIe spec allow bus mastering? So on some future hardware, perhaps the GPUs could find each other and talk to each other with no CPU involved at all?

This will obviously be very useful in the future. (It’d be useful NOW for GPU<->GPU memory transfers!)

As stated above, this is where you could use mapped pinned memory to enable asynchronous communication between CPU and GPU, with the communication queues living in host memory (it doesn’t even need atomic operations :-) ).

You could communicate GPU->Memory->GPU, CPU->Memory->GPU, or GPU->Memory->CPU, in any direction, with really low latency (well under a microsecond!), without stopping kernel execution or interfering with CPU execution.

Yes, it appears that bus mastering is in the spec:…20An%20FPGA.pdf