"Reduced latency" programming for GPUs? Compiling a list of tips and techniques.

GPUs are known as a throughput-driven technology, whereas CPUs are latency-driven. However, there are ways to reduce latency on GPUs while still making use of their awesome number-crunching capability. That is to say, there are ways to program GPUs that reduce end-to-end latency for a task. Examples: 1) on-the-fly compression to move data onto and off of the GPU card quickly across the PCI-express bus (often the bottleneck), 2) persistent kernels (fewer, long-running) or many small kernels can each improve latency, depending on the problem.

In answering, it may help to have this working definition of latency in mind: latency is the time it takes for a system to respond to a stimulus and produce a result. To give a concrete example: a piece of data arrives over a network to an application running on a computer; start the clock, call this t0; the application bundles up the data and sends it to the GPU; a kernel is launched that reads the data and writes results to GPU memory; the results are then transferred back to the application; stop the clock, call this t1. The latency of this system is t1 - t0.
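To make the t0/t1 measurement concrete, here is a minimal sketch of that round trip in CUDA; the `process` kernel and buffer sizes are hypothetical placeholders, and pinned host memory (`cudaMallocHost`) is used since that is what you would want for fast transfers:

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real work on the data.
__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int n = 1 << 20;                  // assumed problem size
    const size_t bytes = n * sizeof(float);

    float *h_in, *h_out;
    cudaMallocHost(&h_in, bytes);           // pinned host buffers
    cudaMallocHost(&h_out, bytes);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    // t0: the data has just "arrived" in host memory.
    auto t0 = std::chrono::steady_clock::now();

    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    process<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // blocks until done

    // t1: the result is back on the host.
    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("end-to-end latency: %.3f ms\n", ms);
    return 0;
}
```

Note that both `cudaMemcpy` calls and the kernel launch overhead all land inside t1 - t0, which is exactly why the transfer-side tricks discussed here matter.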

Tips, tricks, techniques, thoughts and ideas?

Improving latency by compressing data to be sent over the PCI-express bus is not a good idea, in my opinion. Although the bandwidth of the PCI-express bus is low compared to GPU memory bandwidth, it is still around 5 GB/s. Compressing data at an outgoing rate of 5 GB/s is hard in software (and the incoming rate must be even higher, otherwise you don't have compression), given that CPU memory bandwidth to DDR is about 20-30 GB/s. On-the-fly compression in hardware might help, but I guess this is very costly (it takes a lot of hardware to get decent performance).
With PCI-express 3.0 on the doorstep, this bottleneck is sort of solved (for now at least), or you can have a look at CPU-GPU integrated devices, which don’t need PCI-express memory copies.

A better approach (in my opinion) to reduce latency is to chop up the data to be processed into parts, and overlap (in time) transferring one part to the GPU with processing another part on the GPU.
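A minimal sketch of that chunking scheme using CUDA streams; the `process` kernel is a hypothetical placeholder, the chunk count is arbitrary, and the host buffers are assumed to be pinned (allocated with `cudaMallocHost`), which the async copies require to actually overlap:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real per-chunk work.
__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// h_in and h_out must be pinned host memory for the async copies to overlap.
void run_chunked(const float* h_in, float* h_out, int n) {
    const int nstreams = 4;
    const int chunk = n / nstreams;   // assumes n divisible by nstreams
    const size_t cbytes = chunk * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaStream_t streams[nstreams];
    for (int s = 0; s < nstreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, processes it, and copies it back;
    // the hardware overlaps one stream's transfer with another's compute.
    for (int s = 0; s < nstreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_in + off, h_in + off, cbytes,
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
            d_in + off, d_out + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, cbytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nstreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

In the ideal case the transfer time for all but the first and last chunks hides behind compute (or vice versa), so end-to-end latency approaches max(transfer, compute) plus one chunk's worth of the other, instead of their sum.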