GPUs are known as a throughput-oriented technology, whereas CPUs are latency-oriented. However, there are ways to reduce latency on GPUs while still exploiting their raw number-crunching capability. That is to say, there are ways to program GPUs that reduce end-to-end latency for a task. Two examples: 1) on-the-fly compression, to move data onto and off of the GPU quickly across the PCIe bus (often the bottleneck), and 2) persistent kernels (fewer launches) or, conversely, splitting work into more, smaller kernels; either can improve latency depending on the problem. A sketch of the persistent-kernel idea follows below.
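To make point 2 concrete, here is a minimal sketch of a persistent kernel, under these assumptions: a single producer on the host, a single block on the device, a 64-bit build with unified virtual addressing (so the mapped host pointers can be passed straight to the kernel), and a GPU where a briefly spinning kernel won't trip a display watchdog. All names (`persistent_kernel`, `work_flag`, `work_buf`) are hypothetical, not from any library. The kernel is launched once; each subsequent work item costs no kernel-launch overhead because the device just polls a flag in zero-copy host memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Launched once; spins on a flag in mapped (zero-copy) host memory so that
// each new work item pays no kernel-launch or explicit-copy overhead.
__global__ void persistent_kernel(volatile int* work_flag, float* work_buf, int n) {
    while (true) {
        if (threadIdx.x == 0)
            while (*work_flag == 0) { /* spin until the host publishes work */ }
        __syncthreads();
        if (*work_flag < 0) return;                 // negative flag: shut down
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            work_buf[i] *= work_buf[i];             // placeholder "work"
        __threadfence_system();                     // flush results toward the host
        __syncthreads();
        if (threadIdx.x == 0) *work_flag = 0;       // signal completion
    }
}

int main() {
    const int N = 1024;
    cudaSetDeviceFlags(cudaDeviceMapHost);          // allow mapped host memory
    int* flag; float* buf;
    cudaHostAlloc((void**)&flag, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void**)&buf, N * sizeof(float), cudaHostAllocMapped);
    *flag = 0;
    for (int i = 0; i < N; ++i) buf[i] = (float)i;
    persistent_kernel<<<1, 256>>>(flag, buf, N);    // launched exactly once
    *(volatile int*)flag = 1;                       // publish one work item
    while (*(volatile int*)flag != 0) { }           // wait for the result
    printf("buf[3] = %f\n", buf[3]);                // expect 9.0
    *(volatile int*)flag = -1;                      // tell the kernel to exit
    cudaDeviceSynchronize();
    cudaFreeHost(flag); cudaFreeHost(buf);
    return 0;
}
```

The design trade-off: a launch-per-item pipeline pays a few microseconds of launch overhead per item, while the persistent version replaces that with a host-to-device flag write over PCIe, which can be cheaper for small, frequent work items.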
In answering, it may help to have this working definition of latency in mind: latency is the time it takes for a system to respond to a stimulus and produce a result. A concrete example: a piece of data arrives over a network to an application running on a computer; start the clock, and call this t0. The application bundles up the data and sends it to the GPU; a kernel is launched that reads the data and writes results to GPU memory; the results are then transferred back to the application; stop the clock, and call this t1. The latency of this system is t1 - t0.
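For concreteness, here is a minimal harness for measuring exactly that t0-to-t1 window, assuming the "stimulus" is already sitting in host memory; `echo_kernel` is a hypothetical stand-in for the real work. The final `cudaMemcpy` synchronizes the default stream, so t1 is taken only after the result is back on the host.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void echo_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;               // stand-in for real work
}

int main() {
    const int N = 1 << 20;
    std::vector<float> h_in(N, 1.0f), h_out(N);
    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    auto t0 = std::chrono::steady_clock::now();     // stimulus arrives: start the clock
    cudaMemcpy(d_in, h_in.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    echo_kernel<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
    cudaMemcpy(h_out.data(), d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    auto t1 = std::chrono::steady_clock::now();     // result in host memory: stop

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("end-to-end latency: %.3f ms\n", ms);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

One note on the transfers themselves: staging through pinned host memory (`cudaHostAlloc`) rather than pageable `std::vector` storage typically shortens both copies, which matters when the PCIe bus is the bottleneck as described above.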
Tips, tricks, techniques, thoughts and ideas?