Maximising Performance of an Application

Hi, I would like to use all the optimisation techniques possible to gain an understanding of what they do, so that I may apply them to applications I build in the future and hopefully obtain near-maximum instruction throughput. Please let me know if I'm missing any, many thanks.

The techniques I am aware of so far are:

    Shared Memory (see the transpose sketch after this list)

    Coalesced Global Memory Access (also shown in the transpose sketch)

    Prefetching Data (see the latency-hiding sketch below)

    Using one large kernel rather than several small kernels to reduce kernel-launch overhead (see the fusion sketch below)

    Hiding global memory latency by keeping many global memory accesses in flight (also in the latency-hiding sketch)
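
To make the first two list items concrete, here is a minimal sketch of a tiled matrix transpose, assuming a square matrix whose width is a multiple of TILE_DIM and a (TILE_DIM, TILE_DIM) thread block. Both the read and the write to global memory are coalesced; the reordering happens in shared memory instead. The kernel name and TILE_DIM value are mine, not from any particular source.

```cuda
#include <cuda_runtime.h>

#define TILE_DIM 32

__global__ void transposeTiled(float *out, const float *in, int width)
{
    // +1 padding avoids shared-memory bank conflicts on the transposed read
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced read: consecutive threads touch consecutive addresses
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Swap the block indices so the write is coalesced as well
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```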
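
On the kernel-fusion point, a hedged sketch of the idea: instead of launching one kernel to scale an array and a second to add an offset, a single fused kernel does both. That saves one launch and, more importantly, a whole read-and-write pass over the data. The kernel names are illustrative.

```cuda
// Unfused: two launches, two reads and two writes per element
__global__ void scaleKernel(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];
}

__global__ void offsetKernel(float *x, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] + b;
}

// Fused: one launch, one read and one write per element
__global__ void scaleAndOffsetKernel(float *x, float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i] + b;
}
```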
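
And for prefetching and latency hiding, a minimal sketch of a grid-stride loop that issues the load for the next iteration before using the current value, so the memory latency overlaps with the arithmetic. expensiveMath is a hypothetical placeholder for whatever per-element work your kernel actually does.

```cuda
__device__ float expensiveMath(float v)
{
    // placeholder compute for the prefetch to overlap with
    return sinf(v) * cosf(v) + sqrtf(fabsf(v));
}

__global__ void processWithPrefetch(float *out, const float *in, int n)
{
    int stride = gridDim.x * blockDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float current = in[i];                // first load
    for (; i < n; i += stride) {
        int next = i + stride;
        float prefetched = 0.0f;
        if (next < n)
            prefetched = in[next];        // issue the next load early

        out[i] = expensiveMath(current);  // compute while that load is in flight
        current = prefetched;
    }
}
```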

Have a look at the Best Practices Guide – it’s pretty good.

Many thanks, I completely forgot this information was in there.

Don’t forget Vasily Volkov’s influential presentations:

Some of the important observations in his presentations are:

    maximize instruction-level parallelism (illustrated in the sketch after this list)

    duplicate/triplicate/etc. the work done per thread

    registers can provide significantly more bandwidth than shared memory, so bias against shared memory when possible (especially on Fermi)

    try to maximize the number of "in-flight" reads from device memory

    if you're using __syncthreads() then, if possible, try to use more blocks of fewer threads rather than a monolithic block of threads
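
To illustrate the first two observations, here is a hedged sketch of a SAXPY-style kernel in which each thread computes ELEMS_PER_THREAD independent elements. The loads are all issued up front, so several can be in flight at once instead of each thread serialising on a single value. saxpyIlp and ELEMS_PER_THREAD are illustrative names, not taken from Volkov's slides.

```cuda
#define ELEMS_PER_THREAD 4

__global__ void saxpyIlp(float *y, const float *x, float a, int n)
{
    int base = blockIdx.x * blockDim.x * ELEMS_PER_THREAD + threadIdx.x;

    float xv[ELEMS_PER_THREAD];
    float yv[ELEMS_PER_THREAD];

    // Independent loads issued together; the threadIdx.x + k * blockDim.x
    // pattern keeps each batch of loads coalesced
    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int i = base + k * blockDim.x;
        if (i < n) { xv[k] = x[i]; yv[k] = y[i]; }
    }

    // Independent multiply-adds the compiler is free to interleave
    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int i = base + k * blockDim.x;
        if (i < n) y[i] = a * xv[k] + yv[k];
    }
}
```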

These tips can definitely affect your GPU development style.

With all of these techniques (and more) in mind, I’ve found that I try to write kernels that will “max out” the capability of one or more memory spaces – either in size or speed – so that the GPU is fully utilized. For example, if you know your kernel is compute-bound and can use as many registers as it can get, then let that be your initial design constraint. Similarly, if your kernel is simply reading > transforming > writing device memory, then make sure you’re hitting the peak bandwidth of your device (a quick way to check this is sketched below).
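
For that last case, a rough sketch of the check, assuming a trivial read > transform > write kernel: time it with CUDA events and compare the achieved GB/s against your device’s theoretical peak. The factor of 2 counts one read and one write per element.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void transform(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i] + 1.0f;   // trivial transform
}

int main()
{
    const int n = 1 << 24;
    const size_t bytes = n * sizeof(float);

    float *in, *out;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    transform<<<(n + 255) / 256, 256>>>(out, in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // one read + one write per element
    double gbps = (2.0 * bytes / 1e9) / (ms / 1e3);
    printf("achieved bandwidth: %.1f GB/s\n", gbps);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```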

Finally, simplify. Small and elegant code is challenging to write but it can wind up being way faster.