Hi, I would like to use all the optimisation techniques possible to gain an understanding of what they do, so I can apply them to applications I build in the future and hopefully obtain near-maximum instruction throughput. Please let me know if I am missing any, many thanks
The techniques I am aware of so far are:
Shared Memory
Using Coalesced Global Memory (combined with shared memory in the sketch after this list)
Prefetching Data
Using large kernels rather than several small kernels to reduce kernel launch overhead
Hiding Global Memory Latency by keeping many global memory accesses in flight
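To make the shared memory and coalescing items concrete, here is a minimal matrix-transpose sketch (my own example, not from the original post; the tile size and names are assumptions): the tile is staged through shared memory so that both the global reads and the global writes stay coalesced.

```cuda
// Sketch only: a 32x32 tile transpose. Both the global read and the global
// write are coalesced (threads in a warp touch consecutive addresses); the
// reordering happens in shared memory instead of in device memory.
#define TILE 32

__global__ void transpose_tiled(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids shared-memory bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x; // column in the input
    int y = blockIdx.y * TILE + threadIdx.y; // row in the input
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;     // column in the output
    y = blockIdx.x * TILE + threadIdx.y;     // row in the output
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```

This assumes a launch with dim3 block(TILE, TILE) and a grid of ceil(width/TILE) x ceil(height/TILE) blocks.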
Some of the important observations in his presentations are:
maximize instruction-level parallelism (see the sketch after this list)
duplicate/triplicate/etc. the work per thread
registers can provide significantly more bandwidth than shared memory, so bias against shared memory when possible (esp. on Fermi)
try to maximize the number of "in-flight" reads from device memory
if you're using __syncthreads() then, if possible, try to use more blocks of fewer threads vs. a monolithic block of threads
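As an illustration of the first four observations (a sketch of my own, not code from the presentations; kernel and parameter names are assumptions), the kernel below has each thread load several independent elements into registers before using any of them, so each thread keeps multiple global reads in flight and needs no shared memory or __syncthreads().

```cuda
// Sketch only: each thread loads ELEMS independent elements into registers
// before using any of them, so several global reads are in flight per thread
// and no shared memory or __syncthreads() is required.
#define ELEMS 4

__global__ void scale_ilp(float *out, const float *in, float alpha, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;   // consecutive threads -> consecutive addresses

    float r[ELEMS];                        // values live in registers, not shared memory
    #pragma unroll
    for (int i = 0; i < ELEMS; ++i) {      // independent, back-to-back reads
        int idx = tid + i * stride;
        r[i] = (idx < n) ? in[idx] : 0.0f;
    }

    #pragma unroll
    for (int i = 0; i < ELEMS; ++i) {      // independent arithmetic on the register values
        int idx = tid + i * stride;
        if (idx < n) out[idx] = alpha * r[i];
    }
}
```

Launched with roughly n / (blockDim.x * ELEMS) blocks, fewer threads cover the same data because each thread does more independent work.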
These tips can definitely affect your GPU development style.
With all of these techniques (and more) in mind, I’ve found that I try to write kernels that will “max out” the capability of one or more memory spaces – either in size or speed – so that the GPU is fully utilized. For example, if you know your kernel is compute-bound and can use as many registers as it can get, then let that be your initial design constraint. Similarly, if your kernel is simply reading, transforming, and writing device memory, then make sure you’re hitting the peak bandwidth of your device (a sketch of such a bandwidth-bound kernel is below).
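For the read, transform, write case, a kernel like the following (my own sketch, with assumed names and a simple linear transform) is essentially a bandwidth test: each element is read once and written once, so dividing the total bytes moved by the kernel time and comparing that against your device's peak bandwidth tells you how close you are.

```cuda
// Sketch only: a purely bandwidth-bound read -> transform -> write kernel.
// Effective bandwidth = (bytes read + bytes written) / elapsed time, which
// can be compared against the peak bandwidth reported for the device.
__global__ void transform(float *out, const float *in, float a, float b, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = a * in[i] + b;   // one coalesced read and one coalesced write per element
}
```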
Finally, simplify. Small and elegant code is challenging to write but it can wind up being way faster.