Kernel calculation optimization: best way to perform low-level calculations

For computationally intensive kernels using CUDA 2.0, what is the best way to handle the low-level calculations? The Visual Profiler doesn’t provide fine enough granularity to optimize specific calculations, so are there any rules of thumb for choosing low-level optimization targets?

For example, when using floats, is it better to use the “*” multiply operator or the __fmul_rn / __fmul_rz intrinsics, given that the goal is to minimize latency rather than to maximize numeric accuracy?

Another case that comes to mind is using complex arithmetic. What is the best way to perform multi-operation transforms such as a complex conjugate multiply?
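To make the operation concrete, here is a minimal sketch of the conjugate multiply conj(a) * b written out by hand. The Complex struct and the conj_mul name are hypothetical and only for illustration (CUDA's built-in float2 would serve equally well), and the macro guard is just there so the same source also builds with a plain host compiler. Whether this form is actually the fastest is exactly the open question.

```cuda
// Allow the file to compile without nvcc (assumption: host-only testing).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical complex type; .re is the real part, .im the imaginary part.
struct Complex { float re, im; };

// conj(a) * b = (a.re - i*a.im) * (b.re + i*b.im)
//             = (a.re*b.re + a.im*b.im) + i*(a.re*b.im - a.im*b.re)
// Four multiplies and two add/subtracts, written with the plain operators.
__host__ __device__ inline Complex conj_mul(Complex a, Complex b)
{
    Complex r;
    r.re = a.re * b.re + a.im * b.im;
    r.im = a.re * b.im - a.im * b.re;
    return r;
}
```

For instance, with a = 1 + 2i and b = 3 + 4i this gives 11 - 2i.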

I just stumbled upon this website…perhaps it will help you out:

http://www.cs.rug.nl/~wladimir/decuda/

Here are a few guidelines:

  1. Keep your code simple. People with experience often say that the simplest code turned out to be the fastest in the end. I also often find myself rewriting kernels for more speed, only to discover that the straightforward version was the fastest.
  2. Coalesce your memory reads and writes.
  3. Avoid bank conflicts in shared memory.
  4. Avoid excessive branching (divergent branching within a warp).
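As a rough illustration of guideline 2, the sketch below contrasts a coalesced copy with a strided one. The kernel names and the stride parameter are made up for this example, and the exact coalescing rules depend on the hardware generation, so treat it as a sketch rather than a benchmark.

```cuda
// Coalesced: consecutive threads access consecutive floats, so the
// accesses of neighbouring threads fall into contiguous memory.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads access floats that are `stride` elements
// apart, so neighbouring threads no longer touch contiguous memory.
// With stride == 1 this is the same access pattern as copy_coalesced.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```

Timing both kernels on the same buffers with a stride larger than 1 is a quick way to see how much an uncoalesced access pattern costs on a given card.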

Once you have written a kernel, the first thing to do is count how much memory it reads and writes, then calculate how many GB/s you are achieving. Compare that with the bandwidth reported by bandwidthTest (device->device). Only if you are not getting close to that number is it important to optimize the calculations. In a LOT of cases performance is bound by memory bandwidth (all of my kernels are, and I have yet to encounter someone who has a computation-bound kernel).
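The arithmetic is simple enough to wrap in a small host-side helper. This is only a sketch: the function name is made up, the byte counts are whatever you counted by hand for your kernel, and the elapsed time is assumed to come from your own timing (CUDA events, for example). It uses 1 GB = 1e9 bytes, so check which convention your reference number uses before comparing.

```cuda
// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a kernel that
// read bytes_read and wrote bytes_written in elapsed_ms milliseconds.
// All three inputs are supplied by the caller; nothing is measured here.
double effective_bandwidth_gbs(double bytes_read,
                               double bytes_written,
                               double elapsed_ms)
{
    double seconds = elapsed_ms * 1.0e-3;
    return (bytes_read + bytes_written) / seconds / 1.0e9;
}
```

For example, a kernel that reads 4e9 bytes and writes 4e9 bytes in 1000 ms is moving about 8 GB/s. If the device->device figure for the card is far above that, the memory access pattern is the first thing to look at.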

Often it is faster to redo calculations if it means you do not have to read from memory…