For computationally intensive kernels using CUDA 2.0, what is the best way to handle the low-level calculations? The visual profiler doesn't provide fine enough granularity to optimize specific calculations, so are there any rules of thumb for choosing low-level optimization targets?
For example, when using floats, is it better to use the "*" multiply operator or the __fmul_[rn,rz] intrinsics, given that the goal is to optimize for latency rather than numeric accuracy?
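To make the comparison concrete, here is a minimal sketch of the two variants I mean (kernel names and the element-wise setup are just for illustration); the open question is whether spelling out the intrinsic ever changes the generated code or latency:

```cuda
// Variant 1: plain operator. The compiler is free to contract a
// following add into a fused multiply-add, reorder, etc.
__global__ void mul_op(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] * b[i];
}

// Variant 2: explicit intrinsic. __fmul_rn forces a single IEEE
// round-to-nearest multiply and prevents FMA contraction.
__global__ void mul_intrinsic(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __fmul_rn(a[i], b[i]);
}
```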
Another case that comes to mind is complex arithmetic. What is the best way to perform multi-operation transforms such as a complex conjugate multiply?