I’m working on an n-body force problem, and I’ve noticed that if I change the distance expression to use a multiplication instead of a division (which gives me the wrong answer), I get roughly a 50–75% improvement in kernel speed. This is surprising given that the kernel contains about 20–30 floating-point operations (all multiplications and additions) and I am merely changing a single division to a multiplication.
a) Is this expected, given the long cycle count of the division operation (Fermi Tesla, CUDA 3.1)?
b) I’ve noticed that I am still nowhere near compute bound (~10% of peak double-precision performance). Does this mean that I’m memory bound? And if I am memory bound, why would there be such a big change when going from division to multiplication?
Thanks in advance.