fast reciprocal in kernel

Hi

How do I calculate a fast reciprocal in a kernel? (Precision is not so important ;)
Is this:

invdet = __fdividef(1.0f, det);

the fastest way?

det is > 0 and < 8192

I’m porting some code to CUDA that uses a lot of _mm_rcp_ss/ps in its calculations.
Those are famous for their lack of precision, and the algorithm deals fine with that,
so maybe there is a faster, less accurate reciprocal than __fdividef?

The kernel is compute bound, not memory bound, so any cycles saved are a win here.

I think if you write:

float inv = 1.0f / det;

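the compiler is smart enough to use the hardware reciprocal, at least when approximate division is enabled (for example with -use_fast_math). By default on newer architectures it may emit a slower, IEEE-compliant division instead. You can check the PTX output to be sure.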
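If you want to compare the options side by side, here is a rough sketch rather than a tuned implementation. It computes 1/det three ways: plain division, __fdividef, and the PTX instruction rcp.approx.ftz.f32 through inline assembly. The names rcp_approx, reciprocal_variants and max_rel_error_of_rcp are made up for this example, and the sample values simply cover the stated range (0, 8192).

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: approximate reciprocal via inline PTX.
__device__ __forceinline__ float rcp_approx(float x)
{
    float r;
    asm("rcp.approx.ftz.f32 %0, %1;" : "=f"(r) : "f"(x));
    return r;
}

// Writes three variants of 1/det into out, laid out as
// [plain division | __fdividef | rcp_approx], n values each.
__global__ void reciprocal_variants(const float* det, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float d = det[i];
    out[i]         = 1.0f / d;              // code generated here depends on compiler flags
    out[n + i]     = __fdividef(1.0f, d);   // fast approximate division intrinsic
    out[2 * n + i] = rcp_approx(d);         // explicit approximate reciprocal
}

// Hypothetical host helper: runs the kernel on n samples in (0, 8192) and
// returns the maximum relative error of rcp_approx against a double-precision
// CPU reference. Returns -1.0f if any CUDA call fails.
float max_rel_error_of_rcp(int n)
{
    assert(n > 0);
    std::vector<float> h_det(n), h_out(3 * static_cast<size_t>(n));
    for (int i = 0; i < n; ++i)
        h_det[i] = 8192.0f * (i + 1) / (n + 1);

    float* d_det = nullptr;
    float* d_out = nullptr;
    size_t in_bytes  = n * sizeof(float);
    size_t out_bytes = 3 * in_bytes;
    if (cudaMalloc(&d_det, in_bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, out_bytes) != cudaSuccess) { cudaFree(d_det); return -1.0f; }

    bool ok = cudaMemcpy(d_det, h_det.data(), in_bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        reciprocal_variants<<<blocks, threads>>>(d_det, d_out, n);
        // The blocking copy also waits for the kernel to finish.
        ok = cudaMemcpy(h_out.data(), d_out, out_bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_det);
    cudaFree(d_out);
    if (!ok) return -1.0f;

    double max_err = 0.0;
    for (int i = 0; i < n; ++i) {
        double ref = 1.0 / static_cast<double>(h_det[i]);
        double err = std::fabs(h_out[2 * static_cast<size_t>(n) + i] - ref) / ref;
        if (err > max_err) max_err = err;
    }
    return static_cast<float>(max_err);
}
```

To see what each line really turns into, compile with nvcc -ptx (with and without -use_fast_math) and look for rcp.approx or div instructions in the output. Whether any variant is actually faster in your kernel is something only a profile on your hardware can tell you.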