Hi
How can I calculate a fast reciprocal in a kernel? (Precision is not so important ;)
Is this the fastest way?
invdet = __fdividef(1.0f, det);
det is always > 0 and < 8192.
I'm porting some code to CUDA that uses a lot of _mm_rcp_ss/ps in its calculations.
Those intrinsics are well known for their lack of precision, and the algorithm deals fine with that,
so maybe there is a faster and less accurate reciprocal than __fdividef?
The kernel is compute bound, not memory bound, so any cycles saved are a win here.
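
To make the comparison concrete, here is a minimal sketch of two candidates side by side: `__fdividef(1.0f, det)` and the PTX `rcp.approx.ftz.f32` instruction reached through inline assembly. The names `rcp_approx`, `reciprocal_variants` and `max_relative_error` are hypothetical helpers made up for this example, and the test range simply mirrors the 0 < det < 8192 constraint above. It only checks accuracy. Whether either variant is actually faster on a given GPU is not something this sketch shows, and the compiler may well emit the same instruction for both, so it would need to be profiled.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper: approximate reciprocal via the PTX rcp.approx.ftz.f32 instruction.
__device__ __forceinline__ float rcp_approx(float x)
{
    float r;
    asm("rcp.approx.ftz.f32 %0, %1;" : "=f"(r) : "f"(x));
    return r;
}

// Computes 1/det with both candidates so their results can be compared on the host.
__global__ void reciprocal_variants(const float* det, float* out_div, float* out_rcp, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out_div[i] = __fdividef(1.0f, det[i]);  // fast approximate division intrinsic
        out_rcp[i] = rcp_approx(det[i]);        // approximate reciprocal instruction
    }
}

// Host helper (error checking omitted for brevity): returns the worst relative
// error of either variant over n samples spread across (0, 8192).
float max_relative_error(int n)
{
    std::vector<float> h_det(n), h_div(n), h_rcp(n);
    for (int i = 0; i < n; ++i)
        h_det[i] = 8192.0f * (i + 0.5f) / n;    // strictly inside (0, 8192)

    size_t bytes = n * sizeof(float);
    float *d_det, *d_div, *d_rcp;
    cudaMalloc(&d_det, bytes);
    cudaMalloc(&d_div, bytes);
    cudaMalloc(&d_rcp, bytes);
    cudaMemcpy(d_det, h_det.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    reciprocal_variants<<<blocks, threads>>>(d_det, d_div, d_rcp, n);

    // These blocking copies also wait for the kernel to finish.
    cudaMemcpy(h_div.data(), d_div, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_rcp.data(), d_rcp, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_det);
    cudaFree(d_div);
    cudaFree(d_rcp);

    double worst = 0.0;
    for (int i = 0; i < n; ++i) {
        double exact = 1.0 / static_cast<double>(h_det[i]);
        worst = std::max(worst, std::fabs(h_div[i] - exact) / exact);
        worst = std::max(worst, std::fabs(h_rcp[i] - exact) / exact);
    }
    return static_cast<float>(worst);
}
```

Calling `max_relative_error(1 << 16)` from a host `main` gives a rough idea of how far both variants are from the exact reciprocal over this input range. Looking at the generated code (for example with `cuobjdump -sass`) would show whether the two variants compile to different instructions at all.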