I’m using rcp.approx to implement 32-bit integer quotient and remainder computations that are faster than the PTX integer div and rem instructions. On all the architectures on which I’ve tested rcp.approx (GTX 980, GTX 1080, P100 and V100), it has the same properties I rely on to establish the correctness of my code, and I’m wondering whether NVIDIA guarantees any level of consistency across architectures for PTX instructions. Specifically, can I expect rcp.approx to return the same result for every integer-valued input from 1.0 to 2^22, regardless of the architecture it runs on? If not, is there anything stronger than the accuracy statement published in the PTX manual, or is that all I can rely on?
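For readers unfamiliar with the technique, here is a minimal sketch of reciprocal-based unsigned division. It is not the poster’s code, and the names rcp_approx, fast_udivmod, udivmod_kernel and gpu_udivmod are hypothetical. It assumes divisors in [1, 2^22], so that the divisor converts to float exactly, and calls rcp.approx.ftz.f32 through inline PTX. It forms a quotient estimate, refines it once using the remainder, and then finishes with integer correction loops. Those loops make the result exact no matter how accurate the reciprocal is; dropping such correction steps for speed is what makes correctness depend on rcp.approx returning specific values.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Fast approximate reciprocal via inline PTX (rcp.approx.ftz.f32).
__device__ float rcp_approx(float x) {
    float r;
    asm("rcp.approx.ftz.f32 %0, %1;" : "=f"(r) : "f"(x));
    return r;
}

// Hypothetical helper: q = a / d, r = a % d, assuming 1 <= d <= 2^22
// so that (float)d is exact. The final loops guarantee 0 <= r < d even
// if the reciprocal is slightly off.
__device__ void fast_udivmod(uint32_t a, uint32_t d, uint32_t &q, uint32_t &r) {
    float rd = rcp_approx((float)d);
    // First estimate: (float)a rounds for a >= 2^24, and __float2uint_rz
    // clamps results that land outside the uint32_t range.
    uint32_t q0 = __float2uint_rz((float)a * rd);
    long long rem = (long long)a - (long long)q0 * d;
    // Second estimate from the (possibly negative) remainder.
    long long dq = __float2ll_rz((float)rem * rd);
    long long quot = (long long)q0 + dq;
    rem -= dq * (long long)d;
    // Integer fix-up so that the remainder ends up in [0, d).
    while (rem < 0) { rem += d; --quot; }
    while (rem >= (long long)d) { rem -= d; ++quot; }
    q = (uint32_t)quot;
    r = (uint32_t)rem;
}

// Hypothetical single-thread kernel and host wrapper for one (a, d) pair.
__global__ void udivmod_kernel(uint32_t a, uint32_t d, uint32_t *out) {
    fast_udivmod(a, d, out[0], out[1]);
}

void gpu_udivmod(uint32_t a, uint32_t d, uint32_t *q, uint32_t *r) {
    uint32_t *dev_out = nullptr;
    uint32_t host_out[2] = {0, 0};
    cudaMalloc(&dev_out, sizeof(host_out));
    udivmod_kernel<<<1, 1>>>(a, d, dev_out);
    cudaMemcpy(host_out, dev_out, sizeof(host_out), cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    *q = host_out[0];
    *r = host_out[1];
}
```

The sketch keeps the remainder in 64 bits for clarity rather than speed; a tuned version would likely stay in 32-bit arithmetic and bound the number of correction steps instead of looping.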
You can only rely on the guarantees NVIDIA is willing to make, i.e. whatever is published in the documentation (and documentation can have bugs, too). In particular, there are no guarantees of bitwise identical results across GPU architectures from the MUFU instructions that rcp.approx is implemented with.
In practical terms, the special function unit doesn’t seem to have functionally changed for the past dozen years, other than that MUFU.SQRT was added in Pascal. Any changes to results returned by MUFU instructions are unlikely at this point.
If your code relies on stronger properties than NVIDIA guarantees in its documentation, you could add a validation step to the initialization of your software component to ensure that any future GPU architecture meets those requirements. An exhaustive test of all 2^22 cases takes very little time.
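A validation step along these lines might look like the following sketch. It evaluates rcp.approx.ftz.f32 on the GPU for every integer in [1, 2^22], one thread per input, and counts the inputs that violate a property. The property shown (relative error at most 2^-22, checked in double precision) is a hypothetical placeholder, not a statement from the PTX manual; property_holds, check_rcp_kernel and count_rcp_violations are likewise hypothetical names. property_holds should be replaced by whatever properties the division code actually depends on.

```cuda
#include <cmath>
#include <cstdint>
#include <cuda_runtime.h>

// Fast approximate reciprocal via inline PTX (rcp.approx.ftz.f32).
__device__ float rcp_approx(float x) {
    float r;
    asm("rcp.approx.ftz.f32 %0, %1;" : "=f"(r) : "f"(x));
    return r;
}

// Hypothetical placeholder property: r approximates 1/x with a relative
// error of at most 2^-22. r * x is exact in double for x <= 2^22.
__device__ bool property_holds(uint32_t x, float r) {
    double rel_err = fabs((double)r * (double)x - 1.0);
    return rel_err <= 1.0 / 4194304.0;  // 2^-22
}

// One thread per input x in [lo, hi]; violations are counted atomically.
__global__ void check_rcp_kernel(uint32_t lo, uint32_t hi, unsigned int *violations) {
    uint32_t x = lo + blockIdx.x * blockDim.x + threadIdx.x;
    if (x > hi) return;
    if (!property_holds(x, rcp_approx((float)x))) {
        atomicAdd(violations, 1u);
    }
}

// Hypothetical init-time check: returns the number of inputs in [1, 2^22]
// that violate the property, or -1 if a CUDA call fails.
long long count_rcp_violations() {
    const uint32_t lo = 1, hi = 1u << 22;
    const uint32_t n = hi - lo + 1;
    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;

    unsigned int *dev_count = nullptr;
    unsigned int host_count = 0;
    if (cudaMalloc(&dev_count, sizeof(unsigned int)) != cudaSuccess) return -1;
    cudaMemset(dev_count, 0, sizeof(unsigned int));
    check_rcp_kernel<<<blocks, threads>>>(lo, hi, dev_count);
    cudaError_t launch_err = cudaGetLastError();
    cudaError_t copy_err = cudaMemcpy(&host_count, dev_count, sizeof(unsigned int),
                                      cudaMemcpyDeviceToHost);
    cudaFree(dev_count);
    if (launch_err != cudaSuccess || copy_err != cudaSuccess) return -1;
    return (long long)host_count;
}
```

One way to use it is to run the check once at initialization (or installation) and fall back to ordinary integer division whenever it returns anything other than zero, so that an architecture behaving differently degrades performance rather than correctness.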
Thank you for your response. I have been treating this issue in the way you suggest, and currently have a validation step in the installation phase of the software.
I posted this question hoping that someone could point me to a statement on rcp.approx somewhere in the documentation that is stronger than the single line on absolute accuracy in the PTX manual, but I suspect there isn’t one.