It's SASS, so it's not well documented. However, you can get some insight by studying the corresponding PTX instruction. This may be of interest.