- Why does case 1 have Stall Tex Throttle listed as a stall reason (3rd place in the list of stalls), while case 2 and 3 don’t? I am not using/defining any texture memory. Does F2F use texture memory by any chance?
On CC 7.5 (Turing) and consumer-focused GPUs, F2F.F64.F32 is issued to a reduced-throughput FP64 unit that is shared by all 4 warp schedulers. The FP64 unit and its register write-back share the same data path as the texture unit, which is why the stall is reported as Tex Throttle even though the kernel uses no texture memory.
- Why does case 2 not have Stall Tex Throttle? If conversions from one datatype to another use Tex memory, shouldn’t I2F cause a Tex throttle too?
On most GPUs, I2F and F2F (32-bit float only) are implemented in the XU/SFU (special function unit), which is per warp scheduler but issued through MIO. These instructions will have a stall reason of mio_throttle, and instructions dependent on the result will have short_scoreboard.
The XU/SFU throughput is much higher than the FP64 throughput on CC 7.5, so you are likely to see a lot more stalls in the F2F.FP64 case than in the other 2 cases.
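To see which conversion instruction each case produces, a minimal sketch like the one below can help. It assumes case 1 is a float-to-double conversion and case 2 an int-to-float conversion; the kernel and helper names (`float_to_double`, `int_to_float`, `run_conversions`) are hypothetical and not taken from the original code. The SASS actually emitted depends on the compiler version and target architecture, so it is worth checking rather than assuming.

```cuda
#include <cuda_runtime.h>

// Assumed "case 1": float -> double. On sm_75 this is expected (not guaranteed)
// to compile to F2F.F64.F32, which goes to the shared FP64 unit.
__global__ void float_to_double(const float* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = static_cast<double>(in[i]);
}

// Assumed "case 2": int -> float. This is expected (not guaranteed) to compile
// to I2F, which is handled by the XU/SFU and issued through MIO.
__global__ void int_to_float(const int* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = static_cast<float>(in[i]);
}

// Hypothetical host helper: runs both kernels and checks the results on the host.
// Returns true only if every CUDA call succeeded and all values match.
bool run_conversions(int n) {
    float*  f_in  = nullptr;
    double* d_out = nullptr;
    int*    i_in  = nullptr;
    float*  f_out = nullptr;

    // Managed memory keeps the sketch short; explicit copies would work as well.
    bool ok = cudaMallocManaged(&f_in,  n * sizeof(float))  == cudaSuccess
           && cudaMallocManaged(&d_out, n * sizeof(double)) == cudaSuccess
           && cudaMallocManaged(&i_in,  n * sizeof(int))    == cudaSuccess
           && cudaMallocManaged(&f_out, n * sizeof(float))  == cudaSuccess;

    if (ok) {
        for (int i = 0; i < n; ++i) {
            f_in[i] = 0.5f * static_cast<float>(i);
            i_in[i] = i;
        }

        const int threads = 256;
        const int blocks  = (n + threads - 1) / threads;
        float_to_double<<<blocks, threads>>>(f_in, d_out, n);
        int_to_float<<<blocks, threads>>>(i_in, f_out, n);

        ok = cudaGetLastError() == cudaSuccess
          && cudaDeviceSynchronize() == cudaSuccess;

        // Compare against the same conversions done on the host.
        for (int i = 0; ok && i < n; ++i) {
            ok = d_out[i] == static_cast<double>(f_in[i])
              && f_out[i] == static_cast<float>(i_in[i]);
        }
    }

    // cudaFree accepts null pointers, so this is safe after a partial failure.
    cudaFree(f_in);
    cudaFree(d_out);
    cudaFree(i_in);
    cudaFree(f_out);
    return ok;
}
```

Compiling with `nvcc -arch=sm_75 -cubin -o conv.cubin conv.cu` and dumping the result with `cuobjdump -sass conv.cubin` should show whether the two kernels contain F2F.F64.F32 and I2F respectively. Profiling each kernel separately then makes it easier to attribute tex_throttle to the first and mio_throttle / short_scoreboard to the second.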