I see. While searching earlier, I found a previous response of yours that sounds a bit conflicting with this.
In that case, for the ALUs that perform FP32 (aka the “CUDA cores”): gaming Ampere and later do “technically” have 32 FP32-capable units per SMSP. However, it's still not fully clear to me, since that figure applies to Ampere and not Turing. Turing has only 16 of those units per SMSP (64 across the entire SM), whereas gaming Ampere and later have 32 per SMSP (128 per SM).
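
Just to spell out the arithmetic I'm working from, here is a quick sketch. It assumes 4 SMSPs per SM, which is what the 16 → 64 and 32 → 128 figures imply; the names (`SMSPS_PER_SM`, `fp32_per_sm`) are made up for illustration and are not from any NVIDIA API.

```python
# Assumption: 4 SM sub-partitions (SMSPs) per SM, as implied by
# 16 per SMSP -> 64 per SM and 32 per SMSP -> 128 per SM.
SMSPS_PER_SM = 4

# FP32-capable units per SMSP, as quoted in this post.
FP32_PER_SMSP = {
    "Turing": 16,
    "Ampere (gaming)": 32,
}


def fp32_per_sm(arch: str) -> int:
    """Return the FP32-capable unit count for a whole SM."""
    return FP32_PER_SMSP[arch] * SMSPS_PER_SM


for arch, per_smsp in FP32_PER_SMSP.items():
    print(f"{arch}: {per_smsp} per SMSP, {fp32_per_sm(arch)} per SM")
```

This only counts units that are capable of FP32; it says nothing about whether all of them can issue FP32 in the same cycle, which is the part I'm unsure about.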