I’m not sure why it sounds conflicting. The Ampere whitepaper indicates the presence of 2 datapaths for each SMSP, where each datapath has 16 FP32 units.
If we consider things at the datapath level (I have already indicated that I suspect this is the proper view, but I cannot confirm it with documentation), then it should be clear that in a single clock cycle, a given warp instruction will be issued either to the FP32 datapath or to the combined FP32/INT32 datapath, but not to both.
Given that, the behavior is roughly consistent with the other post you linked, where I stated that if the SMSP does not have 32 units, but instead has 16, then it will require 2 clocks to fully issue the instruction. So I think this datapath view has to be considered.
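To make the arithmetic behind that statement concrete, here is a minimal sketch. The function name `issue_cycles` is my own, purely for illustration, and it assumes that a warp instruction is simply spread over however many lanes the datapath has, which is my working assumption and not something I can point to in documentation.

```python
WARP_SIZE = 32  # threads per warp


def issue_cycles(lanes: int, warp_size: int = WARP_SIZE) -> int:
    """Clocks needed to issue one warp-wide instruction to a datapath
    that has `lanes` functional units (assumed model: ceil(warp / lanes))."""
    if lanes <= 0:
        raise ValueError("lanes must be positive")
    return -(-warp_size // lanes)  # integer ceiling division


print(issue_cycles(32))  # 1 clock with 32 units
print(issue_cycles(16))  # 2 clocks with 16 units
```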
It is possible I am wrong. One of the reasons I don’t think I am wrong is that the two datapaths have somewhat different capabilities. One is FP32 only, while the other can handle either FP32 or INT32. Clearly, INT32 is a different instruction, and since I am also fairly convinced that the warp schedulers are not dual-issue capable, there is no way to issue to both datapaths in the same clock cycle if you are issuing INT32 mixed with FP32. Therefore I conclude that the numbers presented are (mostly) throughput numbers. To get full throughput in the mixed case, I suspect you would have to issue alternating INT32 and FP32 instructions. In the FP32-only case, you would still get full throughput by issuing back-to-back FP32 instructions, alternating cycle-by-cycle between the two datapaths. The only quibble would be whether a sequence requires one extra clock due to alternation of issue vs. split issue. I find such considerations to be outside the realm of anything I care about or can describe. I doubt they matter from an actual performance perspective.
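Here is a toy model of the reasoning above, in case it helps. Everything in it is an assumption on my part, not documented hardware behavior: one instruction issued per clock at most, in order, from a single stream; each datapath has 16 lanes and is therefore busy for 2 clocks per warp instruction; FP32 may go to either datapath, INT32 only to the combined one. The names (`simulate`, `free_fp`, `free_mixed`) are made up for the sketch.

```python
WARP_SIZE = 32
LANES_PER_DATAPATH = 16
# Clocks a datapath stays busy with one warp instruction (assumed: 2).
CYCLES_PER_INSTR = WARP_SIZE // LANES_PER_DATAPATH


def simulate(stream):
    """Return the clock at which both datapaths are idle again after
    issuing `stream` (a list of "FP32"/"INT32") in order, one per clock."""
    free_fp = 0     # clock at which the FP32-only datapath is next free
    free_mixed = 0  # clock at which the FP32/INT32 datapath is next free
    next_issue = 0  # earliest clock the scheduler can issue again
    for op in stream:
        if op not in ("FP32", "INT32"):
            raise ValueError(f"unknown op: {op}")
        # INT32 must use the combined datapath; FP32 takes whichever
        # datapath frees up first, preferring the FP32-only one on a tie.
        if op == "INT32" or free_mixed < free_fp:
            start = max(next_issue, free_mixed)
            free_mixed = start + CYCLES_PER_INSTR
        else:
            start = max(next_issue, free_fp)
            free_fp = start + CYCLES_PER_INSTR
        next_issue = start + 1  # single issue: one instruction per clock
    return max(free_fp, free_mixed)


print(simulate(["FP32"] * 8))           # 9: about one instruction per clock
print(simulate(["FP32", "INT32"] * 4))  # 9: alternating keeps both paths busy
print(simulate(["INT32"] * 8))          # 16: only one datapath is usable
```

In this model the FP32-only and the alternating FP32/INT32 streams both sustain roughly one warp instruction per clock (the extra clock is just the tail of the last instruction), while an INT32-only stream runs at half that rate. A real scheduler would pick from other warps rather than stall, so treat this as an illustration of the throughput argument and nothing more.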
Anyhow, I don’t know of detailed published descriptions of scheduler behavior at this level. You’re welcome to ask questions about it, of course, but from a programmer’s perspective, in my opinion such investigation is mostly irrelevant. You don’t control detailed SASS instruction scheduling, and the machine is largely a throughput machine anyway. It is designed to give good performance even without the programmer attempting, or being able, to do instruction-by-instruction scheduling. NVIDIA generally does not provide tools to give the programmer this level of control.