I've recently been learning the A100 and H100 architectures, and I have some questions about CUDA core FLOPS performance:
Referring to the H100 whitepaper, I noticed that the non-tensor FP16 TFLOPS of the CUDA cores is 4x the FP32 TFLOPS on A100, but only 2x on H100.
What's more, FMA/clk per SM is also 4x for FP16 vs. FP32 on Ampere (A100). Can I assume that a CUDA core (FP32 unit) can compute 4 FP16 values per clock on A100, but only 2 FP16 values per clock on H100?
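Here is how I worked the ratios back from the whitepaper peak numbers, as a sanity check. The SM counts and boost clocks below are my assumptions taken from the A100/H100 SXM whitepaper spec tables (108 SMs at ~1.41 GHz, 132 SMs at ~1.98 GHz), and an FMA is counted as 2 FLOPs:

```python
# Sketch: reconstruct the whitepaper peak (non-Tensor-Core) TFLOPS numbers.
# Assumed specs: A100 SXM = 108 SMs @ ~1.41 GHz boost, 64 FP32 FMA/clk/SM;
#                H100 SXM = 132 SMs @ ~1.98 GHz boost, 128 FP32 FMA/clk/SM.

def peak_tflops(sms, fma_per_clk_per_sm, clock_ghz):
    # 2 FLOPs per FMA (one multiply + one add)
    return sms * fma_per_clk_per_sm * 2 * clock_ghz / 1000

a100_fp32 = peak_tflops(108, 64, 1.41)       # ~19.5 TFLOPS
a100_fp16 = peak_tflops(108, 64 * 4, 1.41)   # ~78 TFLOPS (4x ratio)

h100_fp32 = peak_tflops(132, 128, 1.98)      # ~67 TFLOPS
h100_fp16 = peak_tflops(132, 128 * 2, 1.98)  # ~134 TFLOPS (2x ratio)

print(f"A100: FP32 {a100_fp32:.1f}, FP16 {a100_fp16:.1f} TFLOPS")
print(f"H100: FP32 {h100_fp32:.1f}, FP16 {h100_fp16:.1f} TFLOPS")
```

The computed values match the whitepaper tables, so the 4x/2x ratios do come directly from the per-SM FMA rates.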
On sm_80, an SM has 64 FP32 units and 4 warp schedulers, so each scheduler only has 16 FP32 units available to a warp (32 threads). If I launch a kernel doing FP32 arithmetic, only half a warp (16 threads) can execute in parallel per clock, so a warp's FP32 instruction takes 2 clocks to issue. Is that right?
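If I've understood correctly, the arithmetic behind that claim is just this (unit counts assumed from the A100 whitepaper):

```python
# Issue-rate sketch for sm_80 (A100), assuming 64 FP32 units and
# 4 warp schedulers per SM, as in the whitepaper.
fp32_units_per_sm = 64
schedulers_per_sm = 4
warp_size = 32

lanes_per_scheduler = fp32_units_per_sm // schedulers_per_sm  # 16 FP32 lanes
clks_per_warp_instr = warp_size // lanes_per_scheduler        # 2 clocks/warp

print(lanes_per_scheduler, clks_per_warp_instr)
```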
So on sm_80, an SM produces 256 results per clock for FP16 addition, but only 64 for FP32 addition.
On sm_90, it is 256 for FP16 and 128 for FP32.