How do CUDA cores compute FP16 data on different NVIDIA architectures?

I am currently learning the A100 and H100 architectures, and I have some questions about CUDA core FLOPS performance:

  1. Referring to the H100 white paper, I notice that the FP16 TFLOPS of the CUDA cores is 4x the FP32 TFLOPS on A100, but only 2x on H100.
    What's more, the FMA/clk per SM ratio of FP16 to FP32 is also 4x on Ampere (A100). Can I assume that a CUDA core (FP32 unit) can compute 4 FP16 values per clock on A100, but only 2 FP16 values per clock on H100? (A minimal half2 sketch follows this list.)

  2. On sm_80, an SM has 64 FP32 units and 4 warp schedulers, so a warp (32 threads) is only allocated 16 FP32 units. If I launch a kernel that does some FP32 computation, only half a warp (16 threads) can execute in parallel in one clock. Is that right?
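For context on question 1, here is a minimal sketch (kernel and buffer names are just illustrative) of the packed half2 path: the peak FP16 CUDA-core rates quoted in these documents assume half2 instructions such as HFMA2, where each instruction processes two FP16 values per thread.

```cpp
// Minimal sketch (illustrative names), assuming CUDA 11+ and sm_80/sm_90.
// Peak FP16 rates are reached with packed half2 math: one HFMA2 instruction
// performs two FP16 multiply-adds per thread.
#include <cuda_fp16.h>

__global__ void fp16x2_axpy(const __half2* __restrict__ x,
                            __half2* __restrict__ y,
                            __half2 a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Compiles to HFMA2: 2 FP16 FMAs = 4 FLOPs per thread per instruction.
        y[i] = __hfma2(a, x[i], y[i]);
    }
}
```

Whether the hardware then retires 4 or 2 FP16 results per FP32-unit clock is exactly the A100 vs. H100 ratio being asked about.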

Table 4 in the programming guide lists the throughput of arithmetic instructions for all architectures.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions-throughput-native-arithmetic-instructions

On sm_80, we have 256 results per clock cycle per SM for FP16 addition, and 64 results for FP32 addition.
On sm_90, it is 256 for FP16 and 128 for FP32.
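As a back-of-envelope sketch (not from that table itself), those per-SM rates can be turned into peak non-Tensor TFLOPS; the SM counts and boost clocks below are assumptions taken from the public specs (A100 SXM: 108 SMs at ~1.41 GHz, H100 SXM: 132 SMs at ~1.98 GHz), and FMA is assumed to issue at the same per-SM rate as add, as listed in the table.

```cpp
// Back-of-envelope sketch: per-SM FMA rate -> peak TFLOPS.
#include <cstdio>

double peak_tflops(int fma_per_clk_per_sm, int num_sms, double clock_ghz)
{
    // Each FMA counts as 2 FLOPs (multiply + add).
    return fma_per_clk_per_sm * 2.0 * num_sms * clock_ghz / 1000.0;
}

int main()
{
    printf("A100 FP32: %.1f TFLOPS\n", peak_tflops(64,  108, 1.41)); // ~19.5
    printf("A100 FP16: %.1f TFLOPS\n", peak_tflops(256, 108, 1.41)); // ~78.0
    printf("H100 FP32: %.1f TFLOPS\n", peak_tflops(128, 132, 1.98)); // ~66.9
    printf("H100 FP16: %.1f TFLOPS\n", peak_tflops(256, 132, 1.98)); // ~133.8
    return 0;
}
```

These land close to the data-sheet figures, which is the 4x vs. 2x FP16 : FP32 ratio from question 1.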


But that table does not distinguish between CUDA cores and Tensor Cores…?

And it seems to cover add and other simple operations, but not MMA?

Tensor core throughput is not included in that table.


Oh, thank you for your reply! Do you happen to know the CUDA core / Tensor Core TFLOPS of earlier architectures like V100?

Tensor Core FLOPS are listed in the data sheets.


Well, I did find Ampere's and Hopper's data, but not V100's…

"Tesla V100’s Tensor Cores deliver up to 125 Tensor TFLOPS for training and inference
applications. "