I've recently been learning the A100 and H100 architectures, and I have some questions about CUDA core FLOPS performance:
Referring to the H100 whitepaper, I noticed that the non-tensor FP16 TFLOPS of the CUDA cores is 4x the FP32 TFLOPS on A100, but only 2x on H100.
What's more, FMA/clk per SM is also 4x for FP16 vs. FP32 on Ampere (A100). Can I assume that a CUDA core (FP32 unit) can compute 4 FP16 values per clock on A100, but only 2 FP16 values per clock on H100?
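Here is how I worked the ratios back from the whitepaper peak numbers, as a sanity check. The SM counts and boost clocks below are my assumptions taken from the A100/H100 SXM whitepaper spec tables (108 SMs at ~1.41 GHz, 132 SMs at ~1.98 GHz), and an FMA is counted as 2 FLOPs:

```python
# Sketch: reconstruct the whitepaper peak (non-Tensor-Core) TFLOPS numbers.
# Assumed specs: A100 SXM = 108 SMs @ ~1.41 GHz boost, 64 FP32 FMA/clk/SM;
#                H100 SXM = 132 SMs @ ~1.98 GHz boost, 128 FP32 FMA/clk/SM.

def peak_tflops(sms, fma_per_clk_per_sm, clock_ghz):
    # 2 FLOPs per FMA (one multiply + one add)
    return sms * fma_per_clk_per_sm * 2 * clock_ghz / 1000

a100_fp32 = peak_tflops(108, 64, 1.41)       # ~19.5 TFLOPS
a100_fp16 = peak_tflops(108, 64 * 4, 1.41)   # ~78 TFLOPS (4x ratio)

h100_fp32 = peak_tflops(132, 128, 1.98)      # ~67 TFLOPS
h100_fp16 = peak_tflops(132, 128 * 2, 1.98)  # ~134 TFLOPS (2x ratio)

print(f"A100: FP32 {a100_fp32:.1f}, FP16 {a100_fp16:.1f} TFLOPS")
print(f"H100: FP32 {h100_fp32:.1f}, FP16 {h100_fp16:.1f} TFLOPS")
```

The computed values match the whitepaper tables, so the 4x/2x ratios do come directly from the per-SM FMA rates.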
On sm_80, an SM has 64 FP32 units and 4 warp schedulers, so each scheduler only has 16 FP32 units available to a warp (32 threads). If I launch a kernel doing FP32 arithmetic, only half a warp (16 threads) can execute in parallel per clock, so a warp's FP32 instruction takes 2 clocks to issue. Is that right?
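If I've understood correctly, the arithmetic behind that claim is just this (unit counts assumed from the A100 whitepaper):

```python
# Issue-rate sketch for sm_80 (A100), assuming 64 FP32 units and
# 4 warp schedulers per SM, as in the whitepaper.
fp32_units_per_sm = 64
schedulers_per_sm = 4
warp_size = 32

lanes_per_scheduler = fp32_units_per_sm // schedulers_per_sm  # 16 FP32 lanes
clks_per_warp_instr = warp_size // lanes_per_scheduler        # 2 clocks/warp

print(lanes_per_scheduler, clks_per_warp_instr)
```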
So on sm_80, an SM produces 256 results per clock for FP16 addition, but only 64 for FP32 addition.
On sm_90, it is 256 for FP16 and 128 for FP32.