What is the TFLOPS for CUDA/Tensor Cores with FP16 on V100?

202476410arsmart · November 25, 2024, 8:03am

I couldn’t find the FP16 TFLOPS value for CUDA Cores on the V100 on NVIDIA’s official website. I could only find figures for the H100 and A100. Could anyone kindly provide it?

202476410arsmart · November 25, 2024, 8:04am

For example here:
https://www.nvidia.com/en-us/data-center/h100/

It does not show FP16 (half) throughput for CUDA cores.

Robert_Crovella · November 25, 2024, 3:09pm

Non-tensor FP16 throughput should be double the FP32 throughput on cc 7.0 when operating on the half2 type, for add, multiply, and multiply-add. This is based on the published per-SM throughput.

Curefab · November 25, 2024, 5:15pm

Volta can do 64 FMAs (16-bit) per Tensor Core per cycle.
1 FMA counts as 2 FLOPs (multiply + add).
Volta has 8 Tensor Cores per SM (later generations, beginning with Ampere, are fixed at 4 Tensor Cores per SM).
V100 has 80 SMs enabled (the full GV100 die has 84) and,
depending on the sub-model, a boost clock frequency between 1290 and 1455 MHz (alternatively use the base clock, depending on what you want the number for).
Just multiply these together.
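
The multiplication above can be sketched in a few lines of Python. The function name is made up for illustration, and the choice of 80 SMs with a 1455 MHz clock is an assumption taken from the ranges mentioned in this post, not a datasheet value; substitute the SM count and clock of your specific board.

```python
def tensor_core_peak_tflops(fma_per_tc_per_cycle, tc_per_sm, num_sms, clock_mhz):
    """Theoretical peak in TFLOPS, counting one FMA as 2 FLOPs."""
    flops_per_cycle = fma_per_tc_per_cycle * 2 * tc_per_sm * num_sms
    return flops_per_cycle * clock_mhz * 1e6 / 1e12


# Assumed inputs: 64 FMAs per Tensor Core per cycle and 8 Tensor Cores per SM
# (from the post above); 80 SMs and 1455 MHz are picked for illustration only.
v100_tc = tensor_core_peak_tflops(64, 8, 80, 1455)
print(f"{v100_tc:.1f} TFLOPS")
```

With these inputs the result is roughly 119 TFLOPS. This is a theoretical peak at the assumed clock; sustained throughput in real kernels will be lower.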

Robert_Crovella · November 25, 2024, 5:38pm

The V100 datasheet lists the FP16 Tensor Core performance.

202476410arsmart · November 26, 2024, 7:12am

Thanks! So can I put it this way: FP16 throughput is twice the FP32 throughput, so FP16 on CUDA cores is about 30 TFLOPS and FP16 on Tensor Cores is about 120 TFLOPS (for V100)?

Curefab · November 26, 2024, 11:53am

Yes, on V100 (compute capability 7.0) 16-bit arithmetic has twice the throughput of 32-bit arithmetic; see the CUDA C++ Programming Guide (chapter Arithmetic Instructions). Which operations a set of computation cores supports varies by architecture: it may handle a single bit width (e.g. 16, 32, or 64 bits) or several, and integer only, floating point only, or both.
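
As a rough check of the "about 30 TFLOPS" figure, here is a minimal Python sketch of the same kind of calculation for CUDA cores. The helper name is hypothetical. The per-SM rates (64 FP32 FMAs per cycle, and 128 FP16 FMAs per cycle when using half2) are my reading of the throughput table for cc 7.0 and should be treated as assumptions, as should the 80 SMs and 1455 MHz clock.

```python
def cuda_core_peak_tflops(fma_per_sm_per_cycle, num_sms, clock_mhz):
    """Theoretical peak in TFLOPS, counting one FMA as 2 FLOPs."""
    flops_per_cycle = fma_per_sm_per_cycle * 2 * num_sms
    return flops_per_cycle * clock_mhz * 1e6 / 1e12


# Assumed per-SM rates for cc 7.0: 64 FP32 FMAs/cycle, 128 FP16 FMAs/cycle (half2).
# 80 SMs and 1455 MHz are illustrative values, not datasheet figures.
fp32_tflops = cuda_core_peak_tflops(64, 80, 1455)
fp16_tflops = cuda_core_peak_tflops(128, 80, 1455)
print(f"FP32: {fp32_tflops:.1f} TFLOPS, FP16 (half2): {fp16_tflops:.1f} TFLOPS")
```

Under these assumptions FP32 comes out near 15 TFLOPS and FP16 near 30 TFLOPS. The 2x only applies when the code actually operates on half2 pairs; scalar half arithmetic would not be expected to reach it.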

For Tensor Cores, you can find comparative numbers for different data types here: CUDA - Wikipedia (not officially from Nvidia, collected by the community). The V100, as the first Tensor Core GPU, is quite limited and supports only 16-bit inputs. With Tensor Cores, especially on consumer cards, throughput often differs depending on the accumulation width.

202476410arsmart · November 26, 2024, 12:50pm

“Volta can do 64 FMAs (16-bit) per Tensor core per cycle.” This is very helpful! Do you know the corresponding figure for Hopper?

202476410arsmart · November 26, 2024, 12:51pm

Oh, never mind. I see it here: CUDA - Wikipedia

system · December 10, 2024, 12:51pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
