I couldn’t find the TFLOPS value for FP16 on CUDA cores (non-Tensor) on NVIDIA’s official website… I only found pages for the H100 and A100. Could anyone kindly provide that number?
For example here:
https://www.nvidia.com/en-us/data-center/h100/
It does not show the FP16 throughput of the plain CUDA cores.
Non-Tensor FP16 should be double the FP32 throughput when operating on the half2 type, for add, multiply, and multiply-add, on compute capability 7.0. This is based on the published per-SM instruction throughputs.
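For illustration, a minimal sketch of that packed half2 path (the kernel name and sizes are mine, not from any NVIDIA sample; assumes a cc 7.0+ device):

```cpp
#include <cuda_fp16.h>
#include <cstdio>

// Each __hfma2 computes c = a*b + c on two packed FP16 values, so one
// instruction carries four FLOPs; this packed half2 path is where the
// 2x-over-FP32 FP16 throughput on compute capability 7.0 comes from.
__global__ void fma_half2(const half2* a, const half2* b, half2* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = __hfma2(a[i], b[i], c[i]);  // two FP16 FMAs in one instruction
}

int main() {
    const int n = 1 << 20;
    half2 *a, *b, *c;
    cudaMalloc(&a, n * sizeof(half2));
    cudaMalloc(&b, n * sizeof(half2));
    cudaMalloc(&c, n * sizeof(half2));
    fma_half2<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Timing this against an equivalent float kernel is one way to verify the 2x ratio on a cc 7.0 part.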
Volta can do 64 FMAs (16-bit) per Tensor Core per cycle.
One FMA counts as 2 FLOPs (a multiply plus an add).
Volta has 8 Tensor Cores per SM (later generations, beginning with Ampere, have 4 Tensor Cores per SM).
The full GV100 die has 84 SMs, but the V100 as shipped has 80 enabled, and,
depending on the sub-model, a boost clock between 1290 and 1455 MHz (or the base clock, depending on how you want to count).
Just multiply it all together; see the sketch below.
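To make that multiplication concrete, here is the arithmetic with the numbers from this thread (80 SMs, 1455 MHz boost clock; both vary by sub-model, so treat it as a sketch):

```cpp
// 64 FMA per Tensor Core per cycle
// * 2 FLOP per FMA
// * 8 Tensor Cores per SM
// * 80 SMs
// * 1.455e9 cycles per second (one example boost clock)
constexpr double peak_fp16_tc_flops = 64.0 * 2 * 8 * 80 * 1.455e9;
// ~= 1.19e14 FLOP/s, i.e. roughly 119 TFLOPS -- consistent with the
// 112-125 TFLOPS the V100 datasheet quotes across sub-models.
```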
The V100 datasheet indicates the FP16 Tensor Core perf.
Thanks!!! So… can I put it this way: FP16 throughput is 2x that of FP32, so the CUDA cores’ FP16 rate is about 30 TFLOPS, and the Tensor Cores’ is about 120 TFLOPS? (For the V100)
Yes, on V100 (compute capability 7.0) 16-bit is twice as fast (in throughput) as 32-bit; see the CUDA C++ Programming Guide (chapter “Arithmetic Instructions”). Depending on the architecture, the computation cores can handle a single bit width (e.g. 16, 32, or 64 bits) or several, and only integer, only floating-point, or both.
For Tensor Cores you can find comparative numbers for the different data types here: CUDA - Wikipedia (not officially from Nvidia; collected by the community). The V100, as the first Tensor Core GPU, is quite limited: 16-bit only. With Tensor Cores, especially on consumer cards, there is often a throughput difference between different accumulation widths.
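If you prefer to compute the non-Tensor peak yourself instead of looking it up, here is a rough sketch (the 64 FP32 cores per SM is hard-coded for compute capability 7.0, since cudaDeviceProp does not report it; the constant differs on other architectures):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int fp32_cores_per_sm = 64;  // valid for cc 7.0 (Volta) only
    // clockRate is reported in kHz.
    double fp32_tflops = prop.multiProcessorCount * fp32_cores_per_sm
                         * 2.0                      // FLOPs per FMA
                         * (prop.clockRate * 1e3)   // cycles per second
                         / 1e12;
    printf("FP32 peak:                ~%.1f TFLOPS\n", fp32_tflops);
    printf("FP16 peak (half2, no TC): ~%.1f TFLOPS\n", 2 * fp32_tflops);
    return 0;
}
```

On a V100 SXM2 this lands near ~15.7 FP32 / ~31 FP16 TFLOPS, matching the ~30 TFLOPS figure above.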
“Volta can do 64 FMAs (16-bit) per Tensor Core per cycle.” This is very helpful! Do you know the corresponding number for Hopper?
Oh, never mind. I see it here: CUDA - Wikipedia