FP32 and FP16 activity while a pure 32-bit float CUDA application is running

Hi all,

I am a CUDA newbie.

After studying some information about FP16 and FP32, I wonder whether the FP16 cores stay idle when I run a pure float (32-bit) CUDA application (e.g., a float-type matrix multiplication)?

Please help. Thanks.

I would expect the Tensor cores on Titan-V and Tesla V100 to always remain idle unless the dedicated Warp Matrix Multiply and Accumulate API in the nvcuda::wmma namespace is being used in the kernel code.

See here for details about this API (and its support in the cuBLAS and cuDNN libraries).

For 32-bit precision multiplications, the tensor cores simply don’t have enough precision (the multiply is 16-bit × 16-bit; the accumulation can be either 16- or 32-bit), so I don’t see how they could support such operations.
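To make the wmma point concrete, here is a minimal sketch (not from this thread) of how a single warp would use the nvcuda::wmma API to compute one 16×16×16 tile with FP16 inputs and FP32 accumulation. It assumes a Volta-class device (sm_70+) and leading dimensions of 16; a real kernel would tile over a larger matrix.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B + C for a single 16x16x16 tile.
// Inputs are half precision; accumulation is float.
__global__ void wmma_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);          // start from a zero accumulator
    wmma::load_matrix_sync(fa, a, 16);      // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);         // half multiply, float accumulate
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```

Note how the fragment types fix the precisions: the matrix_a/matrix_b fragments must be half, which is exactly why a pure FP32 multiplication cannot be routed through the tensor cores.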

Christian

Thanks for your quick answer. But I am not familiar with tensor cores. Does the same concept apply to CUDA cores?

Let me further simplify my example.

If I implement a CUDA application that performs a 32-bit float matrix multiplication, whose output is also 32-bit, do you mean the FP16 cores will stay idle while this application runs?

Then if I implement another CUDA application that performs a 16-bit half matrix multiplication, whose output may be 16-bit (half) or 32-bit (float), will both the FP32 and FP16 cores be active in that case?

Please correct me if I am wrong.

Thanks.

For architecture sm_60, and I suspect also for sm_53 (TX1) and sm_62 (TX2), AFAIK the FP16 throughput is implemented via a special mode on the FP32 cores. Therefore, in a given instruction cycle, either an FP32 instruction or an FP16 instruction (processing a half2) can be scheduled on that core, but not both.
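The "special mode" shows up at the source level as packed half2 arithmetic: one instruction operates on a pair of FP16 values, which is how these chips reach 2x the FP32 rate on the same pipeline. A hypothetical kernel (my own example, not from the thread) that would exercise it, compiled for sm_60:

```cuda
#include <cuda_fp16.h>

// Packed-FP16 AXPY sketch: each __hfma2 performs two FP16 fused
// multiply-adds in a single instruction on the shared FP32/FP16 unit.
// n2 is the number of half2 elements (i.e., half the number of halfs).
__global__ void axpy_half2(int n2, half2 a, const half2 *x, half2 *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        y[i] = __hfma2(a, x[i], y[i]);  // y = a * x + y, two lanes at once
}
```

If the scheduler issues this HFMA2 on a given cycle, that same core is not simultaneously issuing an FP32 FMA, which matches the "one or the other, not both" behavior described above.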

Support for the sm_60 claim is here:

https://devtalk.nvidia.com/default/topic/1001191/cuda-programming-and-performance/how-fp32-and-fp16-units-are-implemented-in-gp100-gpus/post/5115158/#5115158

At least on sm_60, according to this claim, the FP32 and FP16 cores are the same functional unit within the SM. I’m not sure you’ll find much actual NVIDIA documentation or specification to this effect. It could possibly be confirmed via microbenchmarking, although probably pretty difficult to do so.

Architectural arithmetic throughputs can be determined from this table:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

but they are not all guaranteed to be achievable at the same time.

Ah, OK. The OP was referring to the half-precision arithmetic supported by some specific chips
(Tesla P100, Quadro GP100, and Jetson TX1/TX2), not the tensor cores.

https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____HALF__ARITHMETIC.html
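For completeness, a tiny sketch of the scalar intrinsics from that half-arithmetic API (an illustrative kernel of my own, requiring sm_53 or newer):

```cuda
#include <cuda_fp16.h>

// Elementwise c = a * b + c in pure FP16 using the CUDA math API
// intrinsics __hmul and __hadd from the linked documentation.
__global__ void half_madd(int n, const __half *a, const __half *b, __half *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = __hadd(__hmul(a[i], b[i]), c[i]);
}
```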

Christian