FP32 and FP16 core activity while a pure 32-bit float CUDA application is running

Hi all,

I am new to CUDA.

After reading up on FP16 and FP32, I wonder whether the FP16 cores sit idle while I run a pure float (32-bit) CUDA application (e.g. a float matrix multiplication)?

Please help. Thanks.

I would expect the Tensor cores on Titan-V and Tesla V100 to always remain idle unless the dedicated Warp Matrix Multiply and Accumulate API in the nvcuda::wmma namespace is being used in the kernel code.

See here for details about this API (and its support in the cuBLAS and cuDNN libraries).

For 32-bit precision multiplications, the Tensor cores simply don’t have enough precision (the multiply takes 16-bit inputs, and the accumulation can be in either 16-bit or 32-bit precision), so I don’t see how they could support such operations.
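To make the point about the WMMA API concrete, here is a minimal sketch of a kernel that would actually engage the Tensor cores. It assumes an sm_70 or newer GPU and compilation with something like `nvcc -arch=sm_70`. The names `wmma_tile_16x16x16`, `check` and `run_wmma_ones` are made up for illustration; only the `nvcuda::wmma` calls are the real API. Note that the inputs have to be `half`; only the accumulator is `float`.

```cuda
#include <cassert>
#include <vector>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes a single 16x16 tile D = A * B.
// A and B are half precision; the accumulator is float.
__global__ void wmma_tile_16x16x16(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                 // D starts at zero
    wmma::load_matrix_sync(a_frag, a, 16);               // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // acc = A * B + acc
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

// Hypothetical helper: abort on any CUDA runtime error.
static void check(cudaError_t err) { assert(err == cudaSuccess); (void)err; }

// Hypothetical host wrapper: fills A and B with ones and returns D[0].
float run_wmma_ones() {
    const int n = 16 * 16;
    std::vector<half> ones(n, __float2half(1.0f));
    half *d_a = nullptr, *d_b = nullptr;
    float *d_d = nullptr;
    check(cudaMalloc(&d_a, n * sizeof(half)));
    check(cudaMalloc(&d_b, n * sizeof(half)));
    check(cudaMalloc(&d_d, n * sizeof(float)));
    check(cudaMemcpy(d_a, ones.data(), n * sizeof(half), cudaMemcpyHostToDevice));
    check(cudaMemcpy(d_b, ones.data(), n * sizeof(half), cudaMemcpyHostToDevice));

    wmma_tile_16x16x16<<<1, 32>>>(d_a, d_b, d_d);        // exactly one warp
    check(cudaGetLastError());

    float first = -1.0f;
    check(cudaMemcpy(&first, d_d, sizeof(float), cudaMemcpyDeviceToHost));
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_d);
    return first;
}
```

Calling `run_wmma_ones()` from `main` should give 16, since every output element is a dot product of sixteen ones. A plain `float` matrix multiply kernel contains none of these `wmma::` calls, which is why I would expect it to leave the Tensor cores untouched.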


Thanks for your quick answer. But I am not familiar with Tensor cores. Does the same concept apply to CUDA cores?

Let me further simplify my example.

Suppose I implement a CUDA application that does a 32-bit float matrix multiplication, where the output is also 32-bit. Do you mean the FP16 cores will stay idle while this application runs?

Then suppose I implement another CUDA application that does a 16-bit half matrix multiplication, where the output may be 16-bit (half) or 32-bit (float). In this case, will both the FP32 and FP16 cores be active?

Please correct me if I am wrong.


For architecture sm_60, and I suspect for architectures sm_53 (TX1) and sm_62 (TX2), AFAIK the FP16 throughput is implemented via a special mode on the FP32 cores. Therefore, in a given instruction cycle, either an FP32 instruction or an FP16 instruction (processing half2) could be scheduled on that core, but not both.
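For reference, this is roughly what "an FP16 instruction processing half2" looks like from the CUDA C++ side: a sketch only, assuming sm_53 or newer (e.g. `nvcc -arch=sm_60`) and a toolkit recent enough that the `cuda_fp16.h` conversion functions are callable from host code. The names `haxpy2`, `check` and `run_haxpy2` are made up for illustration; `__hfma2` is the real intrinsic, and each call performs two FP16 fused multiply-adds on one packed `__half2` value.

```cuda
#include <cassert>
#include <vector>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// y = a * x + y on packed pairs of half values.
// n2 is the number of __half2 elements, i.e. half the number of scalars.
__global__ void haxpy2(int n2, __half2 a, const __half2 *x, __half2 *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        y[i] = __hfma2(a, x[i], y[i]);  // two FP16 FMAs in one instruction
    }
}

// Hypothetical helper: abort on any CUDA runtime error.
static void check(cudaError_t err) { assert(err == cudaSuccess); (void)err; }

// Hypothetical host wrapper: runs haxpy2 on constant inputs and returns
// the two lanes of the first output element as a float2.
float2 run_haxpy2() {
    const int n2 = 1024;
    std::vector<__half2> hx(n2, __floats2half2_rn(1.0f, 2.0f));
    std::vector<__half2> hy(n2, __floats2half2_rn(0.5f, 0.25f));
    __half2 *dx = nullptr, *dy = nullptr;
    check(cudaMalloc(&dx, n2 * sizeof(__half2)));
    check(cudaMalloc(&dy, n2 * sizeof(__half2)));
    check(cudaMemcpy(dx, hx.data(), n2 * sizeof(__half2), cudaMemcpyHostToDevice));
    check(cudaMemcpy(dy, hy.data(), n2 * sizeof(__half2), cudaMemcpyHostToDevice));

    haxpy2<<<(n2 + 255) / 256, 256>>>(n2, __floats2half2_rn(2.0f, 2.0f), dx, dy);
    check(cudaGetLastError());

    check(cudaMemcpy(hy.data(), dy, n2 * sizeof(__half2), cudaMemcpyDeviceToHost));
    cudaFree(dx);
    cudaFree(dy);
    return make_float2(__low2float(hy[0]), __high2float(hy[0]));
}
```

With these inputs the low lane is 2 * 1 + 0.5 = 2.5 and the high lane is 2 * 2 + 0.25 = 4.25, both exactly representable in FP16. A kernel that only uses `float` never emits this kind of instruction, so on these architectures there would be no separate FP16 work to schedule.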

Support for the sm_60 claim is here:


At least on sm_60, according to this claim, the FP32 and FP16 cores are the same functional unit within the SM. I’m not sure you’ll find much actual NVIDIA documentation or specification to this effect. It could possibly be confirmed via microbenchmarking, although that would probably be pretty difficult to do.
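If someone wants to try the microbenchmarking route, one possible starting point is sketched below; treat it as an untested idea rather than a validated methodology. It times an FP32-only FMA loop, a half2-only FMA loop, and a loop that interleaves both. The kernel `fma_chain` and the helper `time_ms` are hypothetical names, and the launch configuration and constants are arbitrary assumptions. Compile for sm_53 or newer.

```cuda
#include <cassert>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Runs four independent FMA chains per enabled data type so that
// instruction latency can be overlapped within each thread.
template <bool DoF32, bool DoF16>
__global__ void fma_chain(float *out32, __half2 *out16, int iters) {
    const float fa = 0.999f, fb = 0.001f;
    const __half2 ha = __float2half2_rn(0.999f);
    const __half2 hb = __float2half2_rn(0.001f);
    float f[4];
    __half2 h[4];
    for (int k = 0; k < 4; ++k) {
        f[k] = static_cast<float>(threadIdx.x + k);
        h[k] = __float2half2_rn(static_cast<float>(k));
    }
    for (int i = 0; i < iters; ++i) {
#pragma unroll
        for (int k = 0; k < 4; ++k) {
            if (DoF32) f[k] = fmaf(f[k], fa, fb);
            if (DoF16) h[k] = __hfma2(h[k], ha, hb);
        }
    }
    // Store the results so the compiler cannot discard the loops.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (DoF32) out32[tid] = f[0] + f[1] + f[2] + f[3];
    if (DoF16) out16[tid] = __hadd2(__hadd2(h[0], h[1]), __hadd2(h[2], h[3]));
}

// Hypothetical helper: returns the elapsed time in milliseconds of one
// timed launch, measured with CUDA events after a warm-up launch.
template <bool DoF32, bool DoF16>
float time_ms(int iters) {
    const int blocks = 1024, threads = 256;
    float *o32 = nullptr;
    __half2 *o16 = nullptr;
    cudaError_t err = cudaMalloc(&o32, blocks * threads * sizeof(float));
    assert(err == cudaSuccess);
    err = cudaMalloc(&o16, blocks * threads * sizeof(__half2));
    assert(err == cudaSuccess);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    fma_chain<DoF32, DoF16><<<blocks, threads>>>(o32, o16, iters);  // warm-up
    cudaEventRecord(start);
    fma_chain<DoF32, DoF16><<<blocks, threads>>>(o32, o16, iters);  // timed
    cudaEventRecord(stop);
    err = cudaEventSynchronize(stop);
    assert(err == cudaSuccess);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(o32);
    cudaFree(o16);
    return ms;
}
```

The idea would be to compare `time_ms<true, true>(n)` against `time_ms<true, false>(n)` and `time_ms<false, true>(n)` for some large `n`. If the mixed time is close to the sum of the other two, that would be consistent with both instruction types sharing one functional unit; if it is closer to the larger of the two, that would suggest separate units. Compiler optimizations, issue-rate limits and clock behavior can all blur the picture, which is part of why I say it is difficult.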

Architectural arithmetic throughputs can be determined from this table:


but they are not all guaranteed to be achievable at the same time.

Ah, OK. The OP was referring to the half-precision arithmetic supported by some specific chips (Tesla P100, Quadro GP100, and Jetson TX1/TX2), not to the Tensor cores.