Question regarding Tensor Cores/GV100

Hi

I was referring to the article published by AnandTech here.

  1. Based on the article, I have some questions about the precision of compute in the Tensor Cores. Do the Tensor Cores support 8-bit compute, 16-bit FP, or 32-bit FP?

  2. If they support more than one of the above, what is the peak performance at each precision? The article mentions a peak of 120 TeraFlops, but at what precision?

  3. The article also mentions that the FP32 units in the CUDA cores can each be used to perform 2 FP16 ops. If the Tensor Core is FP16, why create two different sets of FP16 units? Also, if both the CUDA cores and the Tensor Cores can do FP16, can they be used simultaneously to achieve 150 (= 120 + 30) TeraFlops?

Thanks
Alok

The TensorCore supports a particular (hybrid) combination of 16-bit and 32-bit floating point for a matrix-matrix multiply operation. I suggest you review the Programming Tensor Cores section here:

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/

The multiplication input data (A,B) is always 16-bit floating point. The input offset matrix (C) can be either 16-bit floating point or 32-bit floating point. The output (D) can be 16-bit floating point or 32-bit floating point.

There is only one operation (a matrix-matrix multiply-and-accumulate), and it uses a combination of 16-bit and 32-bit arithmetic. Across the entire GV100 die, the combined 16-bit and 32-bit flops delivered during this hybrid multiply-and-accumulate reach a peak theoretical rate of 120 TFlops.
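For reference, here is roughly what that operation looks like through the WMMA API covered in the blog post linked above: FP16 inputs for A and B, with an FP32 accumulator for C and D. Treat this as a minimal single-tile sketch (one warp, one 16x16x16 tile), not tuned or production code:

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes D = A*B + C for a single 16x16x16 tile.
    // A and B are FP16; the accumulator (C/D) is FP32 -- the "hybrid" mode.
    __global__ void wmma_single_tile(const half *a, const half *b,
                                     const float *c, float *d)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::load_matrix_sync(a_frag, a, 16);                        // A: FP16
        wmma::load_matrix_sync(b_frag, b, 16);                        // B: FP16
        wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major); // C: FP32
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // D = A*B + C
        wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
    }

It is launched with a single warp (e.g. <<<1, 32>>>) and needs CUDA 9 and sm_70 or later.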

It should be clear by now that the TensorCore is not a general-purpose floating-point ALU, and it would not be used for arbitrary floating-point arithmetic. Therefore, 16-bit, 32-bit, and 64-bit floating-point ALUs are still provided for general-purpose arithmetic, i.e. any use case that does not map onto the hybrid FP16-input matrix-matrix multiply-and-accumulate.

Hi

Thanks for the explanation; that clears up my confusion about the precision. Are there any Tensor Core benchmarks (apart from those provided by NVIDIA) that you could direct me to?
Thanks again for your help.

Regards
Alok

What is the input offset matrix (C)? Isn't the bias/offset usually just added to the input (A) matrix as an extra input column?

And why is it a matrix, not a column vector?

If A and B are 4x4 matrices, their product is a 4x4 matrix. To this product, a third 4x4 matrix (C) is added element-wise.

So, no, it is not just added to the input A, as that would affect the product result, and it is not an extra input column.
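In plain scalar terms, the per-tile arithmetic is just this (a rough reference sketch only, with precision handling set aside; it is not how the hardware actually executes the operation):

    // Scalar reference for one 4x4 tile: D = A*B + C.
    // C is added element-wise to the finished product -- it is not appended
    // to A as an extra input column.
    void tile_mma_reference(const float A[4][4], const float B[4][4],
                            const float C[4][4], float D[4][4])
    {
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {
                float acc = C[i][j];          // per-element offset from C
                for (int k = 0; k < 4; ++k)
                    acc += A[i][k] * B[k][j]; // the multiply-accumulate part
                D[i][j] = acc;
            }
    }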

Please review the link I already gave; it includes a pictorial example of this.

OK, so the point of the Tensor Core is to do tiling on a larger A x B multiply.
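Something like this, I guess, adapting the blog's example (dimensions assumed to be multiples of 16, one warp per 16x16 output tile, grid sized to match, no bounds checks; I haven't verified or tuned this):

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Each warp walks the K dimension in 16-wide steps and accumulates one
    // 16x16 tile of D = A*B (A row-major MxK, B col-major KxN, D row-major MxN).
    __global__ void tiled_wmma_gemm(const half *A, const half *B, float *D,
                                    int M, int N, int K)
    {
        int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize; // tile row
        int warpN =  blockIdx.y * blockDim.y + threadIdx.y;             // tile column

        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
        wmma::fill_fragment(acc, 0.0f);

        for (int k = 0; k < K; k += 16) {
            wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K); // A tile
            wmma::load_matrix_sync(b_frag, B + warpN * 16 * K + k, K); // B tile
            wmma::mma_sync(acc, a_frag, b_frag, acc);                  // accumulate
        }
        wmma::store_matrix_sync(D + warpM * 16 * N + warpN * 16, acc, N,
                                wmma::mem_row_major);
    }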

I wish the TensorCore unit supported two additional modes:

  • 4x4 matrix multiply -- perhaps occupying 4 lanes and 2 32-bit registers per matrix
  • 16x16 element-wise FMA

Features get added based on the bang-for-the-buck they provide. If you can tie those operations to particular, widely used algorithms, that will add weight to your feature request. Also, you can always submit an RFE.

The TensorCore “team” is actively interested in hearing about the ways you’d like to use TensorCore. If you need a 4x4 op exposed at the warp level (as you indicated), or if, for example, you’d like to see an 8x32 multiply exposed in addition to the 16x16 multiply op, that information is very interesting, but it also helps to have some motivation behind it.

The Tensor Core really is a game changer. I would like a whole GPU just for multiply-and-accumulate of matrices.

The whole machine-learning industry is just simple MACs and element-wise non-linearities. Rinse and repeat.

They give it fancy names, but it's just brute-force MAC (and cheese).

You could do it 200x faster if you made a deep-learning GPU (well, basically an ASIC, right?). That's why Google did it. NVIDIA are a bit slow, but they're getting the idea now!

Trouble is, an NVIDIA "Z100" ASIC won't play games, so it will cost too much for an average Joe like me who has to buy his own hardware. Gaming GPUs commoditised machine learning and allowed average me to compete with the world. Machine learning will go back into the hands of the rich and the corporations.