Question regarding Tensor Cores/GV100

Hi

I was referring to the article published by AnandTech here.

  1. Based on the article, I have some questions about the precision of compute in the Tensor Cores. Do the Tensor Cores support 8-bit compute, 16-bit FP, or 32-bit FP?

  2. If they support more than one of the above, what is the peak performance at each precision? The article mentions a peak of 120 TeraFlops, but at what precision?

  3. The article also mentions that the FP32 units in the CUDA cores can each be used to perform 2 FP16 ops. If the Tensor Core is FP16, why create two different sets of FP16 units? Also, if both the CUDA cores and the Tensor Cores can do FP16, can they be used simultaneously to achieve 150 (= 120 + 30) TeraFlops?

Thanks
Alok

The TensorCore supports a particular (hybrid) combination of 16-bit and 32-bit floating point for a matrix-matrix multiply operation. I suggest you review the Programming Tensor Cores section here:

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/

The multiplication input data (A,B) is always 16-bit floating point. The input offset matrix (C) can be either 16-bit floating point or 32-bit floating point. The output (D) can be 16-bit floating point or 32-bit floating point.

There is only one operation (a matrix-matrix multiply-and-accumulate), and it uses a combination of 16-bit and 32-bit arithmetic. Across the entire GV100 die, the combined 16-bit and 32-bit flops delivered during this hybrid multiply-and-accumulate reach a peak theoretical rate of 120 TFlops.
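For reference, here is roughly what that operation looks like through the WMMA API covered in the blog post linked above: FP16 inputs for A and B, with an FP32 accumulator for C and D. Treat this as a minimal single-tile sketch (one warp, one 16x16x16 tile), not tuned or production code:

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes D = A*B + C for a single 16x16x16 tile.
    // A and B are FP16; the accumulator (C/D) is FP32 -- the "hybrid" mode.
    __global__ void wmma_single_tile(const half *a, const half *b,
                                     const float *c, float *d)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::load_matrix_sync(a_frag, a, 16);                        // A: FP16
        wmma::load_matrix_sync(b_frag, b, 16);                        // B: FP16
        wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major); // C: FP32
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // D = A*B + C
        wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
    }

It is launched with a single warp (e.g. <<<1, 32>>>) and needs CUDA 9 and sm_70 or later.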

It should be clear by now that the TensorCore is not a general-purpose floating-point ALU, and it would not be used for arbitrary floating-point arithmetic. Therefore, 16-bit, 32-bit, and 64-bit floating-point ALUs are still provided for general-purpose arithmetic, i.e. any use case that does not map onto the hybrid FP16-input matrix-matrix multiply-and-accumulate.

Hi

Thanks for the explanation; that clears up my confusion about the precision. Are there any Tensor Core benchmarks (apart from those provided by NVIDIA) that you could direct me to?
Thanks again for your help.

Regards
Alok

What is the input offset matrix (C)? Isn't the bias/offset usually just added to the input (A) matrix as an extra input column?

And why is it a matrix, not a column vector?

If A and B are 4x4 matrices, their product is a 4x4 matrix. To this product, a third 4x4 matrix (C) is added element-wise.

So, no, it is not just added to the input A, as that would affect the product result, and it is not an extra input column.
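In plain scalar terms, the per-tile arithmetic is just this (a rough reference sketch only, with precision handling set aside; it is not how the hardware actually executes the operation):

    // Scalar reference for one 4x4 tile: D = A*B + C.
    // C is added element-wise to the finished product -- it is not appended
    // to A as an extra input column.
    void tile_mma_reference(const float A[4][4], const float B[4][4],
                            const float C[4][4], float D[4][4])
    {
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {
                float acc = C[i][j];          // per-element offset from C
                for (int k = 0; k < 4; ++k)
                    acc += A[i][k] * B[k][j]; // the multiply-accumulate part
                D[i][j] = acc;
            }
    }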

Please review the link I already gave; it includes a pictorial example of this.

OK, so the point of the Tensor Core is to do tiling on a larger A x B multiply.
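Something like this, I guess, adapting the blog's example (dimensions assumed to be multiples of 16, one warp per 16x16 output tile, grid sized to match, no bounds checks; I haven't verified or tuned this):

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Each warp walks the K dimension in 16-wide steps and accumulates one
    // 16x16 tile of D = A*B (A row-major MxK, B col-major KxN, D row-major MxN).
    __global__ void tiled_wmma_gemm(const half *A, const half *B, float *D,
                                    int M, int N, int K)
    {
        int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize; // tile row
        int warpN =  blockIdx.y * blockDim.y + threadIdx.y;             // tile column

        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
        wmma::fill_fragment(acc, 0.0f);

        for (int k = 0; k < K; k += 16) {
            wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K); // A tile
            wmma::load_matrix_sync(b_frag, B + warpN * 16 * K + k, K); // B tile
            wmma::mma_sync(acc, a_frag, b_frag, acc);                  // accumulate
        }
        wmma::store_matrix_sync(D + warpM * 16 * N + warpN * 16, acc, N,
                                wmma::mem_row_major);
    }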

I wish the TensorCore unit supported two additional modes:

  • 4x4 matrix multiply -- perhaps occupying 4 lanes and 2 32-bit registers per matrix
  • 16x16 element-wise FMA

Features get added based on the bang-for-the-buck they provide. If you can tie those operations to particular, widely used algorithms, that will add weight to your feature request. Also, you can always submit an RFE.

The TensorCore “team” is actively interested in hearing about the ways you’d like to use TensorCore. If you need a 4x4 op exposed at the warp level (as you indicated), or if, for example, you’d like to see an 8x32 multiply exposed in addition to the 16x16 multiply op, that information is very interesting, but it also helps to have some motivation behind it.

The Tensor Core really is a game changer. I would like a whole GPU just for multiply-and-accumulate of matrices.

The whole machine-learning industry is just simple MACs and element-wise non-linearities. Rinse and repeat.

They give it fancy names, but it's just brute-force MAC (and cheese).

You could do it 200x faster if you made a deep-learning GPU (well, basically an ASIC, right?). That's why Google did it. NVIDIA are a bit slow, but they're getting the idea now!

Trouble is, an NVIDIA "Z100" ASIC won't play games, so it will cost too much for an average Joe like me who has to buy his own hardware. Gaming GPUs commoditised machine learning and allowed average me to compete with the world. Machine learning will go back into the hands of the rich and the corporations.