Deep Learning usually employs layered artificial neural nets.

Each neuron in each layer has a set of connection weights (multiplicative factors) which are used to compute the neuron output, based on the outputs from the previous layer.

y_i = F(w1x1 + w2x2 + w3x3 + …)

where F is the neuron's activation function.

Therefore, each neuron in layer y has a corresponding w vector of weights, which are used to multiply the outputs of the previous layer (x vector).

Taken together, the layer y computation is a matrix-vector multiply. With some additional magic, such as batching inputs together, or lowering convolutions to matrix form in convolutional neural nets, we can convert this matrix-vector multiply into a matrix-matrix multiply.
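Here's a rough NumPy sketch of both steps (the sizes are just illustrative): one weight row per neuron gives a matrix-vector multiply for the whole layer, and stacking a batch of inputs as columns turns it into a matrix-matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a layer of 4 neurons fed by 3 inputs.
W = rng.standard_normal((4, 3))   # one row of weights per neuron in layer y
x = rng.standard_normal(3)        # outputs of the previous layer (x vector)

y = W @ x                         # the whole layer: one matrix-vector multiply

# Batching: stack several input vectors as columns; the same weights
# process all of them in a single matrix-matrix multiply.
X = rng.standard_normal((3, 8))   # batch of 8 input vectors
Y = W @ X                         # shape (4, 8): layer outputs, one column per sample
```

(Activation functions are omitted here; the point is just that the heavy lifting is a matrix multiply.)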

The TensorCore accelerates these matrix-matrix multiply operations in 4x4 chunks: each TensorCore operation computes D = A*B + C on 4x4 matrices.

For neural networks, it may be sufficient to express the weights (the W matrix) as FP16 quantities, and likewise to express individual neuron outputs as FP16 quantities. However, computing the matrix-matrix product may work better if the accumulation is done into an FP32 reduction variable.
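You can see why with a quick NumPy experiment (this just emulates the mixed-precision idea on the CPU; it is not actual TensorCore code). The products are FP16 either way; only the type of the reduction variable changes:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.random(10000).astype(np.float16)   # values in [0, 1)
b = rng.random(10000).astype(np.float16)

products = a * b                           # FP16 products, as on a TensorCore

acc16 = np.float16(0)
for p in products:                         # FP16 reduction variable
    acc16 = np.float16(acc16 + p)

acc32 = np.float32(0)
for p in products:                         # FP32 reduction variable
    acc32 = acc32 + np.float32(p)

exact = float(np.sum(products, dtype=np.float64))
# The FP32 accumulator stays close to the exact sum. The FP16 accumulator
# stalls: once the running sum is large, each small product is below half
# the FP16 spacing at that magnitude and rounds away entirely.
```

With these inputs the true sum is around 2500, which the FP16 accumulator cannot even reach, while the FP32 accumulator is off by a tiny fraction.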

This is a hand-waving description of the motivation for this type of operation with hybrid (mixed FP16/FP32) data.

This description tends to apply more to the training operation, which is not exactly what I described above but similar, and which also uses matrix-matrix multiplies to adjust the weights. For inference operations, it may also be interesting to use even further reduced-precision datatypes such as INT8.