How to choose the size of a tensor core operation?


In the table, different data types have different acceptable sizes. How does the computation time scale with the chosen size? For example, for the same data type, would choosing 8x8x16 take twice as long as 8x8x8? And how does the time consumption vary across different data types?

For the most part, NVIDIA doesn't spell out the latencies of instructions (how many cycles elapse from the cycle an instruction is issued to the cycle its results are first ready), and this is also true for tensor core instructions.

However, when these instructions are issued in bulk (lots of them, carefully arranged, device-wide), the overall aggregate throughput should be comparable to the stated maximums. You can find stated throughput maximums (these are peak theoretical numbers, not achievable in practice) in the various GPU white papers.
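To illustrate the "in bulk" idea, here is a minimal sketch of my own (not an NVIDIA sample) that issues many back-to-back FP16 WMMA ops per warp, across many warps, and measures the aggregate rate with CUDA events. The kernel geometry, iteration count, and the benign store race are all simplifications for the sketch; compile for sm_70 or later, and expect the achieved number to fall short of the white-paper peak:

```
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Each warp performs ITERS back-to-back 16x16x16 FP16 WMMA ops
// (FP16 accumulate), keeping the tensor cores busy so the aggregate
// rate can be compared against white-paper peaks.
__global__ void tc_bulk_kernel(const __half *a, const __half *b,
                               __half *c, int iters) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, __half> fc;

    wmma::load_matrix_sync(fa, a, 16);
    wmma::load_matrix_sync(fb, b, 16);
    wmma::fill_fragment(fc, __float2half(0.0f));

    for (int i = 0; i < iters; ++i)
        wmma::mma_sync(fc, fa, fb, fc);

    // All warps store to the same tile; the race is benign for a
    // throughput sketch since the result is never checked.
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}

int main() {
    const int iters = 10000, blocks = 512, warpsPerBlock = 8;
    __half *a, *b, *c;
    cudaMalloc(&a, 16 * 16 * sizeof(__half));
    cudaMalloc(&b, 16 * 16 * sizeof(__half));
    cudaMalloc(&c, 16 * 16 * sizeof(__half));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    tc_bulk_kernel<<<blocks, 32 * warpsPerBlock>>>(a, b, c, iters); // warm-up
    cudaEventRecord(start);
    tc_bulk_kernel<<<blocks, 32 * warpsPerBlock>>>(a, b, c, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each 16x16x16 MMA is 2*16*16*16 = 8192 flops per warp.
    double flops = 8192.0 * iters * blocks * warpsPerBlock;
    printf("achieved: %.1f TFLOPS (FP16 in, FP16 accumulate)\n",
           flops / (ms * 1e9));
    return 0;
}
```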

Another thing to point out is that tensor core ops come in categories based on the input data type. The currently supported tensor core input data types include FP64, TF32, BF16, FP16, INT8, INT4, and INT1. These data types are all specified and discussed in the documentation, in forum articles, and in public specifications.

A third thing to point out, as a rule of thumb (I'm not stating this as a universal, specified guarantee, just something that is generally true), is that instructions that involve the same input type often make use of the TC unit in the same way, and can be assumed to "take the same amount of time". So I would not assume there is any throughput difference, or latency difference, between a 16x16 INT8 TC op and a 32x8 INT8 TC op, on a GPU that supports both.

So the __half data type corresponds to FP16, while unsigned char and signed char are both INT8 variants.
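To make the type mapping and the shape rule of thumb concrete, here is a hedged sketch of my own using the standard CUDA WMMA API. It assumes sm_72 or later (for the INT8 fragments) and tightly packed 16-wide source tiles; per the rule of thumb above, the expectation is that the two INT8 shapes run at the same rate:

```
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

__global__ void shape_demo(const __half *a16, const signed char *a8) {
    // __half maps to the FP16 TC paths...
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> fa_fp16;

    // ...and signed char (or unsigned char) maps to the INT8 paths.
    // Two different tile shapes, same INT8 input type: per the rule
    // of thumb, expect these to use the TC unit the same way.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> fa_i8_16x16;
    wmma::fragment<wmma::matrix_a, 32, 8, 16, signed char, wmma::row_major> fa_i8_32x8;

    wmma::load_matrix_sync(fa_fp16, a16, 16);
    wmma::load_matrix_sync(fa_i8_16x16, a8, 16);
    wmma::load_matrix_sync(fa_i8_32x8, a8, 16);
}
```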

Picking an example, consider the GA102 white paper, e.g. p47: we can see that INT8 throughput is generally twice as high as FP16 throughput with FP16 accumulate (i.e. FP16 for both the input and output data types). So on an RTX 3070, I would expect that an unsigned char or signed char input TC op could be done at twice the rate of a __half input, __half output FP16 TC op. Furthermore, studying that table for the RTX 3070 GPU, you will note that choosing float output for a FP16 TC op cuts the throughput in half.
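If it helps, that relative comparison can be turned into arithmetic. The sketch below uses normalized, made-up rates that merely encode the ratios just described; read the actual TFLOPS/TOPS values for your GPU out of the white paper table before trusting any numbers:

```
#include <cstdio>

int main() {
    // Normalized, illustrative peak rates in the style of the GA102
    // white paper table (p47). Not real numbers; only the ratios matter.
    const double rate_fp16_fp16acc = 2.0; // FP16 in, FP16 accumulate
    const double rate_fp16_fp32acc = 1.0; // FP16 in, float accumulate: half rate
    const double rate_int8         = 4.0; // INT8: twice the FP16/FP16 rate

    const double m = 4096, n = 4096, k = 4096;
    const double ops = 2.0 * m * n * k;   // one multiply-add = 2 ops

    printf("estimated time ratio, FP16/FP16 acc vs INT8: %.1fx\n",
           (ops / rate_fp16_fp16acc) / (ops / rate_int8));
    printf("estimated time ratio, FP16/FP32 acc vs INT8: %.1fx\n",
           (ops / rate_fp16_fp32acc) / (ops / rate_int8));
    return 0;
}
```

Under those illustrative ratios, the FP16 op with float accumulate comes out 4x slower than INT8: the two factor-of-two steps described above, compounded.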

That sort of process is about the extent of the information I would normally use to judge the relative performance of these different types of TC ops.

With a bit of effort and research, you can make a few more observations, but I’m not sure they meaningfully educate the programmer any further than the relative comparison methodology I outlined here.

The reason that the programmer may be offered both an FP16 32x8 TC op as well as a 16x16 TC op to choose from has to do with their utility for different types of problems and matrix shapes.
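For instance, here is a hypothetical sketch of my own (assuming M is a multiple of 32, K a multiple of 16, one warp launched per 32-row tile, and sm_70 or later) where the 32x8x16 shape tiles a tall, narrow output matrix exactly, whereas 16x16 tiles would compute output columns that don't exist:

```
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes one 32x8 output tile of C = A*B, where A is M x K
// (row-major), B is K x 8 (col-major), and C is M x 8 (row-major).
// With N = 8, the 32x8x16 shape covers C with no wasted columns.
__global__ void tall_skinny_gemm(const __half *a, const __half *b,
                                 float *c, int M, int K) {
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (warpM * 32 >= M) return;

    wmma::fragment<wmma::matrix_a, 32, 8, 16, __half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 32, 8, 16, __half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 32, 8, 16, float> fc;
    wmma::fill_fragment(fc, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(fa, a + warpM * 32 * K + k, K); // 32x16 slab of A
        wmma::load_matrix_sync(fb, b + k, K);                  // 16x8 slab of B
        wmma::mma_sync(fc, fa, fb, fc);
    }
    wmma::store_matrix_sync(c + warpM * 32 * 8, fc, 8, wmma::mem_row_major);
}
```

A kernel built around 16x16x16 tiles would suit squarer matrices instead; per the rule of thumb above, the hardware rate is the same either way, so the shape choice is about fit, not speed.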