How to choose the size of a tensor core operation?


In the table, different data types have different acceptable sizes. How does the computation time scale with the chosen size? For example, for the same data type, would choosing 8x8x16 take twice as long as 8x8x8? And how does the time consumption vary across different data types?

For the most part, NVIDIA doesn't spell out the latencies of instructions (how many cycles elapse from the cycle an instruction is issued to the cycle its results are first ready), and this is also true for tensor core instructions.

However, when these instructions are issued in bulk (lots of them, carefully arranged, device-wide), the overall aggregate throughput should be comparable to the stated maximums. You can find stated throughput maximums (these are peak theoretical numbers, not achievable in practice) in the various GPU white papers.
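To illustrate the "in bulk" idea, here is a minimal sketch of my own (not an NVIDIA sample) that issues many back-to-back FP16 WMMA ops per warp, across many warps, and measures the aggregate rate with CUDA events. The kernel geometry, iteration count, and the benign store race are all simplifications for the sketch; compile for sm_70 or later, and expect the achieved number to fall short of the white-paper peak:

```
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Each warp performs ITERS back-to-back 16x16x16 FP16 WMMA ops
// (FP16 accumulate), keeping the tensor cores busy so the aggregate
// rate can be compared against white-paper peaks.
__global__ void tc_bulk_kernel(const __half *a, const __half *b,
                               __half *c, int iters) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, __half> fc;

    wmma::load_matrix_sync(fa, a, 16);
    wmma::load_matrix_sync(fb, b, 16);
    wmma::fill_fragment(fc, __float2half(0.0f));

    for (int i = 0; i < iters; ++i)
        wmma::mma_sync(fc, fa, fb, fc);

    // All warps store to the same tile; the race is benign for a
    // throughput sketch since the result is never checked.
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}

int main() {
    const int iters = 10000, blocks = 512, warpsPerBlock = 8;
    __half *a, *b, *c;
    cudaMalloc(&a, 16 * 16 * sizeof(__half));
    cudaMalloc(&b, 16 * 16 * sizeof(__half));
    cudaMalloc(&c, 16 * 16 * sizeof(__half));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    tc_bulk_kernel<<<blocks, 32 * warpsPerBlock>>>(a, b, c, iters); // warm-up
    cudaEventRecord(start);
    tc_bulk_kernel<<<blocks, 32 * warpsPerBlock>>>(a, b, c, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each 16x16x16 MMA is 2*16*16*16 = 8192 flops per warp.
    double flops = 8192.0 * iters * blocks * warpsPerBlock;
    printf("achieved: %.1f TFLOPS (FP16 in, FP16 accumulate)\n",
           flops / (ms * 1e9));
    return 0;
}
```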

Another thing to point out is that tensor core ops come in categories based on the input data type. The currently supported tensor core input data types include FP64, TF32, BF16, FP16, INT8, INT4, and INT1. These data types are all specified and discussed in the documentation, in forum articles, and in public specifications.

A third thing to point out, as a rule of thumb (I'm not stating this as a universal, specified guarantee, just something that is generally true), is that instructions that involve the same input type often make use of the TC unit in the same way, and can be assumed to "take the same amount of time". So I would not assume there is any throughput difference, or latency difference, between a 16x16 INT8 TC op and a 32x8 INT8 TC op, on a GPU that supports both.

So the __half data type corresponds to FP16, while unsigned char and signed char are both INT8 variants.
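To make the type mapping and the shape rule of thumb concrete, here is a hedged sketch of my own using the standard CUDA WMMA API. It assumes sm_72 or later (for the INT8 fragments) and tightly packed 16-wide source tiles; per the rule of thumb above, the expectation is that the two INT8 shapes run at the same rate:

```
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

__global__ void shape_demo(const __half *a16, const signed char *a8) {
    // __half maps to the FP16 TC paths...
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> fa_fp16;

    // ...and signed char (or unsigned char) maps to the INT8 paths.
    // Two different tile shapes, same INT8 input type: per the rule
    // of thumb, expect these to use the TC unit the same way.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> fa_i8_16x16;
    wmma::fragment<wmma::matrix_a, 32, 8, 16, signed char, wmma::row_major> fa_i8_32x8;

    wmma::load_matrix_sync(fa_fp16, a16, 16);
    wmma::load_matrix_sync(fa_i8_16x16, a8, 16);
    wmma::load_matrix_sync(fa_i8_32x8, a8, 16);
}
```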

Picking an example, consider the GA102 white paper, e.g. p47: we can see that INT8 throughput is generally twice as high as FP16 throughput with FP16 accumulate (i.e. FP16 for both the input and output data types). So on an RTX 3070, I would expect that an unsigned char or signed char input TC op could be done at twice the rate of a __half input, __half output FP16 TC op. Furthermore, studying that table for the RTX 3070 GPU, you will note that choosing float output for a FP16 TC op cuts the throughput in half.
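If it helps, that relative comparison can be turned into arithmetic. The sketch below uses normalized, made-up rates that merely encode the ratios just described; read the actual TFLOPS/TOPS values for your GPU out of the white paper table before trusting any numbers:

```
#include <cstdio>

int main() {
    // Normalized, illustrative peak rates in the style of the GA102
    // white paper table (p47). Not real numbers; only the ratios matter.
    const double rate_fp16_fp16acc = 2.0; // FP16 in, FP16 accumulate
    const double rate_fp16_fp32acc = 1.0; // FP16 in, float accumulate: half rate
    const double rate_int8         = 4.0; // INT8: twice the FP16/FP16 rate

    const double m = 4096, n = 4096, k = 4096;
    const double ops = 2.0 * m * n * k;   // one multiply-add = 2 ops

    printf("estimated time ratio, FP16/FP16 acc vs INT8: %.1fx\n",
           (ops / rate_fp16_fp16acc) / (ops / rate_int8));
    printf("estimated time ratio, FP16/FP32 acc vs INT8: %.1fx\n",
           (ops / rate_fp16_fp32acc) / (ops / rate_int8));
    return 0;
}
```

Under those illustrative ratios, the FP16 op with float accumulate comes out 4x slower than INT8: the two factor-of-two steps described above, compounded.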

That sort of process is about the extent of the information I would normally use to judge the relative performance of these different types of TC ops.

With a bit of effort and research, you can make a few more observations, but I’m not sure they meaningfully educate the programmer any further than the relative comparison methodology I outlined here.

The reason that the programmer may be offered both an FP16 32x8 TC op as well as a 16x16 TC op to choose from has to do with their utility for different types of problems and matrix shapes.
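For instance, here is a hypothetical sketch of my own (assuming M is a multiple of 32, K a multiple of 16, one warp launched per 32-row tile, and sm_70 or later) where the 32x8x16 shape tiles a tall, narrow output matrix exactly, whereas 16x16 tiles would compute output columns that don't exist:

```
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes one 32x8 output tile of C = A*B, where A is M x K
// (row-major), B is K x 8 (col-major), and C is M x 8 (row-major).
// With N = 8, the 32x8x16 shape covers C with no wasted columns.
__global__ void tall_skinny_gemm(const __half *a, const __half *b,
                                 float *c, int M, int K) {
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (warpM * 32 >= M) return;

    wmma::fragment<wmma::matrix_a, 32, 8, 16, __half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 32, 8, 16, __half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 32, 8, 16, float> fc;
    wmma::fill_fragment(fc, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(fa, a + warpM * 32 * K + k, K); // 32x16 slab of A
        wmma::load_matrix_sync(fb, b + k, K);                  // 16x8 slab of B
        wmma::mma_sync(fc, fa, fb, fc);
    }
    wmma::store_matrix_sync(c + warpM * 32 * 8, fc, 8, wmma::mem_row_major);
}
```

A kernel built around 16x16x16 tiles would suit squarer matrices instead; per the rule of thumb above, the hardware rate is the same either way, so the shape choice is about fit, not speed.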