How does it compute exactly in Tensor Core?

Robert_Crovella · May 28, 2024, 2:12pm

The TC unit of cc7.0 devices was designed to have that kind of hardware behavior, for at least one of the supported hardware paths. “Why is it that way?” questions can be difficult to answer. You will find with a bit of searching that the performance characteristic of that particular tensor op form varies significantly depending on GPU arch, so it seems clear to me that the GPU designers had different ideas as GPU development and TC development progressed.

The calculation produces four m8n8k4 matrix-matrix multiplies. There is only one correct result for that statement. Beyond that, I don't know of any detailed specifications for TC unit behavior.
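To make the m8n8k4 shape concrete, here is a minimal host-side sketch of what one such multiply-accumulate means mathematically: D (8x8) = A (8x4) * B (4x8) + C (8x8). The function name `mma_m8n8k4_reference` and the type aliases are made up for illustration. It uses plain `float` as a stand-in for the half-precision inputs, and it says nothing about how the hardware distributes fragments across threads or in what order it accumulates and rounds; those details are not specified in this thread.

```cpp
#include <array>

// Shape of one m8n8k4 operation: D (MxN) = A (MxK) * B (KxN) + C (MxN).
constexpr int M = 8, N = 8, K = 4;

// Hypothetical aliases for illustration; float stands in for the fp16 inputs.
using MatA = std::array<std::array<float, K>, M>;
using MatB = std::array<std::array<float, N>, K>;
using MatC = std::array<std::array<float, N>, M>;

// Reference result for a single m8n8k4 multiply-accumulate.
// Assumption: simple left-to-right accumulation in float; the real
// tensor core unit may accumulate and round differently.
MatC mma_m8n8k4_reference(const MatA& a, const MatB& b, const MatC& c) {
    MatC d{};
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = c[i][j];
            for (int k = 0; k < K; ++k) {
                acc += a[i][k] * b[k][j];
            }
            d[i][j] = acc;
        }
    }
    return d;
}
```

A reference like this can serve as a baseline when checking the output of a tensor core kernel, with a tolerance that allows for differences in rounding and accumulation order.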

Here are some questions that may be of interest: 1 2 3
