Can we directly use register value for tensor core calculation?

Robert_Crovella · October 18, 2023, 5:10am

Here are some examples that don’t use shared memory: 1 2 3

Those examples use registers directly as mma operands. Of course, register data has to come from somewhere. So if you want to, you can load data into registers from global, local, or shared memory, and then pass those registers directly to the PTX mma instruction, more or less as the examples I linked indicate.
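To make the register path concrete, here is a minimal sketch (not taken from the linked examples) in which one warp loads its operands from global memory straight into registers and feeds them to an inline PTX `mma.sync`. The m16n8k8 f16 shape, the row-major storage of all three matrices, and the names `pack2`, `mma_from_registers` and `run_mma_check` are my own choices for illustration. The per-thread element indexing follows my reading of the fragment layout tables in the PTX ISA documentation, so check it against the docs for whatever shape you actually use. It should need `-arch=sm_75` or newer.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Pack two halves into one 32-bit register (lo in bits 0-15, hi in bits 16-31).
__device__ uint32_t pack2(__half lo, __half hi) {
    return uint32_t(__half_as_ushort(lo)) | (uint32_t(__half_as_ushort(hi)) << 16);
}

// One warp computes D = A * B. A is 16x8, B is 8x8, D is 16x8,
// all stored row-major in global memory (an assumption of this sketch).
__global__ void mma_from_registers(const __half* A, const __half* B, float* D) {
    int lane  = threadIdx.x % 32;
    int group = lane >> 2;   // "groupID" in the PTX fragment tables
    int tig   = lane & 3;    // "threadID_in_group"

    // Load operands from global memory straight into registers; no shared memory.
    uint32_t a0 = pack2(A[group * 8 + tig * 2],       A[group * 8 + tig * 2 + 1]);
    uint32_t a1 = pack2(A[(group + 8) * 8 + tig * 2], A[(group + 8) * 8 + tig * 2 + 1]);
    uint32_t b0 = pack2(B[(tig * 2) * 8 + group],     B[(tig * 2 + 1) * 8 + group]);
    float c0 = 0.f, c1 = 0.f, c2 = 0.f, c3 = 0.f;
    float d0, d1, d2, d3;

    asm volatile(
        "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
        : "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3)
        : "r"(a0), "r"(a1), "r"(b0), "f"(c0), "f"(c1), "f"(c2), "f"(c3));

    // Each thread holds four accumulator elements; write them back row-major.
    D[group * 8 + tig * 2]           = d0;
    D[group * 8 + tig * 2 + 1]       = d1;
    D[(group + 8) * 8 + tig * 2]     = d2;
    D[(group + 8) * 8 + tig * 2 + 1] = d3;
}

// Host-side check against a plain float reference; returns true on a match.
bool run_mma_check() {
    __half hA[16 * 8], hB[8 * 8];
    float fA[16 * 8], fB[8 * 8], hD[16 * 8];
    // Small integer values, so every product and sum is exact in f16/f32.
    for (int i = 0; i < 16 * 8; ++i) { fA[i] = float(i % 7 - 3); hA[i] = __float2half(fA[i]); }
    for (int i = 0; i < 8 * 8; ++i)  { fB[i] = float(i % 5 - 2); hB[i] = __float2half(fB[i]); }

    __half *dA, *dB; float* dD;
    if (cudaMalloc(&dA, sizeof(hA)) != cudaSuccess) return false;
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dD, sizeof(hD));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);
    mma_from_registers<<<1, 32>>>(dA, dB, dD);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(hD, dD, sizeof(hD), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dA); cudaFree(dB); cudaFree(dD);

    for (int m = 0; ok && m < 16; ++m)
        for (int n = 0; n < 8; ++n) {
            float ref = 0.f;
            for (int k = 0; k < 8; ++k) ref += fA[m * 8 + k] * fB[k * 8 + n];
            if (hD[m * 8 + n] != ref) ok = false;
        }
    return ok;
}

int main() {
    bool ok = run_mma_check();
    printf("%s\n", ok ? "match" : "mismatch");
    return ok ? 0 : 1;
}
```

The point is only that `a0`, `a1` and `b0` are ordinary 32-bit registers: where their contents came from (global here, but it could equally be shared or local) is invisible to the mma instruction.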

wmma is the earliest exposure of tensor core ops, dating from the V100 timeframe. As tensor cores gained more variety, a new instruction format (mma) was added. An example of a difference is given here.
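For contrast, here is a rough sketch of the same kind of operation through the CUDA C++ `nvcuda::wmma` API. It is not the linked example, and it illustrates only one difference I am confident of: wmma fragments are opaque objects that `load_matrix_sync` fills from a memory pointer (global here, shared also works), whereas the PTX mma form takes registers whose per-thread contents you arrange yourself. The 16x16x16 shape, the layouts, and the names `wmma_16x16x16` and `run_wmma_check` are assumptions of this sketch. It should need `-arch=sm_70` or newer.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A * B for 16x16 tiles.
// A is row-major, B is column-major, D is row-major (choices made for this sketch).
__global__ void wmma_16x16x16(const __half* A, const __half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);   // fragment filled from a memory pointer
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}

// Host-side check against a plain float reference; returns true on a match.
bool run_wmma_check() {
    __half hA[256], hB[256];
    float fA[256], fB[256], hD[256];
    // Small integer values, so every product and sum is exact in f16/f32.
    for (int i = 0; i < 256; ++i) {
        fA[i] = float(i % 7 - 3); hA[i] = __float2half(fA[i]);
        fB[i] = float(i % 5 - 2); hB[i] = __float2half(fB[i]);
    }

    __half *dA, *dB; float* dD;
    if (cudaMalloc(&dA, sizeof(hA)) != cudaSuccess) return false;
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dD, sizeof(hD));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);
    wmma_16x16x16<<<1, 32>>>(dA, dB, dD);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(hD, dD, sizeof(hD), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dA); cudaFree(dB); cudaFree(dD);

    for (int m = 0; ok && m < 16; ++m)
        for (int n = 0; n < 16; ++n) {
            float ref = 0.f;
            // B is column-major: element (k, n) lives at fB[n * 16 + k].
            for (int k = 0; k < 16; ++k) ref += fA[m * 16 + k] * fB[n * 16 + k];
            if (hD[m * 16 + n] != ref) ok = false;
        }
    return ok;
}

int main() {
    bool ok = run_wmma_check();
    printf("%s\n", ok ? "match" : "mismatch");
    return ok ? 0 : 1;
}
```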
