Details | Instruction Statistics section | Executed Instruction Mix has the count of all instructions. I have copied the full output below and moved the instructions relevant to your question to the top.
Executed Instructions 2,986,349
FFMA 524,288 Scalar FP32
UTCHMMA 65,536 MMA SRC = FP16, DST = FP32
The grid executes almost 3M instructions.
~17.6% of instructions are scalar FP32 (FFMA)
~2.2% of instructions are MMA instructions (UTCHMMA) with src = FP16 and dst = FP32
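The percentages follow directly from the counts above. A minimal Python sketch of that arithmetic, using only the three numbers quoted in this post (the variable names are my own):

```python
# Counts copied from the Executed Instruction Mix output above.
executed = 2_986_349
counts = {"FFMA": 524_288, "UTCHMMA": 65_536}

# Share of each opcode as a percentage of all executed instructions.
shares = {op: 100.0 * n / executed for op, n in counts.items()}

for op, pct in shares.items():
    print(f"{op}: {pct:.1f}%")
```

Note that these are warp-level instruction counts, so the ratio says nothing about how much math each instruction performs; a single MMA instruction does far more work than a single FFMA.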
To determine the type of MMA, I looked at the GPU Speed of Light Throughput section | Roofline and found the metric sm__ops_path_tensor_op_utchmma_src_fp16_dst_fp32_sparsity_off.sum
Executed Warp-Level Instructions By Basic SASS Opcode
Metric,Current
UIADD3,818208.00
SYNCS,649260.00
FFMA,524288.00
BRA,444559.00
NANOSLEEP,280687.00
F2FP,262144.00
LOP3,132326.00
UIMAD,131854.00
UISETP,131446.00
UTMALDG,98304.00
UMOV,81024.00
MOV,71930.00
UTCHMMA,65536.00
STSM,65536.00
ISETP,52066.00
IADD3,43926.00
LDTM,32768.00
BAR,32768.00
PLOP3,18598.00
UTCBAR,16640.00
UTMASTG,16384.00
UTMACMDFLUSH,16384.00
FENCE,16384.00
USHF,13196.00
LDCU,12364.00
NOP,8310.00
UPRMT,6804.00
IMAD,6164.00
LDS,5632.00
ULOP3,4658.00
R2UR,2540.00
UPLOP3,2224.00
VOTEU,2048.00
LDC,320.00
S2UR,270.00
EXIT,268.00
WARPSYNC,256.00
UGETNEXTWORKID,256.00
PRMT,256.00
PMTRIG,250.00
UCGABAR_ARV,160.00
S2R,160.00
UCGABAR_WAIT,120.00
ACQBULK,100.00
UTCATOMSWS,90.00
STS,80.00
MEMBAR,80.00
LDG,80.00
PREEXIT,20.00
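If you want the share of every opcode rather than just the two above, the table can be processed as CSV. This is only a sketch: it assumes the table is available as text with the same "Metric,Current" header shown above, it embeds just a few rows as sample data, and the function name opcode_shares is hypothetical.

```python
import csv
import io

# A few rows copied from the table above; a real export would contain all of them.
SAMPLE = """Metric,Current
UIADD3,818208.00
SYNCS,649260.00
FFMA,524288.00
UTCHMMA,65536.00
"""

# Executed Instructions total quoted earlier in this post.
TOTAL_EXECUTED = 2_986_349


def opcode_shares(csv_text, total):
    """Return (opcode, count, percent of total) tuples, largest count first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(r["Metric"], float(r["Current"])) for r in reader]
    rows.sort(key=lambda r: r[1], reverse=True)
    return [(op, count, 100.0 * count / total) for op, count in rows]


ranked = opcode_shares(SAMPLE, TOTAL_EXECUTED)
for op, count, pct in ranked:
    print(f"{op:<10}{int(count):>10,}{pct:>7.1f}%")
```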