Why does my kernel only show up in some Roofline charts?

I benchmarked my kernel with ncu on Jetson Thor and got the report. The kernel simply does a matrix multiplication in fp16.

In the Tensor Core Roofline and the Floating point (single precision) Roofline, I can see the point for my kernel. However, in the Floating point (Half precision) Roofline I cannot see it, even though I'm using half for the GEMM. Why is that?

Hi, @quan.luo.101

Can you please provide the report so we can help check why it is not shown? Thanks!

Hi @veraj, thanks for replying. I don't have the exact report anymore (it was deleted), but here's another one. Does that mean that some of the FP32 work is running on CUDA cores?

I changed the file extension to pdf so I could upload it here. You can change it back to ncu-rep and give it a try.

Another question: why do I only have one point this time, for the L2 achieved value? Where is the L1 achieved value?

thor_fp16_acc32.pdf (26.8 MB)

Details | Instruction Statistics section | Executed Instruction Mix has the count of all instructions. I have copied the full output and moved the instructions that answer your question to the top.

Executed Instructions 2,986,349
FFMA                    524,288   Scalar FP32
UTCHMMA                  65,536   MMA SRC = FP16, DST = FP32

The grid executes almost 3M instructions.
~17.6% of instructions are FP32 (FFMA)
~2.2% of instructions are MMA instructions with src = FP16 and dst = FP32 (UTCHMMA)
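
The percentages above follow directly from the counts quoted from the report. A minimal Python sketch of that arithmetic, using only the three numbers listed in this post (the variable and function names are illustrative, not Nsight Compute identifiers):

```python
# Counts copied from the Executed Instruction Mix excerpt in this post.
EXECUTED_INSTRUCTIONS = 2_986_349
FFMA = 524_288      # scalar FP32 fused multiply-add
UTCHMMA = 65_536    # tensor MMA, src = FP16, dst = FP32


def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


ffma_pct = share(FFMA, EXECUTED_INSTRUCTIONS)
utchmma_pct = share(UTCHMMA, EXECUTED_INSTRUCTIONS)

print(f"FFMA:    {ffma_pct:.1f}% of executed instructions")
print(f"UTCHMMA: {utchmma_pct:.1f}% of executed instructions")
```

Instruction share is not the same as share of math work, since one MMA instruction performs far more operations than one FFMA.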

To determine the type of MMA, I looked at GPU Speed of Light Throughput section | Roofline and found sm__ops_path_tensor_op_utchmma_src_fp16_dst_fp32_sparsity_off.sum

Executed Warp-Level Instructions By Basic SASS Opcode
Metric,Current
UIADD3,818208.00
SYNCS,649260.00
FFMA,524288.00
BRA,444559.00
NANOSLEEP,280687.00
F2FP,262144.00
LOP3,132326.00
UIMAD,131854.00
UISETP,131446.00
UTMALDG,98304.00
UMOV,81024.00
MOV,71930.00
UTCHMMA,65536.00
STSM,65536.00
ISETP,52066.00
IADD3,43926.00
LDTM,32768.00
BAR,32768.00
PLOP3,18598.00
UTCBAR,16640.00
UTMASTG,16384.00
UTMACMDFLUSH,16384.00
FENCE,16384.00
USHF,13196.00
LDCU,12364.00
NOP,8310.00
UPRMT,6804.00
IMAD,6164.00
LDS,5632.00
ULOP3,4658.00
R2UR,2540.00
UPLOP3,2224.00
VOTEU,2048.00
LDC,320.00
S2UR,270.00
EXIT,268.00
WARPSYNC,256.00
UGETNEXTWORKID,256.00
PRMT,256.00
PMTRIG,250.00
UCGABAR_ARV,160.00
S2R,160.00
UCGABAR_WAIT,120.00
ACQBULK,100.00
UTCATOMSWS,90.00
STS,80.00
MEMBAR,80.00
LDG,80.00
PREEXIT,20.00
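
To pull specific opcodes out of a listing like the one above, the copied text can be parsed as CSV. This is a hedged sketch that uses a few rows from this post as sample input; `parse_opcode_counts` is a hypothetical helper, not part of Nsight Compute:

```python
import csv
import io

# A few rows copied from the listing above; paste the full text to analyze all of it.
LISTING = """Metric,Current
UIADD3,818208.00
SYNCS,649260.00
FFMA,524288.00
F2FP,262144.00
UTCHMMA,65536.00
"""


def parse_opcode_counts(text: str) -> dict[str, int]:
    """Parse 'Metric,Current' CSV text into {opcode: count}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Metric"]: int(float(row["Current"])) for row in reader}


counts = parse_opcode_counts(LISTING)

# Opcodes relevant to the question: scalar FP32 math vs. tensor MMA.
for opcode in ("FFMA", "UTCHMMA"):
    print(f"{opcode}: {counts[opcode]:,}")
```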
