Hi, I’m experimenting with matrix multiplication on H100 using WMMA-only FP16 tensor core operations (hgemm, using only the half type) and I reach only ~50% of cuBLAS performance with the following optimizations:
fragment tiling (up to 8x8 fragments per warp in an outer-product pattern, 4 warps total; see the sketch after this list)
block tiling (up to 16 x k wmma tiles of 16x16 from A and B are loaded, feeding a large area of 16x16 tiles for the C block)
efficient memory access for input and output, but no TMA
async loading of the A and B tiles (no bank conflicts in shared memory, and all global accesses are coalesced & vectorized)
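To make the fragment tiling concrete, the per-warp compute step looks roughly like this (a simplified sketch: the 4x4 fragment tile and the leading dimensions are illustrative placeholders; the real kernel goes up to 8x8 fragments per warp):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

constexpr int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;
constexpr int WARP_TILES_M = 4, WARP_TILES_N = 4;  // fragments per warp; placeholder values

__device__ void warp_mma_step(
    const half* smemA, const half* smemB, int lda, int ldb,
    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, half> (&acc)[WARP_TILES_M][WARP_TILES_N])
{
    wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> a[WARP_TILES_M];
    wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> b[WARP_TILES_N];

    // Each A/B fragment is loaded from shared memory once per k-step ...
    for (int i = 0; i < WARP_TILES_M; ++i)
        wmma::load_matrix_sync(a[i], smemA + i * WMMA_M * lda, lda);
    for (int j = 0; j < WARP_TILES_N; ++j)
        wmma::load_matrix_sync(b[j], smemB + j * WMMA_N, ldb);

    // ... and reused across the whole WARP_TILES_M x WARP_TILES_N accumulator tile (outer product).
    for (int i = 0; i < WARP_TILES_M; ++i)
        for (int j = 0; j < WARP_TILES_N; ++j)
            wmma::mma_sync(acc[i][j], a[i], b[j], acc[i][j]);
}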
It reaches only ~50% of cuBLAS for a 4096x4096 square multiplication, 65% at 2048x2048, and 105% at 1024x1024.
Is it wrong to continue with wmma instructions, or are larger matrices only meant to be computed with wgmma instructions? In other words, is H100 optimized for wgmma rather than wmma?
I tested only the 16x16x16, 32x8x16, and 8x32x16 shapes. Some shapes are better than others for loading the A and B tiles, but they don’t make much difference; it stays around 45-50% of cuBLAS for large matrices.
Would using thread block clusters (2-16 blocks per cluster) to multicast/broadcast tiles between SMs be enough to reach 75-100% of cuBLAS? I haven’t tested clusters yet. I’m also loading and storing tiles as single large chunks rather than many smaller transactions. Would it help to convert this to TMA?
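If clusters are the way to go, I assume the launch side would look roughly like this (an untested sketch; hgemm_kernel, the grid/block dimensions, and the cluster size of 2 are placeholders):

#include <cuda_runtime.h>
#include <cuda_fp16.h>

__global__ void hgemm_kernel(const half* A, const half* B, half* C, int N);  // placeholder signature

void launch_hgemm(const half* A, const half* B, half* C, int N, dim3 grid, dim3 block)
{
    cudaLaunchConfig_t config = {};
    config.gridDim  = grid;
    config.blockDim = block;

    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 2;   // blocks per cluster; 8 is the portable maximum, 16 needs a non-portable opt-in
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    config.attrs    = &attr;
    config.numAttrs = 1;

    // Blocks within the cluster could then share A/B tiles through distributed shared memory
    // (cooperative_groups::this_cluster().map_shared_rank(...)).
    cudaLaunchKernelEx(&config, hgemm_kernel, A, B, C, N);
}

For reference, here is the full Nsight Compute output for the kernel: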
Section: GPU Speed Of Light Throughput
----------------------- ----------- ------------
Metric Name Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency Ghz 2.62
SM Frequency Ghz 1.44
Elapsed Cycles cycle 596293
Memory Throughput % 65.03
DRAM Throughput % 8.76
Duration us 413.70
L1/TEX Cache Throughput % 37.42
L2 Cache Throughput % 65.03
SM Active Cycles cycle 586565.58
Compute (SM) Throughput % 63.98
----------------------- ----------- ------------
INF Compute and Memory are well-balanced: To reduce runtime, both computation and memory traffic must be reduced.
Check both the Compute Workload Analysis and Memory Workload Analysis sections.
Section: GPU Speed Of Light Roofline Chart
INF The ratio of peak float (fp32) to double (fp64) performance on this device is 2:1. The workload achieved 0%
of this device's fp32 peak performance and 0% of its fp64 peak performance. See the Kernel Profiling Guide
(https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#roofline) for more details on roofline analysis.
Section: PM Sampling
------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------- ----------- ------------
Maximum Buffer Size Mbyte 50.33
Dropped Samples sample 0
Maximum Sampling Interval us 1.50
# Pass Groups 2
------------------------- ----------- ------------
Section: Compute Workload Analysis
-------------------- ----------- ------------
Metric Name Metric Unit Metric Value
-------------------- ----------- ------------
Executed Ipc Active inst/cycle 1.81
Executed Ipc Elapsed inst/cycle 1.78
Issue Slots Busy % 45.25
Issued Ipc Active inst/cycle 1.81
SM Busy % 65.01
-------------------- ----------- ------------
OPT Shared is the highest-utilized pipeline (65.0%) based on active cycles, taking into account the rates of its
different instructions. It is the logical sum of several other pipelines which can't achieve full
utilization on their own. It executes 64-bit floating point and tensor operations. It's dominated by its
Tensor (FP) sub-pipeline. The pipeline is well-utilized, but might become a bottleneck if more work is
added. Based on the number of executed instructions, the highest utilized pipeline (28.5%) is ADU. See the
Kernel Profiling Guide (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-decoder) or
hover over the pipeline name to understand the workloads handled by each pipeline. The Instruction
Statistics section shows the mix of executed instructions for this workload. Check the Warp State Statistics
section for which reasons cause warps to stall.
Section: Memory Workload Analysis
---------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------- ----------- ------------
Memory Throughput Gbyte/s 293.65
Mem Busy % 65.03
Max Bandwidth % 60.38
L1/TEX Hit Rate % 0
L2 Compression Success Rate % 0
L2 Compression Ratio % 0
L2 Compression Input Sectors 1051813
L2 Hit Rate % 92.40
Mem Pipes Busy % 28.07
---------------------------- ----------- ------------
Section: Memory Workload Analysis Chart
OPT Est. Speedup: 7.231%
Out of the 33658016.0 bytes sent to the L2 Compression unit only 0.00% were successfully compressed. To
increase this success rate, consider marking only those memory regions as compressible that contain the most
zero values and/or expose the most homogeneous values.
Section: Scheduler Statistics
---------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------- ----------- ------------
One or More Eligible % 45.27
Issued Warp Per Scheduler 0.45
No Eligible % 54.73
Active Warps Per Scheduler warp 1.94
Eligible Warps Per Scheduler warp 0.56
---------------------------- ----------- ------------
OPT Est. Local Speedup: 34.97%
Every scheduler is capable of issuing one instruction per cycle, but for this workload each scheduler only
issues an instruction every 2.2 cycles. This might leave hardware resources underutilized and may lead to
less optimal performance. Out of the maximum of 16 warps per scheduler, this workload allocates an average
of 1.94 active warps per scheduler, but only an average of 0.56 warps were eligible per cycle. Eligible
warps are the subset of active warps that are ready to issue their next instruction. Every cycle with no
eligible warp results in no instruction being issued and the issue slot remains unused. To increase the
number of eligible warps, avoid possible load imbalances due to highly different execution durations per
warp. Reducing stalls indicated on the Warp State Statistics and Source Counters sections can help, too.
Section: Warp State Statistics
---------------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------------------- ----------- ------------
Warp Cycles Per Issued Instruction cycle 4.29
Warp Cycles Per Executed Instruction cycle 4.29
Avg. Active Threads Per Warp 32
Avg. Not Predicated Off Threads Per Warp 28.15
---------------------------------------- ----------- ------------
OPT Est. Speedup: 34.97%
On average, each warp of this workload spends 1.9 cycles being stalled waiting on a fixed latency execution
dependency. Typically, this stall reason should be very low and only shows up as a top contributor in
already highly optimized kernels. Try to hide the corresponding instruction latencies by increasing the
number of active warps, restructuring the code or unrolling loops. Furthermore, consider switching to
lower-latency instructions, e.g. by making use of fast math compiler options. This stall type represents
about 45.2% of the total average of 4.3 cycles between issuing two instructions.
----- --------------------------------------------------------------------------------------------------------------
INF Check the Warp Stall Sampling (All Samples) table for the top stall locations in your source based on
sampling data. The Kernel Profiling Guide
(https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-reference) provides more details
on each stall reason.
Section: Instruction Statistics
---------------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------------------- ----------- ------------
Avg. Executed Instructions Per Scheduler inst 265390.55
Executed Instructions inst 140126208
Avg. Issued Instructions Per Scheduler inst 265398.55
Issued Instructions inst 140130432
---------------------------------------- ----------- ------------
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 128
Cluster Scheduling Policy PolicySpread
Cluster Size 0
Function Cache Configuration CachePreferNone
Grid Size 512
Registers Per Thread register/thread 221
Shared Memory Configuration Size Kbyte 167.94
Driver Shared Memory Per Block Kbyte/block 1.02
Dynamic Shared Memory Per Block Kbyte/block 67.58
Static Shared Memory Per Block byte/block 0
# SMs SM 132
Stack Size 1024
Threads thread 65536
# TPCs 66
Enabled TPC IDs all
Uses Green Context 0
Waves Per SM 1.94
-------------------------------- --------------- ---------------
OPT Est. Speedup: 50%
A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the
target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical
occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 248 thread blocks.
Under the assumption of a uniform execution duration of all thread blocks, this partial wave may account for
up to 50.0% of the total runtime of this kernel. Try launching a grid with no partial wave. The overall
impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware
Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for
more details on launch configurations.
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Max Active Clusters cluster 0
Max Cluster Size block 8
Overall GPU Occupancy % 0
Cluster Occupancy % 0
Block Limit Barriers block 32
Block Limit SM block 32
Block Limit Registers block 2
Block Limit Shared Mem block 2
Block Limit Warps block 16
Theoretical Active Warps per SM warp 8
Theoretical Occupancy % 12.50
Achieved Occupancy % 12.12
Achieved Active Warps Per SM warp 7.76
------------------------------- ----------- ------------
OPT Est. Speedup: 34.97%
The 2.00 theoretical warps per scheduler this kernel can issue according to its occupancy are below the
hardware maximum of 16. This kernel's theoretical occupancy (12.5%) is limited by the number of required
registers, and the required amount of shared memory.
Section: GPU and Memory Workload Distribution
-------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
-------------------------- ----------- ------------
Average DRAM Active Cycles cycle 94908.20
Total DRAM Elapsed Cycles cycle 43335168
Average L1 Active Cycles cycle 586565.58
Total L1 Elapsed Cycles cycle 78667146
Average L2 Active Cycles cycle 700231.46
Total L2 Elapsed Cycles cycle 56541440
Average SM Active Cycles cycle 586565.58
Total SM Elapsed Cycles cycle 78667146
Average SMSP Active Cycles cycle 586287.58
Total SMSP Elapsed Cycles cycle 314668584
-------------------------- ----------- ------------
Section: Source Counters
------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------- ----------- ------------
Branch Instructions Ratio % 0.00
Branch Instructions inst 667648
Branch Efficiency % 100
Avg. Divergent Branches 0
------------------------- ----------- ------------
OPT Est. Speedup: 32.57%
This kernel has uncoalesced global accesses resulting in a total of 25148272 excessive sectors (33% of the
total 76491184 sectors). Check the L2 Theoretical Sectors Global Excessive table for the primary source
locations. The CUDA Programming Guide
(https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses) has additional
information on reducing uncoalesced device memory accesses.
----- --------------------------------------------------------------------------------------------------------------
OPT Est. Speedup: 39.03%
This kernel has uncoalesced shared accesses resulting in a total of 25144152 excessive wavefronts (40% of the
total 63402984 wavefronts). Check the L1 Wavefronts Shared Excessive table for the primary source locations.
The CUDA Best Practices Guide
(https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#shared-memory-in-matrix-multiplication-c
-ab) has an example on optimizing shared memory accesses.
On average, each warp of this workload spends 1.9 cycles being stalled waiting on a fixed latency execution
dependency.
That could be the wmma instructions themselves not being fast enough; if you output the source locations of those stalls, you could confirm it.
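For example (binary, source, and report names are placeholders), rebuild with line info and re-profile with the full section set:

nvcc -O3 -arch=sm_90 -lineinfo -o hgemm hgemm.cu
ncu --set full -o hgemm_report ./hgemm

Then open the Source page of the report and sort by the Warp Stall Sampling (All Samples) column to see which lines the fixed-latency stalls accumulate on.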
Also, your L2 cache is at 65% utilization. You could use better coalescing and keep more data in L1 (currently unused), shared memory, or registers; otherwise L2 will become the next bottleneck.
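One concrete lead: the Source Counters section reports ~40% excessive shared-memory wavefronts, so the smem layout may not be as conflict-free as you think. A common fix is to skew/pad the tile rows, roughly like this (BLOCK_M, BLOCK_K, and the 16-half skew are placeholder values, not taken from your kernel):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

constexpr int BLOCK_M = 128, BLOCK_K = 64;  // placeholder block tile sizes
constexpr int SKEW    = 16;                 // 16 halves = 32 bytes of padding per row

__global__ void skewed_smem_sketch()
{
    // The skewed leading dimension shifts consecutive rows into different banks,
    // which reduces bank conflicts when fragments are loaded row-wise.
    __shared__ half smemA[BLOCK_M][BLOCK_K + SKEW];

    // ... fill smemA with the same async / vectorized copies you already use, then:
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::load_matrix_sync(a_frag, &smemA[0][0], BLOCK_K + SKEW);  // note the skewed leading dimension
}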