WMMA vs WGMMA on the H100 GPU

Hi, I'm experimenting with matrix multiplication on an H100 using WMMA-only FP16 tensor core operations (hgemm, using only the half type), and I've reached only ~50% of cuBLAS performance with the following optimizations:

  • fragment tiling (up to 8 x 8 fragments per warp in an outer-product pattern, 4 warps in total; see the sketch after this list)
  • block tiling (up to 16 x k of the 16x16 wmma tiles from A and B are loaded, accumulating into a large 16 x 16 area of tiles for the C tile)
  • efficient memory access for input and output (but no TMA)
  • async loading of the A and B rows (no bank conflicts in smem, and all global accesses are coalesced & vectorized)
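
Roughly, the per-warp inner loop looks like this (a minimal sketch with illustrative tile counts and placeholder pointer arithmetic, not my exact kernel):

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Illustrative per-warp fragment-tile counts (my kernel varies these).
    constexpr int WM = 4;  // wmma tiles held per warp along M
    constexpr int WN = 4;  // wmma tiles held per warp along N

    __device__ void warp_outer_product(
        const half* smemA, const half* smemB, int lda, int ldb,
        wmma::fragment<wmma::accumulator, 16, 16, 16, half> (&acc)[WM][WN])
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a[WM];
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b[WN];

        // Load a column of A fragments and a row of B fragments from shared memory.
        for (int i = 0; i < WM; ++i)
            wmma::load_matrix_sync(a[i], smemA + i * 16 * lda, lda);
        for (int j = 0; j < WN; ++j)
            wmma::load_matrix_sync(b[j], smemB + j * 16, ldb);

        // Outer product: every A fragment multiplies every B fragment.
        for (int i = 0; i < WM; ++i)
            for (int j = 0; j < WN; ++j)
                wmma::mma_sync(acc[i][j], a[i], b[j], acc[i][j]);
    }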

It reaches only ~50% of cuBLAS for a 4096x4096 square multiplication, 65% at 2048x2048, and 105% at 1024x1024.

Is it wrong to continue with wmma instructions? Or are larger matrices only meant to be computed with wgmma instructions? In other words, is the H100 optimized for wgmma or wmma?

I tested only the 16x16x16, 32x8x16, and 8x32x16 shapes. Some shapes are better than others for loading the A and B tiles, but they don't make much difference; performance stays around 45-50% of cuBLAS for large matrices.
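
For reference, switching between these shapes only changes the fragment template parameters on the wmma side (the tile pointer math changes accordingly); a trivial sketch:

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    __device__ void shape_variants()
    {
        // The three wmma shapes (M x N x K) available for half inputs:
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_16x16x16;
        wmma::fragment<wmma::matrix_a, 32,  8, 16, half, wmma::row_major> a_32x8x16;
        wmma::fragment<wmma::matrix_a,  8, 32, 16, half, wmma::row_major> a_8x32x16;
        // matrix_b and accumulator fragments take the same M, N, K values.
    }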

Would using thread-block clusters (2-16 blocks per cluster) to multicast or broadcast between SMs be enough to reach 75-100% of cuBLAS? I haven't tested clusters yet. I'm currently loading and storing tiles as single large chunks instead of many smaller transactions. Would converting this to TMA help?
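
If I do try clusters, the host-side launch would look roughly like this (a sketch only and untested; `hgemm_kernel`, its arguments, and `smem_bytes` are placeholders for my kernel):

    #include <cuda_runtime.h>
    #include <cuda_fp16.h>

    void launch_with_cluster(void (*hgemm_kernel)(const half*, const half*, half*, int),
                             const half* dA, const half* dB, half* dC, int n,
                             size_t smem_bytes)
    {
        cudaLaunchConfig_t config = {};
        config.gridDim          = dim3(512);   // same grid as my current launch
        config.blockDim         = dim3(128);   // 4 warps per block
        config.dynamicSmemBytes = smem_bytes;

        cudaLaunchAttribute attr[1];
        attr[0].id               = cudaLaunchAttributeClusterDimension;
        attr[0].val.clusterDim.x = 2;          // 2 blocks per cluster
        attr[0].val.clusterDim.y = 1;
        attr[0].val.clusterDim.z = 1;
        config.attrs    = attr;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, hgemm_kernel, dA, dB, dC, n);
    }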

Please also use Nsight Compute to see whether you have a memory or instruction bottleneck.

    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         2.62
    SM Frequency                    Ghz         1.44
    Elapsed Cycles                cycle       596293
    Memory Throughput                 %        65.03
    DRAM Throughput                   %         8.76
    Duration                         us       413.70
    L1/TEX Cache Throughput           %        37.42
    L2 Cache Throughput               %        65.03
    SM Active Cycles              cycle    586565.58
    Compute (SM) Throughput           %        63.98
    ----------------------- ----------- ------------

    INF   Compute and Memory are well-balanced: To reduce runtime, both computation and memory traffic must be reduced.
          Check both the Compute Workload Analysis and Memory Workload Analysis sections.

    Section: GPU Speed Of Light Roofline Chart
    INF   The ratio of peak float (fp32) to double (fp64) performance on this device is 2:1. The workload achieved 0%
          of this device's fp32 peak performance and 0% of its fp64 peak performance. See the Kernel Profiling Guide
          (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#roofline) for more details on roofline analysis.

    Section: PM Sampling
    ------------------------- ----------- ------------
    Metric Name               Metric Unit Metric Value
    ------------------------- ----------- ------------
    Maximum Buffer Size             Mbyte        50.33
    Dropped Samples                sample            0
    Maximum Sampling Interval          us         1.50
    # Pass Groups                                    2
    ------------------------- ----------- ------------

    Section: Compute Workload Analysis
    -------------------- ----------- ------------
    Metric Name          Metric Unit Metric Value
    -------------------- ----------- ------------
    Executed Ipc Active   inst/cycle         1.81
    Executed Ipc Elapsed  inst/cycle         1.78
    Issue Slots Busy               %        45.25
    Issued Ipc Active     inst/cycle         1.81
    SM Busy                        %        65.01
    -------------------- ----------- ------------

    OPT   Shared is the highest-utilized pipeline (65.0%) based on active cycles, taking into account the rates of its
          different instructions. It is the logical sum of several other pipelines which can't achieve full
          utilization on their own. It executes 64-bit floating point and tensor operations. It's dominated by its
          Tensor (FP) sub-pipeline. The pipeline is well-utilized, but might become a bottleneck if more work is
          added. Based on the number of executed instructions, the highest utilized pipeline (28.5%) is ADU. See the
          Kernel Profiling Guide (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-decoder) or
          hover over the pipeline name to understand the workloads handled by each pipeline. The Instruction
          Statistics section shows the mix of executed instructions for this workload. Check the Warp State Statistics
          section for which reasons cause warps to stall.

    Section: Memory Workload Analysis
    ---------------------------- ----------- ------------
    Metric Name                  Metric Unit Metric Value
    ---------------------------- ----------- ------------
    Memory Throughput                Gbyte/s       293.65
    Mem Busy                               %        65.03
    Max Bandwidth                          %        60.38
    L1/TEX Hit Rate                        %            0
    L2 Compression Success Rate            %            0
    L2 Compression Ratio                   %            0
    L2 Compression Input Sectors                  1051813
    L2 Hit Rate                            %        92.40
    Mem Pipes Busy                         %        28.07
    ---------------------------- ----------- ------------

    Section: Memory Workload Analysis Chart
    OPT   Est. Speedup: 7.231%
          Out of the 33658016.0 bytes sent to the L2 Compression unit only 0.00% were successfully compressed. To
          increase this success rate, consider marking only those memory regions as compressible that contain the most
          zero values and/or expose the most homogeneous values.

    Section: Scheduler Statistics
    ---------------------------- ----------- ------------
    Metric Name                  Metric Unit Metric Value
    ---------------------------- ----------- ------------
    One or More Eligible                   %        45.27
    Issued Warp Per Scheduler                        0.45
    No Eligible                            %        54.73
    Active Warps Per Scheduler          warp         1.94
    Eligible Warps Per Scheduler        warp         0.56
    ---------------------------- ----------- ------------

    OPT   Est. Local Speedup: 34.97%
          Every scheduler is capable of issuing one instruction per cycle, but for this workload each scheduler only
          issues an instruction every 2.2 cycles. This might leave hardware resources underutilized and may lead to
          less optimal performance. Out of the maximum of 16 warps per scheduler, this workload allocates an average
          of 1.94 active warps per scheduler, but only an average of 0.56 warps were eligible per cycle. Eligible
          warps are the subset of active warps that are ready to issue their next instruction. Every cycle with no
          eligible warp results in no instruction being issued and the issue slot remains unused. To increase the
          number of eligible warps, avoid possible load imbalances due to highly different execution durations per
          warp. Reducing stalls indicated on the Warp State Statistics and Source Counters sections can help, too.

    Section: Warp State Statistics
    ---------------------------------------- ----------- ------------
    Metric Name                              Metric Unit Metric Value
    ---------------------------------------- ----------- ------------
    Warp Cycles Per Issued Instruction             cycle         4.29
    Warp Cycles Per Executed Instruction           cycle         4.29
    Avg. Active Threads Per Warp                                   32
    Avg. Not Predicated Off Threads Per Warp                    28.15
    ---------------------------------------- ----------- ------------

    OPT   Est. Speedup: 34.97%
          On average, each warp of this workload spends 1.9 cycles being stalled waiting on a fixed latency execution
          dependency. Typically, this stall reason should be very low and only shows up as a top contributor in
          already highly optimized kernels. Try to hide the corresponding instruction latencies by increasing the
          number of active warps, restructuring the code or unrolling loops. Furthermore, consider switching to
          lower-latency instructions, e.g. by making use of fast math compiler options. This stall type represents
          about 45.2% of the total average of 4.3 cycles between issuing two instructions.
    ----- --------------------------------------------------------------------------------------------------------------
    INF   Check the Warp Stall Sampling (All Samples) table for the top stall locations in your source based on
          sampling data. The Kernel Profiling Guide
          (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-reference) provides more details
          on each stall reason.

    Section: Instruction Statistics
    ---------------------------------------- ----------- ------------
    Metric Name                              Metric Unit Metric Value
    ---------------------------------------- ----------- ------------
    Avg. Executed Instructions Per Scheduler        inst    265390.55
    Executed Instructions                           inst    140126208
    Avg. Issued Instructions Per Scheduler          inst    265398.55
    Issued Instructions                             inst    140130432
    ---------------------------------------- ----------- ------------

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                   128
    Cluster Scheduling Policy                           PolicySpread
    Cluster Size                                                   0
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                    512
    Registers Per Thread             register/thread             221
    Shared Memory Configuration Size           Kbyte          167.94
    Driver Shared Memory Per Block       Kbyte/block            1.02
    Dynamic Shared Memory Per Block      Kbyte/block           67.58
    Static Shared Memory Per Block        byte/block               0
    # SMs                                         SM             132
    Stack Size                                                  1024
    Threads                                   thread           65536
    # TPCs                                                        66
    Enabled TPC IDs                                              all
    Uses Green Context                                             0
    Waves Per SM                                                1.94
    -------------------------------- --------------- ---------------

    OPT   Est. Speedup: 50%
          A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the
          target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical
          occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 248 thread blocks.
          Under the assumption of a uniform execution duration of all thread blocks, this partial wave may account for
          up to 50.0% of the total runtime of this kernel. Try launching a grid with no partial wave. The overall
          impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware
          Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for
          more details on launch configurations.

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Max Active Clusters                 cluster            0
    Max Cluster Size                      block            8
    Overall GPU Occupancy                     %            0
    Cluster Occupancy                         %            0
    Block Limit Barriers                  block           32
    Block Limit SM                        block           32
    Block Limit Registers                 block            2
    Block Limit Shared Mem                block            2
    Block Limit Warps                     block           16
    Theoretical Active Warps per SM        warp            8
    Theoretical Occupancy                     %        12.50
    Achieved Occupancy                        %        12.12
    Achieved Active Warps Per SM           warp         7.76
    ------------------------------- ----------- ------------

    OPT   Est. Speedup: 34.97%
          The 2.00 theoretical warps per scheduler this kernel can issue according to its occupancy are below the
          hardware maximum of 16. This kernel's theoretical occupancy (12.5%) is limited by the number of required
          registers, and the required amount of shared memory.

    Section: GPU and Memory Workload Distribution
    -------------------------- ----------- ------------
    Metric Name                Metric Unit Metric Value
    -------------------------- ----------- ------------
    Average DRAM Active Cycles       cycle     94908.20
    Total DRAM Elapsed Cycles        cycle     43335168
    Average L1 Active Cycles         cycle    586565.58
    Total L1 Elapsed Cycles          cycle     78667146
    Average L2 Active Cycles         cycle    700231.46
    Total L2 Elapsed Cycles          cycle     56541440
    Average SM Active Cycles         cycle    586565.58
    Total SM Elapsed Cycles          cycle     78667146
    Average SMSP Active Cycles       cycle    586287.58
    Total SMSP Elapsed Cycles        cycle    314668584
    -------------------------- ----------- ------------

    Section: Source Counters
    ------------------------- ----------- ------------
    Metric Name               Metric Unit Metric Value
    ------------------------- ----------- ------------
    Branch Instructions Ratio           %         0.00
    Branch Instructions              inst       667648
    Branch Efficiency                   %          100
    Avg. Divergent Branches                          0
    ------------------------- ----------- ------------

    OPT   Est. Speedup: 32.57%
          This kernel has uncoalesced global accesses resulting in a total of 25148272 excessive sectors (33% of the
          total 76491184 sectors). Check the L2 Theoretical Sectors Global Excessive table for the primary source
          locations. The CUDA Programming Guide
          (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses) has additional
          information on reducing uncoalesced device memory accesses.
    ----- --------------------------------------------------------------------------------------------------------------
    OPT   Est. Speedup: 39.03%
          This kernel has uncoalesced shared accesses resulting in a total of 25144152 excessive wavefronts (40% of the
          total 63402984 wavefronts). Check the L1 Wavefronts Shared Excessive table for the primary source locations.
          The CUDA Best Practices Guide
           (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#shared-memory-in-matrix-multiplication-c
          -ab) has an example on optimizing shared memory accesses.

I used the following command:

sudo /usr/local/cuda/bin/ncu --set full --section SpeedOfLight --section WarpStateStats --section MemoryWorkloadAnalysis --section LaunchStats --section InstructionStats ./main 4096

It’s at 50% of cuBLAS performance for the 4096x4096 multiplication.

I think I broke coalescing when I switched the vectorization from half2 to float4, because half of each warp now lands on the next row of data.
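
For comparison, what I want is each warp's float4 loads to cover one contiguous, 16-byte-aligned span (a minimal sketch; `gmem`, `ld`, `row`, `col` are placeholder names):

    #include <cuda_fp16.h>

    __device__ float4 load_row_vectorized(const half* gmem, int ld, int row, int col)
    {
        // A warp of float4 loads covers 32 * 16 B = one 512-byte span. If the base
        // address is 16-byte aligned and the span stays within one row, the warp
        // touches the minimum number of 32-byte sectors; a misaligned base or
        // per-lane offsets that jump to another row pull in extra sectors, which
        // Nsight reports as uncoalesced/excessive.
        const float4* src = reinterpret_cast<const float4*>(gmem + row * ld + col);
        return src[threadIdx.x % 32];
    }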

    On average, each warp of this workload spends 1.9 cycles being stalled waiting on a fixed latency execution dependency.

That could be the wmma instruction itself not being fast enough. If you output the stall locations, you could confirm it.

Also, your L2 cache is 65% occupied. You could use better coalescing and keep more data in L1 (currently unused), shared memory, or registers; otherwise it will become the next bottleneck.
