Why is the occupancy of GEMM 12.5%?


I am checking the occupancy of GEMM, and it is always 12.5% (V100s, profiling with nsys). What does that mean? We know that GEMM is very well optimized, so what is happening with the remaining 87.5%? I just want to understand it.

I would normally use ncu for occupancy, not nsys.

Anyway, CUBLAS gemm kernels tend to be fairly high on register usage, which will often knock their theoretical occupancy down to 50% or 25% or so. ncu will tell you this directly. You may also be launching a gemm that is too small to reach even that 50% or 25% figure in achieved occupancy.

Here’s a portion of ncu output from running a cublasSgemm call on matrices of size 1024x1024 on a V100:

  volta_sgemm_32x128_nn, 2022-Dec-20 12:17:49, Context 1, Stream 7
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         864.36
    SM Frequency                                                             cycle/nsecond                           1.21
    Elapsed Cycles                                                                   cycle                        257,673
    Memory [%]                                                                           %                          76.43
    DRAM Throughput                                                                      %                          12.28
    Duration                                                                       usecond                         212.45
    L1/TEX Cache Throughput                                                              %                          79.45
    L2 Cache Throughput                                                                  %                          46.95
    SM Active Cycles                                                                 cycle                     247,750.48
    Compute (SM) [%]                                                                     %                          82.69
    ---------------------------------------------------------------------- --------------- ------------------------------
    INF   The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
          further improve performance, work will likely need to be shifted from the most utilized to another unit.
          Start by analyzing workloads in the Compute Workload Analysis section.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                        256
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                         768
    Registers Per Thread                                                   register/thread                             57
    Shared Memory Configuration Size                                                 Kbyte                          65.54
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                             Kbyte/block                          16.38
    Threads                                                                         thread                        196,608
    Waves Per SM                                                                                                     2.40
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                              4
    Block Limit Shared Mem                                                           block                              6
    Block Limit Warps                                                                block                              8
    Theoretical Active Warps per SM                                                   warp                             32
    Theoretical Occupancy                                                                %                             50
    Achieved Occupancy                                                                   %                          44.75
    Achieved Active Warps Per SM                                                      warp                          28.64
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (50.0%) is limited by the number of required registers
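
As a rough cross-check of the Occupancy and Launch Statistics sections above, here is a minimal sketch that redoes the arithmetic from the printed block limits. The block limits, block size, and grid size are copied from the ncu output; the 64-warp (2048-thread) per-SM maximum and the 80-SM count for a V100 are not printed there and are assumptions stated in the code. The function name is made up for illustration.

```python
# Values copied from the ncu output above (volta_sgemm_32x128_nn).
BLOCK_SIZE = 256   # threads per block
GRID_SIZE = 768    # blocks in the launch
BLOCK_LIMITS = {"sm": 32, "registers": 4, "shared_mem": 6, "warps": 8}

# Assumed V100 hardware constants (not part of the ncu output).
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64   # 2048 threads per SM / 32 threads per warp
NUM_SMS = 80


def theoretical_occupancy(block_limits, block_size):
    """Return (limiting resource, active warps per SM, occupancy in %)."""
    # The smallest per-resource block limit decides how many blocks fit on an SM.
    limiter = min(block_limits, key=block_limits.get)
    blocks_per_sm = block_limits[limiter]
    warps_per_block = block_size // WARP_SIZE
    active_warps = blocks_per_sm * warps_per_block
    return limiter, active_warps, 100.0 * active_warps / MAX_WARPS_PER_SM


limiter, active_warps, occupancy_pct = theoretical_occupancy(BLOCK_LIMITS, BLOCK_SIZE)

# One "wave" is the number of blocks the whole GPU can hold at once.
waves_per_sm = GRID_SIZE / (NUM_SMS * BLOCK_LIMITS[limiter])

print(f"limiter={limiter}, warps/SM={active_warps}, "
      f"occupancy={occupancy_pct:.1f}%, waves={waves_per_sm:.2f}")
```

Under those assumptions the numbers line up with what ncu reports: registers are the limiter at 4 blocks per SM, giving 32 warps and 50% theoretical occupancy, and 768 blocks over 80 SMs works out to 2.40 waves.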

Thank you. Actually, I do not have enough permissions (sudo) to run ncu.

Here I am running GEMM on 10Kx10K matrices.

But for small sizes it is also 12.5%.

Registers per thread: 234
That is the proximal limiter to occupancy. A Volta SM has 65536 registers, so at 234 registers/thread you can support at most 280 threads on an SM. Threads are resident in whole warps of 32, so the practical/efficient limit is 8 warps, or 256 threads per SM, and compared to a maximum of 2048 threads per SM, that is 12.5%.
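
The same arithmetic as a minimal sketch, in case it helps to plug in other register counts. It uses only the figures quoted above (65536 registers and 2048 threads per Volta SM) plus the 32-thread warp size. It is deliberately simplified: it ignores register allocation granularity and the kernel's block size, both of which the hardware and ncu take into account, so treat the result as an upper bound rather than what a profiler would report. The function name is made up for illustration.

```python
REGISTERS_PER_SM = 65536     # Volta SM register file, as quoted above
MAX_THREADS_PER_SM = 2048    # Volta SM thread limit, as quoted above
WARP_SIZE = 32


def register_limited_occupancy(regs_per_thread):
    """Return (raw thread limit, whole-warp thread limit, occupancy in %)."""
    # How many threads the register file could hold at this register count.
    raw_threads = REGISTERS_PER_SM // regs_per_thread
    # Threads are resident in whole warps, so round down to a multiple of 32.
    resident_threads = (raw_threads // WARP_SIZE) * WARP_SIZE
    occupancy_pct = 100.0 * resident_threads / MAX_THREADS_PER_SM
    return raw_threads, resident_threads, occupancy_pct


raw, resident, pct = register_limited_occupancy(234)
print(f"raw limit={raw} threads, resident={resident} threads, occupancy={pct}%")
```

For 234 registers per thread this gives a raw limit of 280 threads, 256 resident threads, and 12.5%.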

A Dgemm operation is going to be dominated by DFMA operations, which are serviced by a limited number of units in the Volta SM. So having full occupancy wouldn’t really be a benefit: it does not take 64 warps/SM to keep the DFMA units busy. Therefore the CUBLAS designers haven’t prioritized maximum occupancy, instead favoring increased shared memory and increased register usage. Also note that since Dgemm is 64-bit FP arithmetic, each arithmetic value requires 64 bits, or two ordinary 32-bit registers, so the register pressure is “double” what you might need in a similar Sgemm case. This partly explains the high 234 number.