I am checking the occupancy of a GEMM kernel and I see that the occupancy is always 12.5% (V100, profiling with nsys). What does this mean? We know that GEMM is very well optimized, so what is happening with the remaining 87.5%? I just want to understand it.
Anyway, CUBLAS gemm kernels tend to be fairly heavy on register usage, which will often knock their theoretical occupancy down to 50% or 25% or so. ncu will tell you this directly. You may also be launching a gemm size that is too small to even reach that 50% or 25% number.
Here’s a portion of ncu output from running a cublasSgemm call on matrices of size 1024x1024 on a V100:
volta_sgemm_32x128_nn, 2022-Dec-20 12:17:49, Context 1, Stream 7
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 864.36
SM Frequency cycle/nsecond 1.21
Elapsed Cycles cycle 257,673
Memory [%] % 76.43
DRAM Throughput % 12.28
Duration usecond 212.45
L1/TEX Cache Throughput % 79.45
L2 Cache Throughput % 46.95
SM Active Cycles cycle 247,750.48
Compute (SM) [%] % 82.69
---------------------------------------------------------------------- --------------- ------------------------------
INF The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing workloads in the Compute Workload Analysis section.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 256
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 768
Registers Per Thread register/thread 57
Shared Memory Configuration Size Kbyte 65.54
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block Kbyte/block 16.38
Threads thread 196,608
Waves Per SM 2.40
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 4
Block Limit Shared Mem block 6
Block Limit Warps block 8
Theoretical Active Warps per SM warp 32
Theoretical Occupancy % 50
Achieved Occupancy % 44.75
Achieved Active Warps Per SM warp 28.64
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel's theoretical occupancy (50.0%) is limited by the number of required registers
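As a sanity check on the ncu output above, here is a back-of-the-envelope calculation (plain Python, using the published V100 per-SM limits: 65,536 registers, 96 KB of shared memory usable by blocks, 2048 threads, 32 blocks, and 80 SMs per device) that reproduces the reported block limits, theoretical occupancy, and waves per SM:

```python
# Per-SM hardware limits for a V100 (Volta, compute capability 7.0)
REGS_PER_SM = 65536          # 32-bit registers
SMEM_PER_SM = 96 * 1024      # bytes of shared memory available to blocks
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 32
NUM_SMS = 80                 # V100 SM count

# Kernel launch parameters from the ncu output above
block_size = 256             # threads per block
grid_size = 768              # blocks
regs_per_thread = 57
smem_per_block = 16384       # 16.38 KB static shared memory, in bytes

# Block limits per resource (this ignores register allocation granularity,
# which happens not to change the result for these numbers)
limit_regs = REGS_PER_SM // (regs_per_thread * block_size)   # 4
limit_smem = SMEM_PER_SM // smem_per_block                   # 6
limit_warps = MAX_THREADS_PER_SM // block_size               # 8

blocks_per_sm = min(MAX_BLOCKS_PER_SM, limit_regs, limit_smem, limit_warps)
active_warps = blocks_per_sm * block_size // 32
theoretical_occupancy = active_warps / (MAX_THREADS_PER_SM // 32)
waves_per_sm = grid_size / (NUM_SMS * blocks_per_sm)

print(blocks_per_sm, active_warps, theoretical_occupancy, waves_per_sm)
```

Registers are the binding limit (4 blocks/SM), matching ncu's "Block Limit Registers: 4", 32 theoretical active warps, 50% theoretical occupancy, and 2.4 waves per SM.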
Registers per thread: 234
That is the proximal limiter to occupancy in your case. A Volta SM has 65,536 registers, so at 234 registers/thread it can support at most floor(65536/234) = 280 threads. Since threads are resident in whole warps of 32, the practical limit is 8 warps = 256 threads per SM, and compared to the maximum of 2048 threads per SM, that is 12.5%.
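That calculation, written out (again assuming the standard V100 limits of 65,536 registers and 2048 threads per SM):

```python
REGS_PER_SM = 65536
MAX_THREADS_PER_SM = 2048

regs_per_thread = 234
max_threads = REGS_PER_SM // regs_per_thread       # 280
# Threads are resident in whole warps of 32, so round down to warps
resident_warps = max_threads // 32                 # 8
resident_threads = resident_warps * 32             # 256
occupancy = resident_threads / MAX_THREADS_PER_SM  # 0.125, i.e. 12.5%
print(occupancy)
```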
A Dgemm operation is going to be dominated by DFMA instructions, which are serviced by a limited number of FP64 units in the Volta SM. So full occupancy wouldn't really be a benefit: it does not take 64 warps/SM to keep the DFMA units busy. Therefore the CUBLAS designers haven't prioritized maximum occupancy, instead favoring increased shared memory and increased register usage per thread. Also note that since Dgemm is 64-bit floating-point arithmetic, each arithmetic value requires 64 bits, i.e. two ordinary 32-bit registers, so the register pressure is roughly "double" what a comparable Sgemm kernel would need. This partly explains the high 234 number.
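To see why modest occupancy can be enough here: a Volta SM has 32 FP64 units, so it can issue at most one DFMA warp-instruction (32 lanes) per cycle. A rough peak-throughput check (a sketch using the published V100 figures of 80 SMs and a ~1530 MHz boost clock; the exact clock is SKU-dependent):

```python
NUM_SMS = 80
FP64_UNITS_PER_SM = 32
BOOST_CLOCK_HZ = 1.53e9    # approximate V100 boost clock (SKU-dependent)
FLOPS_PER_FMA = 2          # a fused multiply-add counts as two FLOPs

peak_fp64_tflops = (NUM_SMS * FP64_UNITS_PER_SM * FLOPS_PER_FMA
                   * BOOST_CLOCK_HZ / 1e12)
# ~7.8 TFLOP/s, in line with the advertised V100 FP64 peak.
# Since one DFMA warp-instruction already fills all 32 FP64 units for a
# cycle, a handful of resident warps with enough instruction-level
# parallelism can keep the pipes fed -- far fewer than 64 warps/SM.
print(peak_fp64_tflops)
```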