Strange FP16 GEMM Peak Performance on RTX 3090

I was checking the FP16 GEMM peak performance of an RTX 3090 and was surprised to see a drop of roughly 50% for mid-size matrices.

%   M     N     K   GPU Gflop/s (ms)      GPU error
%========================================================================================================
 1024  1024  1024   2635.23 (   0.81)       ---
 2048  2048  2048   58631.08 (   0.29)       ---
 3072  3072  3072   92188.92 (   0.63)       ---
 4096  4096  4096   113209.10 (   1.21)       ---
 5120  5120  5120   123589.45 (   2.17)       ---
 6144  6144  6144   128241.71 (   3.62)       ---
 7168  7168  7168   115558.98 (   6.37)       ---
 8192  8192  8192   104684.95 (  10.50)       ---
 9216  9216  9216   93158.09 (  16.80)       ---
10240 10240 10240   87832.27 (  24.45)       ---
11264 11264 11264   84352.97 (  33.89)       ---
12288 12288 12288   107240.38 (  34.60)       ---
13312 13312 13312   96272.52 (  49.01)       ---
14336 14336 14336   103458.65 (  56.96)       ---
15360 15360 15360   99511.91 (  72.83)       ---
16384 16384 16384   72890.14 ( 120.68)       ---
17408 17408 17408   87728.52 ( 120.26)       ---
18432 18432 18432   69442.68 ( 180.35)       ---
19456 19456 19456   69949.03 ( 210.58)       ---
20480 20480 20480   68355.85 ( 251.33)       ---
21504 21504 21504   67744.31 ( 293.57)       ---
22528 22528 22528   67491.17 ( 338.81)       ---
23552 23552 23552   66234.61 ( 394.48)       ---
24576 24576 24576   70176.79 ( 423.03)       ---
25600 25600 25600   72157.42 ( 465.02)       ---
26624 26624 26624   73832.88 ( 511.21)       ---
27648 27648 27648   78171.55 ( 540.72)       ---
28672 28672 28672   71223.29 ( 661.88)       ---
29696 29696 29696   70045.31 ( 747.73)       ---
30720 30720 30720   69575.75 ( 833.37)       ---
31744 31744 31744   69425.44 ( 921.50)       ---
32768 32768 32768   69352.25 (1014.66)       ---
33792 33792 33792   103900.57 ( 742.77)       ---
34816 34816 34816   103183.34 ( 818.01)       ---
35840 35840 35840   82927.16 (1110.29)       ---
36864 36864 36864   91454.85 (1095.55)       ---
37888 37888 37888   80549.89 (1350.42)       ---
38912 38912 38912   92057.92 (1280.03)       ---
39936 39936 39936   98316.55 (1295.68)       ---
40960 40960 40960   97357.48 (1411.69)       ---
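
For anyone who wants to cross-check the table, here is a minimal Python sketch that recomputes the Gflop/s column from the timing column. It assumes the usual 2*M*N*K flop count for GEMM (one multiply and one add per inner-product term); the helper name gemm_gflops is made up for illustration and is not part of MAGMA or cuBLAS. Small differences from the reported values are expected because the times above are rounded to two decimals.

```python
def gemm_gflops(m, n, k, time_ms):
    """Gflop/s for an M x N x K GEMM, assuming 2*M*N*K floating-point operations."""
    flops = 2.0 * m * n * k
    seconds = time_ms * 1e-3
    return flops / seconds / 1e9


# (size, time in ms, reported Gflop/s) copied from three rows of the table above
rows = [
    (6144, 3.62, 128241.71),
    (8192, 10.50, 104684.95),
    (32768, 1014.66, 69352.25),
]

for size, ms, reported in rows:
    computed = gemm_gflops(size, size, size, ms)
    print(f"{size:6d}  computed {computed:10.2f}  reported {reported:10.2f}")
```

The recomputed numbers should agree with the reported column to well under 1%, so the two columns appear consistent with each other and the drop shows up in the raw timings too.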

The kernel name for the 35k case is:

trace_35_k_z

For this test I am using MAGMA and its magma_hgemm routine, which is actually a simple wrapper around cuBLAS.

Is it related to the GPU memory?
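
As a rough sanity check on the memory question, the sketch below estimates how much device memory the three FP16 matrices occupy at a few of the tested sizes. It assumes 2 bytes per element, that only A, B and C are resident, and the 24 GB of a stock RTX 3090; workspace and other allocations are ignored, and the helper name hgemm_footprint_gib is hypothetical.

```python
def hgemm_footprint_gib(m, n, k, bytes_per_elem=2):
    """Approximate size in GiB of A (M x K), B (K x N) and C (M x N) in FP16."""
    elems = m * k + k * n + m * n
    return elems * bytes_per_elem / 2**30


# A few square sizes from the table: near the peak, in the slow range, and the largest
for size in (6144, 8192, 16384, 32768, 40960):
    gib = hgemm_footprint_gib(size, size, size)
    print(f"{size:6d}  ~{gib:6.3f} GiB for A, B and C")
```

Under these assumptions even the largest case stays below 10 GiB, and the slowdown already starts at sizes that need well under 1 GiB, so memory capacity alone does not look like the explanation. This says nothing about cache or bandwidth effects, which the estimate does not model.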

It’s common for a library to use a single kernel for multiple matrix configurations, and that kernel may not be optimal for a particular size on given hardware. This is true for most math libraries.