Question on "Matrix Multiplication Background User's Guide"

Referring to the Matrix Multiplication Background User’s Guide: for the equation used for Arithmetic Intensity, should it not be the following?

MN(2K-1) / (MN(3K-1))

instead of

MNK / (MK + KN + MN)

The formula in the User’s Guide seems correct, and the reasoning behind the formula that is provided in the preceding paragraph also seems correct.

Would you be able to provide more details? Something along the following lines?

c[i,j] = sum over k of (a[i,k]*b[k,j]) → K multiplications and (K-1) additions → 2K-1 operations
→ 2K reads and 1 write → 2K+1 accesses (assuming the accumulation result is kept in a register, requiring only the final result to be written to memory)

Hence, AI = MN(2K-1) / (2MN(2K+1)) = (2K-1) / (2(2K+1))

Do we agree on the starting point:

                  # of FLOPS         2*(M*N*K)
Arith. Inten. =  ------------ = --------------------
                  # of bytes    2*(M*K + N*K + M*N)

?

The numerator is the number of flops in a matrix multiply (e.g. 19.1.3)

The denominator is the cost to load or store the matrices. For

 C=A*B

A is an MxK matrix, B is a KxN matrix, and C is an MxN matrix. For A we are loading MxK elements, for B we are loading KxN elements, and for C we are storing MxN elements. Each element is 2 bytes.
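To make that concrete, here is the same arithmetic as a small sketch (plain host code; the M = N = K = 4096 size is just an assumed example, and 2 bytes per element corresponds to FP16):

```cpp
// Sketch: evaluate AI = 2*M*N*K / (2*(M*K + K*N + M*N)) for an FP16 GEMM.
// M = N = K = 4096 is only an illustrative size.
#include <cstdio>

int main() {
    const double M = 4096, N = 4096, K = 4096;
    const double bytes_per_elem = 2.0;      // FP16

    const double flops = 2.0 * M * N * K;   // one multiply + one add per (i,j,k)
    const double bytes = bytes_per_elem * (M * K + K * N + M * N);  // load A and B once, store C once
    printf("FLOPs = %.3e, bytes = %.3e, AI = %.1f FLOP/byte\n",
           flops, bytes, flops / bytes);
    return 0;
}
```

For a cube like that the ratio simplifies to K/3, i.e. roughly 1365 FLOPs per byte.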

Even if we use your modification that considers one less flop per element of C, it has no impact on the number of bytes read or written.

AI = MN(2K-1) / (2(MN+MK+KN))

There isn’t any way to do a matrix multiply without reading all the bytes of A and B, and writing all the bytes of C. One of your errors seems to be dropping the multiplier of 2 for bytes per element in the denominator. Another error seems to be counting the loads of elements from A and B multiple times. We need only load a particular element once. Your method assumes we will reload row 0 of A for every element C[0,j]. Calculating the first row of C only requires that the first row of A be loaded once, not once per element in the first row of C.

Thanks for explaining the denominator and pointing out the missing factor of 2 for 16-bit operands. The theoretical minimum of memory accesses is indeed reading every element of A and B once, and writing every element of C once. That ideal condition requires loading the whole of A and B into registers/cache, does it not? Large A and B are not going to fit. How close do implementations come in practice? GEMMs are very well optimized nowadays, so presumably very close to the theoretical minimum?

I don’t know. You could start to answer that with a profiler. You can use a profiler (e.g. Nsight Compute) to tell you how many bytes were read from and written to global memory, and try that on a reasonably sized cublasSgemm, for example.
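As a minimal sketch of that experiment (error checking omitted; the 4096x4096 size is just an example, and note cublasSgemm is FP32, so 4 bytes per element):

```cpp
// Run a single cublasSgemm so a profiler can report the actual DRAM traffic.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 4096;
    const size_t bytes = sizeof(float) * n * n;
    const float alpha = 1.0f, beta = 0.0f;

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemset(dA, 0, bytes);   // contents don't matter for measuring traffic
    cudaMemset(dB, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha*A*B + beta*C, all matrices n x n, column-major
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    printf("done\n");
    return 0;
}
```

Built with something like `nvcc -O2 -lcublas gemm.cu -o gemm` and run under `ncu --metrics dram__bytes_read.sum,dram__bytes_write.sum ./gemm` (metric names may vary with Nsight Compute version), the GEMM kernel’s reported bytes can then be compared against 4*(M*K + K*N + M*N), the single-pass minimum for FP32.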

You could start by thinking about what happens if you cut A or B (depending on whether M or N is larger) in half. Then you have to read the non-cut matrix twice, but you have the same amount of reads and writes for the cut matrix and for C, and the same number of FLOPs. You can do this iteratively. It would create an additional M or N or M*N factor for very large matrices.

Alternatively you can cut the K dimension. Then A and B have to be read as much as before, but the output is written twice (or rather, for the second write the first result has to be read back and added, which the tensor cores can do as part of their operation). It adds to the number of FLOPs, and only to the memory operations on C.
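For what it’s worth, here is that argument as a rough traffic model (my own sketch, combining the N-split and K-split cases; the matrix sizes, the 2-byte elements, and the assumption of no reuse across pieces are all just assumptions):

```cpp
#include <cstdio>
#include <initializer_list>

// Rough model: split N into tn pieces and K into tk pieces, assume each piece's
// working set fits on chip and nothing is reused across pieces. DRAM traffic:
//   A re-read once per N piece:                      tn * M*K elements
//   B read once in total:                            K*N elements
//   C written tk times and re-read (tk-1) times
//   to accumulate partial sums:                      (2*tk - 1) * M*N elements
double gemm_dram_bytes(double M, double N, double K,
                       double tn, double tk, double bytes_per_elem) {
    return bytes_per_elem * (tn * M * K + K * N + (2.0 * tk - 1.0) * M * N);
}

int main() {
    const double M = 65536, N = 65536, K = 65536;  // example sizes only
    const double flops = 2.0 * M * N * K;          // extra accumulation adds ignored
    for (double tn : {1.0, 2.0, 4.0}) {
        for (double tk : {1.0, 2.0}) {
            printf("tn=%g tk=%g -> AI ~ %.0f FLOP/byte\n",
                   tn, tk, flops / gemm_dram_bytes(M, N, K, tn, tk, 2.0));
        }
    }
    return 0;
}
```

With tn = tk = 1 this reduces to the 2*(M*K + K*N + M*N) denominator above; as the splits grow, the re-read A term and the re-written C term start to dominate, which is the gradual worsening described above.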

All in all you probably would get an M*N*K factor in the denominator for very large matrices, e.g. starting at tens or hundreds of TB in size (before that it already gradually worsens). But then you have to consider secondary storage speeds, or upgrading your compute center.

H100 has up to 227KB of usable shared memory per SM. Register space per SM amounts to 256KB. The SXM variant has 132 SMs. The total on-chip “addressable” memory is therefore over 60MB. A highly optimized GEMM code would probably seek to make best use of that space. For FP16, a 4Kx4K tile or more could fit on-chip. And this says nothing about caching, which if carefully used could add another 60MB (L2). It’s perhaps conceivable that a 4Kx4K FP16 matrix multiply could fit entirely on chip, necessitating only a single load of A and B (from off-chip DRAM), and a single store (to global/DRAM) for C.

For H100, the largest hypothetical 16-bit square matrices that can fit into the shared memory + register file are sqrt( (227+256)*1024*132/(3*2) ) = 3298, right?

The shared/L1 size, register file size, and SM count for V100 are 96KB, 256KB, and 80 respectively. Similarly, the largest hypothetical 16-bit square matrices are sqrt( (96+256)*1024*80/(3*2) ) = 2192.
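Written out as a sketch (using only the per-SM shared memory and register file sizes and SM counts quoted above; 3 matrices at 2 bytes per FP16 element):

```cpp
// Largest square FP16 GEMM whose three operands (A, B, C) would fit entirely
// in the combined shared memory + register file, per the figures quoted above.
#include <cmath>
#include <cstdio>

int main() {
    struct Gpu { const char* name; double smem_kb, rf_kb, sms; };
    const Gpu gpus[] = {
        {"H100 SXM", 227.0, 256.0, 132.0},
        {"V100",      96.0, 256.0,  80.0},
    };
    for (const Gpu& g : gpus) {
        const double bytes = (g.smem_kb + g.rf_kb) * 1024.0 * g.sms;  // total SMEM + RF
        const double n = std::sqrt(bytes / (3.0 * 2.0));              // 3 matrices, 2 bytes/element
        printf("%-8s: %.1f MiB on-chip -> max square FP16 size ~%.0f\n",
               g.name, bytes / (1024.0 * 1024.0), n);
    }
    return 0;
}
```

That reproduces the ~3298 and ~2192 figures, and the H100 SMEM+RF total of about 62 MiB is consistent with the “over 60MB” mentioned above.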

As an aside, that is only an improvement of 3298/2192 = 1.5x from V100 to H100! Whereas the 16-bit TOPS has gone up from 125 to 1979, or 15.8x. Hence Arithmetic Intensity (AI) or memory bandwidth needs to be ~10x higher for H100 cores/threads to be as busy as on V100. The memory bandwidth of V100 and H100 SXM is 0.9 TB/s and 3.35 TB/s, or 3.7x. Hence AI has to be 10/3.7 = 2.7x higher for H100 cores to be as busy as on V100… Interestingly, as AI for GEMMs is O(N), the 1.5x increase in on-chip memory falls short of that.

This article has details of tiling in CUTLASS GEMMs. Although the article was written for V100, I think it is applicable to H100/H200. Is there a similar article for H100?