After finding out that cuBLAS's complex MatMul functions don't utilise tensor cores on a GPU, I did some research and found that a complex MatMul can be implemented in other ways.
For example, multiplying two complex matrices A & B using 4 real MatMuls such that:
C_real = (A_real ⋅ B_real) − (A_imag ⋅ B_imag)
C_imag = (A_real ⋅ B_imag) + (A_imag ⋅ B_real)
produces similar timings to cuBLAS for smaller matrices, and can be as much as 3x faster than cuBLAS for larger matrices.
All matrices are 2ⁿ x 2ⁿ in size, for n = 1 to n = 14. Tests were performed on an NVIDIA A6000.
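To make the 4-MatMul decomposition concrete, here is a minimal plain-Python sketch. It is not the benchmarked GPU code: the helper names (`matmul`, `add`, `sub`, `split`, `complex_matmul_4m`) are made up for illustration, and `matmul` only stands in for a real GEMM call on the GPU.

```python
def matmul(X, Y):
    """Product of two matrices given as lists of rows.

    Stands in for one real GEMM call. It also works on complex entries,
    which is used below only to build a reference result.
    """
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def add(X, Y):
    """Element-wise sum of two matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    """Element-wise difference of two matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(Z):
    """Split a complex matrix into its real and imaginary parts."""
    return ([[z.real for z in row] for row in Z],
            [[z.imag for z in row] for row in Z])


def complex_matmul_4m(A_re, A_im, B_re, B_im):
    """Complex product C = A * B built from four real matrix products."""
    C_re = sub(matmul(A_re, B_re), matmul(A_im, B_im))
    C_im = add(matmul(A_re, B_im), matmul(A_im, B_re))
    return C_re, C_im


# Small illustrative inputs (hypothetical values, not from the benchmark).
A = [[1 + 2j, 3 - 1j], [0 + 1j, 2 + 0j]]
B = [[2 - 1j, 1 + 1j], [1 + 0j, -1 + 2j]]

C_re, C_im = complex_matmul_4m(*split(A), *split(B))

# Compare against Python's built-in complex arithmetic.
print((C_re, C_im) == split(matmul(A, B)))
```

On a GPU, each of the four `matmul` calls would be a real GEMM that is eligible for tensor cores, which is presumably where the speed-up for large matrices comes from.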
Is there any reason that the cuBLAS team hasn't implemented something like this already? I would think there is one, since they haven't done so; I'm just wondering what that reason is.
Not sure about the cuBLAS implementation details of C/ZGEMM, but you could also rewrite:
z = (a+ib)(c+id) = (ac−bd) + i(ad+bc)
as
z = (ac−bd) + i[(a+b)(c+d) − ac − bd]
which needs three multiplications instead of four.
This is known as the 3M method (originally proposed by Ungar in 1963) and it is implemented in cuBLAS (cublasZgemm3m).
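As a rough illustration of the 3M idea applied to matrices, here is a plain-Python sketch. It is not how cuBLAS implements cublasZgemm3m internally; the helper names are hypothetical and `matmul` only stands in for a real GEMM call.

```python
def matmul(X, Y):
    """Product of two matrices given as lists of rows.

    Stands in for one real GEMM call. It also works on complex entries,
    which is used below only to build a reference result.
    """
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def add(X, Y):
    """Element-wise sum of two matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    """Element-wise difference of two matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(Z):
    """Split a complex matrix into its real and imaginary parts."""
    return ([[z.real for z in row] for row in Z],
            [[z.imag for z in row] for row in Z])


def complex_matmul_3m(A_re, A_im, B_re, B_im):
    """Complex product C = A * B built from three real matrix products."""
    T1 = matmul(A_re, B_re)                          # ac
    T2 = matmul(A_im, B_im)                          # bd
    T3 = matmul(add(A_re, A_im), add(B_re, B_im))    # (a+b)(c+d)
    C_re = sub(T1, T2)                               # ac - bd
    C_im = sub(sub(T3, T1), T2)                      # (a+b)(c+d) - ac - bd
    return C_re, C_im


# Small illustrative inputs (hypothetical values).
A = [[1 + 2j, 3 - 1j], [0 + 1j, 2 + 0j]]
B = [[2 - 1j, 1 + 1j], [1 + 0j, -1 + 2j]]

C_re, C_im = complex_matmul_3m(*split(A), *split(B))

# Compare against Python's built-in complex arithmetic.
print((C_re, C_im) == split(matmul(A, B)))
```

The trade is one fewer matrix product in exchange for a few extra matrix additions, which are much cheaper than a GEMM for large matrices. The inputs here are small integers, so the comparison is exact; with general floating-point data the imaginary part can pick up slightly different rounding error than in the four-product form.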
This paper has a nice analysis and more implementation details:
Tensor cores are used when the compute type is CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, or CUBLAS_COMPUTE_32F_FAST_TF32.
Also, the 3M variants are accessible via cublasCgemm3m() or cublasZgemm3m(), or the heuristics can pick them up automatically. To disable them, specify the CUBLAS_COMPUTE_32F_PEDANTIC or CUBLAS_COMPUTE_64F_PEDANTIC compute type.