Why hasn't cuBLAS implemented a tensor core complex MatMul?

After finding out that cuBLAS's complex MatMul functions don't utilise the tensor cores on a GPU, I did some research and found that a tensor-core complex MatMul can be implemented in other ways.

For example, multiplying two complex matrices A and B using 4 real MatMuls such that:

C_real = (A_real ⋅ B_real) − (A_imag ⋅ B_imag)
C_imag = (A_real ⋅ B_imag) + (A_imag ⋅ B_real)

produces similar timings to cuBLAS for smaller matrices, and can be as much as 3x faster than cuBLAS for larger matrices.
All matrices were 2ⁿ × 2ⁿ in size for n = 1 to 14. Tests were performed on an NVIDIA A6000.
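In case it's useful, here is a minimal sketch of the 4M decomposition I'm describing, assuming planar (split real/imaginary) storage of column-major n x n float matrices; the names are illustrative and error checking is omitted:

```cpp
#include <cublas_v2.h>

// 4M complex MatMul sketch: C = A * B, with each complex matrix stored
// as separate real/imaginary planes (Ar/Ai, Br/Bi, Cr/Ci are device
// pointers to n x n column-major float matrices).
void cgemm_4m(cublasHandle_t handle, int n,
              const float *Ar, const float *Ai,
              const float *Br, const float *Bi,
              float *Cr, float *Ci)
{
    const float one = 1.0f, minus_one = -1.0f, zero = 0.0f;

    // C_real = A_real * B_real
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, Ar, n, Br, n, &zero, Cr, n);
    // C_real -= A_imag * B_imag
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &minus_one, Ai, n, Bi, n, &one, Cr, n);
    // C_imag = A_real * B_imag
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, Ar, n, Bi, n, &zero, Ci, n);
    // C_imag += A_imag * B_real
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, Ai, n, Br, n, &one, Ci, n);
}
```

Each call is a plain real GEMM, so it can take a tensor-core path, e.g. by enabling TF32 on the handle with cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH).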

Is there a reason the cuBLAS team hasn't implemented something like this already? I assume there is one if they haven't; I'm just wondering what that reason is.

Many thanks.

Not sure about the cuBLAS implementation details of C/ZGEMM, but you could also rewrite:
z = (a+ib)(c+id) = ac - bd + i(ad + bc)
as
z = ac - bd + i[(a+b)(c+d) - ac - bd],
which requires fewer multiplications (3 real multiplications instead of 4).

This is known as the 3M method (originally proposed by Ungar in 1963), and it is implemented in cuBLAS as cublasZgemm3m().
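Lifted to matrices, the same identity becomes three real GEMMs plus a handful of matrix additions. A rough sketch with planar storage (T1, T2, P1, P2 are scratch device buffers; names are illustrative and error checking is omitted):

```cpp
#include <cublas_v2.h>

// 3M complex MatMul sketch: C = A * B using three real GEMMs.
// All matrices are n x n column-major floats in planar layout.
void cgemm_3m(cublasHandle_t handle, int n,
              const float *Ar, const float *Ai,
              const float *Br, const float *Bi,
              float *Cr, float *Ci,
              float *T1, float *T2, float *P1, float *P2)
{
    const float one = 1.0f, minus_one = -1.0f, zero = 0.0f;

    // T1 = A_real + A_imag,  T2 = B_real + B_imag
    cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n,
                &one, Ar, n, &one, Ai, n, T1, n);
    cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n,
                &one, Br, n, &one, Bi, n, T2, n);

    // The three real GEMMs:
    // P1 = A_real * B_real,  P2 = A_imag * B_imag,  C_imag = T1 * T2
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, Ar, n, Br, n, &zero, P1, n);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, Ai, n, Bi, n, &zero, P2, n);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, T1, n, T2, n, &zero, Ci, n);

    // C_real = P1 - P2
    cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n,
                &one, P1, n, &minus_one, P2, n, Cr, n);
    // C_imag = (T1*T2) - P1 - P2, done as two in-place subtractions
    cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n,
                &one, Ci, n, &minus_one, P1, n, Ci, n);
    cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n,
                &one, Ci, n, &minus_one, P2, n, Ci, n);
}
```

The trade-off is the extra additions and scratch memory, and slightly different rounding behaviour than the plain 4M form.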

This paper has a nice analysis and more implementation details:

Hello @harry.shepherd and @mfatica.

Tensor cores are used when the compute type is CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, or CUBLAS_COMPUTE_32F_FAST_TF32.
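For example, one way to request these (if I'm reading the docs right) is to pass the compute type to cublasGemmEx(); the TF32 choice and square shapes below are just for illustration:

```cpp
#include <cublas_v2.h>
#include <cuComplex.h>

// Complex single-precision GEMM with a tensor-core-friendly compute
// type. d_A, d_B, d_C are device pointers to n x n column-major
// cuComplex matrices; error checking omitted.
void cgemm_tf32(cublasHandle_t handle, int n,
                const cuComplex *d_A, const cuComplex *d_B, cuComplex *d_C)
{
    const cuComplex alpha = make_cuComplex(1.0f, 0.0f);
    const cuComplex beta  = make_cuComplex(0.0f, 0.0f);

    // CUBLAS_COMPUTE_32F_FAST_TF32 allows a TF32 tensor-core path;
    // _FAST_16F / _FAST_16BF allow FP16 / BF16 internally instead.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, d_A, CUDA_C_32F, n,
                         d_B, CUDA_C_32F, n,
                 &beta,  d_C, CUDA_C_32F, n,
                 CUBLAS_COMPUTE_32F_FAST_TF32,
                 CUBLAS_GEMM_DEFAULT);
}
```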

Also, the 3M variants are accessible via cublasCgemm3m() or cublasZgemm3m(), or the heuristics can pick them automatically. The way to disable them is to specify the CUBLAS_COMPUTE_32F_PEDANTIC or CUBLAS_COMPUTE_64F_PEDANTIC compute type.
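For instance, a sketch of both: calling the 3M variant explicitly, and forcing the pedantic compute type to rule it out (d_A, d_B, d_C are illustrative n x n cuComplex device matrices, column-major):

```cpp
#include <cublas_v2.h>
#include <cuComplex.h>

void gemm_3m_and_pedantic(cublasHandle_t handle, int n,
                          const cuComplex *d_A, const cuComplex *d_B,
                          cuComplex *d_C)
{
    const cuComplex alpha = make_cuComplex(1.0f, 0.0f);
    const cuComplex beta  = make_cuComplex(0.0f, 0.0f);

    // Explicit 3M GEMM (cublasZgemm3m is the double-precision version).
    cublasCgemm3m(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                  &alpha, d_A, n, d_B, n, &beta, d_C, n);

    // Pedantic compute type: no 3M, no reduced-precision shortcuts.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, d_A, CUDA_C_32F, n,
                         d_B, CUDA_C_32F, n,
                 &beta,  d_C, CUDA_C_32F, n,
                 CUBLAS_COMPUTE_32F_PEDANTIC,
                 CUBLAS_GEMM_DEFAULT);
}
```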