complex FP16 tensor core GEMMs

gary.ballantyne · February 4, 2020, 1:27am

Hello

If someone knows the best (easiest to code) way to do a complex half-precision GEMM using tensor cores, I’d really appreciate any help.

It seems that, about a year ago, this wasn’t possible in cutlass (page 4):

https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9306-extreme-signal-processing-performance-using-tensor-cores-and-astronomical-imaging-on-gpus.pdf

At the time, the best approach was to map the problem to an equivalent real problem. On the other hand, planar complex GEMMs are mentioned in the latest cutlass profiler:

https://github.com/NVIDIA/cutlass/blob/fb335f6a5fe19d22a834b19706707f4489427482/tools/profiler/src/gemm_operation_profiler.cu#L49

But I suspect (from line 139 in the above file) that only the basic GEMM is covered.
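
For anyone following along, here is a minimal host-side sketch of what "map the problem to an equivalent real problem" can look like. It uses the standard block embedding [Ar -Ai; Ai Ar] * [Br; Bi] = [Cr; Ci]; I am not certain this is the exact mapping used in the presentation. The names `real_gemm` and `complex_gemm_via_real` are made up for illustration, and `float` stands in for FP16 so it runs without a GPU.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Row-major real GEMM, C (m x n) = A (m x k) * B (k x n).
// Host stand-in for a real-valued tensor core GEMM (hypothetical helper).
std::vector<float> real_gemm(const std::vector<float>& A,
                             const std::vector<float>& B,
                             std::size_t m, std::size_t k, std::size_t n) {
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
    return C;
}

using cfloat = std::complex<float>;

// Complex C = A * B computed with a single real GEMM:
//   [Ar -Ai]   [Br]   [Cr]
//   [Ai  Ar] * [Bi] = [Ci]
std::vector<cfloat> complex_gemm_via_real(const std::vector<cfloat>& A,
                                          const std::vector<cfloat>& B,
                                          std::size_t m, std::size_t k,
                                          std::size_t n) {
    const std::size_t lda = 2 * k;  // row stride of the embedded A
    std::vector<float> Ae(2 * m * lda), Be(2 * k * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            const float re = A[i * k + p].real();
            const float im = A[i * k + p].imag();
            Ae[i * lda + p] = re;            // top-left block:     Ar
            Ae[i * lda + k + p] = -im;       // top-right block:   -Ai
            Ae[(m + i) * lda + p] = im;      // bottom-left block:  Ai
            Ae[(m + i) * lda + k + p] = re;  // bottom-right block: Ar
        }
    }
    for (std::size_t p = 0; p < k; ++p) {
        for (std::size_t j = 0; j < n; ++j) {
            Be[p * n + j] = B[p * n + j].real();        // upper half: Br
            Be[(k + p) * n + j] = B[p * n + j].imag();  // lower half: Bi
        }
    }
    const std::vector<float> Ce = real_gemm(Ae, Be, 2 * m, 2 * k, n);
    std::vector<cfloat> C(m * n);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            C[i * n + j] = cfloat(Ce[i * n + j], Ce[(m + i) * n + j]);
    return C;
}
```

The trade-off in this sketch is that the embedded A holds every real and imaginary entry twice, in exchange for needing only one plain real GEMM.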

Cutlass aside, it seems that cublas will handle the problem:

https://docs.nvidia.com/cuda/cublas/index.html#cublasLt-example-planar-complex

That is, so long as the input and output are separated into real and imaginary parts. Ideally, I’d like to use an interleaved layout.
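
To make the planar layout concrete, here is a rough host-side sketch of a planar complex GEMM, where the real and imaginary parts live in separate arrays and the product is Cr = Ar*Br - Ai*Bi, Ci = Ar*Bi + Ai*Br. This is only the arithmetic, not the cublasLt API; `PlanarMatrix` and `planar_gemm` are hypothetical names, and `float` stands in for FP16.

```cpp
#include <cstddef>
#include <vector>

// Planar complex matrix (hypothetical type): real and imaginary parts are
// stored in two separate row-major arrays instead of interleaved pairs.
struct PlanarMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> re;
    std::vector<float> im;
};

// C = A * B on planar operands:
//   Cr = Ar*Br - Ai*Bi
//   Ci = Ar*Bi + Ai*Br
PlanarMatrix planar_gemm(const PlanarMatrix& A, const PlanarMatrix& B) {
    const std::size_t m = A.rows, k = A.cols, n = B.cols;
    PlanarMatrix C{m, n, std::vector<float>(m * n, 0.0f),
                   std::vector<float>(m * n, 0.0f)};
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            const float ar = A.re[i * k + p];
            const float ai = A.im[i * k + p];
            for (std::size_t j = 0; j < n; ++j) {
                const float br = B.re[p * n + j];
                const float bi = B.im[p * n + j];
                C.re[i * n + j] += ar * br - ai * bi;
                C.im[i * n + j] += ar * bi + ai * br;
            }
        }
    }
    return C;
}
```

Each plane is an ordinary real matrix, which is why this layout decomposes naturally into real multiply-accumulates.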

If you have a view on the best approach, I’d welcome the input.

Thanks!

Gary

mnicely · February 4, 2020, 4:33am

The easiest way would be to use cublasLt. You are correct that cublasLt currently requires a planar layout. There is a function to help transform from interleaved to planar, but it comes with a negative impact on performance.
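
As an illustration of what the interleaved-to-planar transform does (this is not the cublasLt helper itself, just a host-side sketch with made-up names `Planes`, `interleaved_to_planar` and `planar_to_interleaved`, using `float` in place of FP16):

```cpp
#include <cstddef>
#include <vector>

// Two separate planes holding the real and imaginary parts (hypothetical type).
struct Planes {
    std::vector<float> re;
    std::vector<float> im;
};

// Interleaved input is [re0, im0, re1, im1, ...]; split it into two planes.
Planes interleaved_to_planar(const std::vector<float>& interleaved) {
    const std::size_t n = interleaved.size() / 2;
    Planes p{std::vector<float>(n), std::vector<float>(n)};
    for (std::size_t i = 0; i < n; ++i) {
        p.re[i] = interleaved[2 * i];
        p.im[i] = interleaved[2 * i + 1];
    }
    return p;
}

// Inverse transform: merge the two planes back into interleaved pairs.
std::vector<float> planar_to_interleaved(const Planes& p) {
    std::vector<float> out(2 * p.re.size());
    for (std::size_t i = 0; i < p.re.size(); ++i) {
        out[2 * i] = p.re[i];
        out[2 * i + 1] = p.im[i];
    }
    return out;
}
```

Each conversion is an extra pass over the data, and that additional memory traffic is where the performance cost comes from.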

You should also have the functionality in CUTLASS to do what you’re asking. I think you should be able to modify https://github.com/NVIDIA/cutlass/blob/fb335f6a5fe19d22a834b19706707f4489427482/examples/08_turing_tensorop_gemm/turing_tensorop_gemm.cu pretty easily.

If you’re able to come to GTC 2020, there should be updated presentations on both libraries, I think.

gary.ballantyne · February 4, 2020, 8:27am

Thanks. Could you please link to the interleaved-to-planar function?

There is another interesting presentation on this topic, for anyone following along:

“Towards Half-Precision Computation for Complex Matrices”
https://www.icl.utk.edu/files/publications/2019/icl-utk-1308-2019.pdf

gary.ballantyne · February 5, 2020, 2:26am

The planar <-> interleaved functionality is in mnicely’s complex half precision example:

https://github.com/mnicely/cublasLt_examples/blob/master/cublasLt_C16F_TCs.cu
