complex FP16 tensor core GEMMs

Hello

If someone knows the best (easiest to code) way to do a complex half-precision GEMM using tensor cores, I’d really appreciate any help.

It seems that, about a year ago, this wasn’t possible in CUTLASS (page 4):

https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9306-extreme-signal-processing-performance-using-tensor-cores-and-astronomical-imaging-on-gpus.pdf

At the time, the best approach was to map the problem to an equivalent real problem. On the other hand, planar complex GEMMs are mentioned in the latest CUTLASS profiler:
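For anyone unfamiliar with the "equivalent real problem" idea, here is a minimal host-side sketch of one common way to do it: split A and B into real and imaginary parts and compute [Cr; Ci] = [Ar -Ai; Ai Ar] [Br; Bi] with a single real GEMM. This is only an illustration under stated assumptions, not the method from the linked slides. It uses float instead of FP16 so it runs on the CPU, and `real_gemm` and `complex_gemm_via_real` are hypothetical names standing in for whatever real tensor-core GEMM you would actually call.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Plain row-major real GEMM: C (m x n) = A (m x k) * B (k x n).
// Hypothetical stand-in for a real tensor-core GEMM.
std::vector<float> real_gemm(const std::vector<float>& A,
                             const std::vector<float>& B,
                             std::size_t m, std::size_t k, std::size_t n) {
    assert(A.size() == m * k && B.size() == k * n);
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
    return C;
}

// Complex GEMM expressed as one real GEMM of size (2m x 2k) * (2k x n):
//   [Cr]   [Ar  -Ai] [Br]
//   [Ci] = [Ai   Ar] [Bi]
std::vector<std::complex<float>> complex_gemm_via_real(
        const std::vector<std::complex<float>>& A,
        const std::vector<std::complex<float>>& B,
        std::size_t m, std::size_t k, std::size_t n) {
    assert(A.size() == m * k && B.size() == k * n);

    // Build the 2m x 2k real block matrix from A.
    std::vector<float> Ahat(2 * m * 2 * k);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            const float re = A[i * k + p].real();
            const float im = A[i * k + p].imag();
            Ahat[i * (2 * k) + p] = re;            // top-left:     Ar
            Ahat[i * (2 * k) + k + p] = -im;       // top-right:   -Ai
            Ahat[(m + i) * (2 * k) + p] = im;      // bottom-left:  Ai
            Ahat[(m + i) * (2 * k) + k + p] = re;  // bottom-right: Ar
        }
    }

    // Stack Br on top of Bi to get a 2k x n real matrix.
    std::vector<float> Bhat(2 * k * n);
    for (std::size_t p = 0; p < k; ++p) {
        for (std::size_t j = 0; j < n; ++j) {
            Bhat[p * n + j] = B[p * n + j].real();
            Bhat[(k + p) * n + j] = B[p * n + j].imag();
        }
    }

    // One real GEMM; rows 0..m-1 hold Cr, rows m..2m-1 hold Ci.
    const std::vector<float> Chat = real_gemm(Ahat, Bhat, 2 * m, 2 * k, n);

    std::vector<std::complex<float>> C(m * n);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            C[i * n + j] = {Chat[i * n + j], Chat[(m + i) * n + j]};
    return C;
}
```

Note the trade-off: the real GEMM does the same four real multiplies per complex multiply, but A is stored twice, so this costs extra memory and a repacking pass.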

https://github.com/NVIDIA/cutlass/blob/fb335f6a5fe19d22a834b19706707f4489427482/tools/profiler/src/gemm_operation_profiler.cu#L49

But I suspect (from line 139 in the above file) that only the basic GEMM is covered.

CUTLASS aside, it seems that cuBLAS will handle the problem:

https://docs.nvidia.com/cuda/cublas/index.html#cublasLt-example-planar-complex

That is, so long as the input and output are separated into real and imaginary parts. Ideally, I’d like to use an interleaved layout.
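To make the two layouts concrete, here is a minimal host-side sketch of converting between them. Interleaved means each element is stored as a (re, im) pair; planar means all real parts sit in one array and all imaginary parts in another. The `Planar` struct and the `to_planar` / `to_interleaved` names are hypothetical, and float is used in place of FP16 so the code runs on the CPU. On a GPU the same copy would presumably be done in a kernel.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Planar layout: real and imaginary parts in two separate arrays.
struct Planar {
    std::vector<float> re;
    std::vector<float> im;
};

// Interleaved (re, im, re, im, ...) -> planar (re..., im...).
Planar to_planar(const std::vector<std::complex<float>>& z) {
    Planar p;
    p.re.resize(z.size());
    p.im.resize(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) {
        p.re[i] = z[i].real();
        p.im[i] = z[i].imag();
    }
    return p;
}

// Planar -> interleaved; both planes must have the same length.
std::vector<std::complex<float>> to_interleaved(const Planar& p) {
    assert(p.re.size() == p.im.size());
    std::vector<std::complex<float>> z(p.re.size());
    for (std::size_t i = 0; i < z.size(); ++i) {
        z[i] = {p.re[i], p.im[i]};
    }
    return z;
}
```

Each conversion is an extra full pass over the data, which is why keeping everything planar end to end is cheaper when the rest of the pipeline allows it.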

If you have a view on the best approach, I’d welcome the input.

Thanks!

Gary

The easiest way would be to use cublasLt. You are correct that cublasLt currently requires a planar layout. There is a function to help transform from interleaved to planar, but it has a negative impact on performance.

You should also have the functionality in CUTLASS to do what you’re asking. I think you should be able to modify https://github.com/NVIDIA/cutlass/blob/fb335f6a5fe19d22a834b19706707f4489427482/examples/08_turing_tensorop_gemm/turing_tensorop_gemm.cu pretty easily.

If you’re able to come to GTC 2020, there should be updated presentations on both libraries, I think.

Thanks. Could you please link to the interleaved-to-planar function?

There is another interesting presentation on this topic, for anyone following along:

“Towards Half-Precision Computation for Complex Matrices”
https://www.icl.utk.edu/files/publications/2019/icl-utk-1308-2019.pdf

The planar <-> interleaved functionality is in mnicely’s complex half precision example:

https://github.com/mnicely/cublasLt_examples/blob/master/cublasLt_C16F_TCs.cu