SGEMM FP16 compute?

Hi, currently SGEMMex partially supports FP16, in that it will accept inputs and outputs as FP16, but it performs the internal computation in FP32. The Pascal P100 is advertised as having twice the FP16 performance of FP32. Has NVIDIA updated SGEMMex to support FP16 operations yet? I cannot find any mention of how to do this; it seems to appear only in marketing papers. So does it really exist, or is NVIDIA marketing getting ahead of reality again?

hgemm:

http://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-gemm

If you try to run hgemm on a device that does not support it, you will get:

CUBLAS_STATUS_ARCH_MISMATCH

“the device does not support double-precision or in the case of cublasHgemm the device does not support math in half precision”
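A minimal sketch of guarding an hgemm call on that return status (the matrix size and everything else here is illustrative, not from the docs):

```cuda
#include <cuda_fp16.h>
#include <cublas_v2.h>
#include <cstdio>

int main() {
    const int n = 1024;                       // illustrative square size
    cublasHandle_t handle;
    cublasCreate(&handle);

    __half *A, *B, *C;
    cudaMalloc(&A, n * n * sizeof(__half));
    cudaMalloc(&B, n * n * sizeof(__half));
    cudaMalloc(&C, n * n * sizeof(__half));

    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);

    cublasStatus_t st = cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                    n, n, n,
                                    &alpha, A, n, B, n,
                                    &beta,  C, n);
    if (st == CUBLAS_STATUS_ARCH_MISMATCH)
        printf("this device has no native FP16 math path\n");
    else if (st != CUBLAS_STATUS_SUCCESS)
        printf("cublasHgemm failed with status %d\n", (int)st);

    cudaFree(A); cudaFree(B); cudaFree(C);
    cublasDestroy(handle);
    return 0;
}
```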

Since the current definition of Sgemmex is:

“This function is an extension of cublasSgemm where the input matrices and output matrices can have a lower precision but the computation is still done in float”

I’m not sure you’ll ever see SgemmEx modified to perform the calculation at half precision. That would more or less be contrary to the S in Sgemm.
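For contrast with cublasHgemm, here is a sketch of what that definition looks like in code: FP16 storage but FP32 arithmetic, so alpha and beta are floats and the accumulation happens in single precision. This uses the CUDA 8-style `cudaDataType` enums; the wrapper name is mine, not from cuBLAS:

```cuda
#include <cuda_fp16.h>
#include <cublas_v2.h>

// SgemmEx: A, B, C are stored as FP16 (CUDA_R_16F), but the GEMM
// itself is still computed in FP32, matching the quoted definition.
void half_storage_float_math(cublasHandle_t handle, int n,
                             const __half *A, const __half *B, __half *C) {
    const float alpha = 1.0f, beta = 0.0f;   // note: float, not __half
    cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n, n, n,
                  &alpha,
                  A, CUDA_R_16F, n,
                  B, CUDA_R_16F, n,
                  &beta,
                  C, CUDA_R_16F, n);
}
```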

Amazing, I had never seen HGEMM. Apparently it came out with CUDA 7.5!

Yes, I would not expect SGEMMex to be changed now that there is an HGEMM.

Sorry for doubting you Nvidia!

How do we know if the GTX 1080 supports HGEMM? Documentation is sparse, to say the least!
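One way to check from code rather than the docs: query the compute capability at runtime. Native FP16 arithmetic first appears at compute capability 5.3, and a GTX 1080 reports 6.1, so HGEMM will run there; whether it runs *fast* is a separate question (see the next reply). A small sketch:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Half-precision math requires compute capability >= 5.3.
    // Note this only means the instructions exist, not that the
    // FP16 rate is competitive with FP32 on that part.
    bool fp16_math = (prop.major > 5) || (prop.major == 5 && prop.minor >= 3);
    printf("%s: sm_%d%d, native FP16 math: %s\n",
           prop.name, prop.major, prop.minor, fp16_math ? "yes" : "no");
    return 0;
}
```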

No point using HGEMM now. HGEMM is only useful on the unreleased Tesla cards, because the GTX 1080 runs FP16 at 1/64 of its FP32 rate.

Your only option is to stick with SGEMMex with half-precision inputs.

I just tried out FP16 on a P100 a little here:

Has anyone else benchmarked this performance? I wonder whether there is a standard tool for this test. Is cublasHgemm representative enough?

GEMM on large (e.g. 8K x 8K), square matrices is usually a good indicator of peak floating-point throughput. But even a GEMM implementation that has been carefully crafted in native assembly language is unlikely to achieve more than 90% of the theoretical peak; for compiled versions it would typically be more like 70-75%. Note that throughput may differ somewhat by transpose mode; often the “NT” variant (i.e., no transpose on matrix A, transpose on matrix B) is the fastest.
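A rough way to run that check yourself: time an 8K x 8K “NT” HGEMM with CUDA events and divide the GEMM flop count (2·n³ for a multiply-accumulate per output element) by the elapsed time. This sketch assumes the handle and device buffers already exist, and the peak number you compare against comes from your card’s datasheet (e.g. roughly 21 FP16 TFLOPS claimed for P100 at boost clock), it is not computed here:

```cuda
#include <cuda_fp16.h>
#include <cublas_v2.h>
#include <cstdio>

// Time one large NT HGEMM and report achieved vs. claimed throughput.
void bench_hgemm(cublasHandle_t handle, __half *A, __half *B, __half *C,
                 double peak_tflops /* from the datasheet, an input here */) {
    const int n = 8192;
    const __half alpha = __float2half(1.0f), beta = __float2half(0.0f);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up call so one-off initialization is not timed.
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);

    cudaEventRecord(start);
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double flops  = 2.0 * n * n * n;          // one FMA per inner-product step
    double tflops = flops / (ms * 1e-3) / 1e12;
    printf("NT HGEMM %dx%d: %.2f ms, %.2f TFLOPS (%.0f%% of claimed peak)\n",
           n, n, ms, tflops, 100.0 * tflops / peak_tflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```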