SGEMM FP16 compute?

Hi, currently SGEMMex partially supports FP16, in that it will accept inputs and outputs as FP16, but it performs the internal computation in FP32. The Pascal P100 is advertised as having twice the FP16 performance of FP32. Has NVIDIA updated SGEMMex to support FP16 operations yet? I cannot find any mention of how to do this; it seems to appear only in marketing papers. So does it really exist, or is NVIDIA marketing getting ahead of reality again?

hgemm:

http://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-gemm

If you try to run hgemm on a device that does not support it, you will get:

CUBLAS_STATUS_ARCH_MISMATCH

“the device does not support double-precision or in the case of cublasHgemm the device does not support math in half precision”
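A minimal sketch of guarding an hgemm call on that return status (the matrix size and everything else here is illustrative, not from the docs):

```cuda
#include <cuda_fp16.h>
#include <cublas_v2.h>
#include <cstdio>

int main() {
    const int n = 1024;                       // illustrative square size
    cublasHandle_t handle;
    cublasCreate(&handle);

    __half *A, *B, *C;
    cudaMalloc(&A, n * n * sizeof(__half));
    cudaMalloc(&B, n * n * sizeof(__half));
    cudaMalloc(&C, n * n * sizeof(__half));

    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);

    cublasStatus_t st = cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                    n, n, n,
                                    &alpha, A, n, B, n,
                                    &beta,  C, n);
    if (st == CUBLAS_STATUS_ARCH_MISMATCH)
        printf("this device has no native FP16 math path\n");
    else if (st != CUBLAS_STATUS_SUCCESS)
        printf("cublasHgemm failed with status %d\n", (int)st);

    cudaFree(A); cudaFree(B); cudaFree(C);
    cublasDestroy(handle);
    return 0;
}
```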

Since the current definition of Sgemmex is:

“This function is an extension of cublasSgemm where the input matrices and output matrices can have a lower precision but the computation is still done in float”

I’m not sure you’ll ever see SgemmEx modified to perform the calculation at half precision. That would more or less be contrary to the S in Sgemm.
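For contrast with cublasHgemm, here is a sketch of what that definition looks like in code: FP16 storage but FP32 arithmetic, so alpha and beta are floats and the accumulation happens in single precision. This uses the CUDA 8-style `cudaDataType` enums; the wrapper name is mine, not from cuBLAS:

```cuda
#include <cuda_fp16.h>
#include <cublas_v2.h>

// SgemmEx: A, B, C are stored as FP16 (CUDA_R_16F), but the GEMM
// itself is still computed in FP32, matching the quoted definition.
void half_storage_float_math(cublasHandle_t handle, int n,
                             const __half *A, const __half *B, __half *C) {
    const float alpha = 1.0f, beta = 0.0f;   // note: float, not __half
    cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n, n, n,
                  &alpha,
                  A, CUDA_R_16F, n,
                  B, CUDA_R_16F, n,
                  &beta,
                  C, CUDA_R_16F, n);
}
```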

Amazing, I had never seen HGEMM. Apparently it came out with CUDA 7.5!

Yes, I would not expect SGEMMex to be changed now that there is an HGEMM.

Sorry for doubting you Nvidia!

How do we know if the GTX 1080 supports HGEMM? Documentation is sparse, to say the least!
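One way to check from code rather than the docs: query the compute capability at runtime. Native FP16 arithmetic first appears at compute capability 5.3, and a GTX 1080 reports 6.1, so HGEMM will run there; whether it runs *fast* is a separate question (see the next reply). A small sketch:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Half-precision math requires compute capability >= 5.3.
    // Note this only means the instructions exist, not that the
    // FP16 rate is competitive with FP32 on that part.
    bool fp16_math = (prop.major > 5) || (prop.major == 5 && prop.minor >= 3);
    printf("%s: sm_%d%d, native FP16 math: %s\n",
           prop.name, prop.major, prop.minor, fp16_math ? "yes" : "no");
    return 0;
}
```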

No point using HGEMM now. HGEMM is only useful on the unreleased Tesla cards, because the GTX 1080 runs FP16 at 1/64 of its FP32 rate.

Your only option is to stick with SGEMMex with half-precision inputs.

I just tried out FP16 on a P100 a little here:

Has anyone else benchmarked this performance? I wonder whether there is a standard tool for this test. Is cublasHgemm representative enough?

GEMM on large (e.g. 8K x 8K), square matrices is usually a good indicator of peak floating-point throughput. But even a GEMM implementation that has been carefully crafted in native assembly language is unlikely to achieve more than 90% of the theoretical peak; for compiled versions it would typically be more like 70-75%. Note that throughput may differ somewhat by transpose mode; often the “NT” variant (i.e., no transpose on matrix A, transpose on matrix B) is the fastest.
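A rough way to run that check yourself: time an 8K x 8K “NT” HGEMM with CUDA events and divide the GEMM flop count (2·n³ for a multiply-accumulate per output element) by the elapsed time. This sketch assumes the handle and device buffers already exist, and the peak number you compare against comes from your card’s datasheet (e.g. roughly 21 FP16 TFLOPS claimed for P100 at boost clock), it is not computed here:

```cuda
#include <cuda_fp16.h>
#include <cublas_v2.h>
#include <cstdio>

// Time one large NT HGEMM and report achieved vs. claimed throughput.
void bench_hgemm(cublasHandle_t handle, __half *A, __half *B, __half *C,
                 double peak_tflops /* from the datasheet, an input here */) {
    const int n = 8192;
    const __half alpha = __float2half(1.0f), beta = __float2half(0.0f);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up call so one-off initialization is not timed.
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);

    cudaEventRecord(start);
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double flops  = 2.0 * n * n * n;          // one FMA per inner-product step
    double tflops = flops / (ms * 1e-3) / 1e12;
    printf("NT HGEMM %dx%d: %.2f ms, %.2f TFLOPS (%.0f%% of claimed peak)\n",
           n, n, ms, tflops, 100.0 * tflops / peak_tflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```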