fp32 sgemm and fp16 hgemm

After replacing fp32 sgemm with fp16 hgemm in a forward function, I see only a 16% speed gain in that function (165.50us down to 142.57us average per call in the nvprof output below).
How can I program one fp16 hgemm call to do the work of two sgemm calls?
I hope this could halve the number of calls and double the speed, as in typical SIMD programming.

nvprof results:

Time(%)      Time     Calls       Avg       Min       Max  Name
  0.06%  28.513ms       200  142.57us  139.27us  146.62us  void magma_hgemm_kernel<__half, __half, bool=1, bool=0, int=5, int=4, int=5, int=3, int=3, bool=0>(int, int, int, __half const *, int, __half const *, int, __half*, int, int, int, __half const *, __half const *, __half, __half, int)

Time(%)      Time     Calls       Avg       Min       Max  Name
  0.07%  33.101ms       200  165.50us  160.73us  170.26us  void magma_sgemmEx_kernel<__half, __half, bool=1, bool=0, int=6, int=4, int=6, int=3, int=4, bool=0>(int, int, int, void const *, int, void const *, int, void*, int, int, int, float const *, float const *, float, float, int)

My sample code:

#ifdef __FP16_PSEUDO__
	#define CUDA_GEMM  cublasSgemmEx  /* fp16 storage, fp32 compute */
#endif

#ifdef __FP16_NATIVE__
	#define CUDA_GEMM  cublasHgemm    /* fp16 storage and fp16 compute */
#endif

#define FLOAT half
#define INT   short

forward(
){
	int M = ins[0]->N;                             /* batch size          */
	int K = ins[0]->C * ins[0]->H * ins[0]->W;     /* input feature size  */
	int N = outs[0]->C * outs[0]->H * outs[0]->W;  /* output feature size */

	FLOAT *ks = params[0]->data;  /* weights */
	FLOAT *xs = ins[0]->data;     /* input   */
	FLOAT *ys = outs[0]->data;    /* output  */

	/* cublasHgemm takes half alpha/beta; cublasSgemmEx takes float. */
#ifdef __FP16_NATIVE__
	const half ONE  = hone();
	const half ZERO = hzero();
#else
	const float ONE  = 1;
	const float ZERO = 0;
#endif

	/* cublasSgemmEx additionally takes a data-type argument per matrix. */
	ASSERT(0 == CUDA_GEMM(
		CUBLAS_HANDLER,
		CUBLAS_OP_T,
		CUBLAS_OP_N,
		N, M, K,
		&ONE,
		ks,
#ifndef __FP16_NATIVE__
		CUBLAS_FLOAT,  /* Atype */
#endif
		K,
		xs,
#ifndef __FP16_NATIVE__
		CUBLAS_FLOAT,  /* Btype */
#endif
		K,
		&ZERO,
		ys,
#ifndef __FP16_NATIVE__
		CUBLAS_FLOAT,  /* Ctype */
#endif
		N
	));

}