fp32 sgemm and fp16 hgemm

After replacing fp32 sgemm with fp16 hgemm in a forward function, I see only a 16% speed gain in that function (165.50us down to 142.57us average per call in the nvprof output below).
How can I program one fp16 hgemm call to do the work of two sgemm calls?
I hope this could halve the number of calls and double the speed, as in typical SIMD programming.

nvprof results:

Time(%)      Time     Calls       Avg       Min       Max  Name
  0.06%  28.513ms       200  142.57us  139.27us  146.62us  void magma_hgemm_kernel<__half, __half, bool=1, bool=0, int=5, int=4, int=5, int=3, int=3, bool=0>(int, int, int, __half const *, int, __half const *, int, __half*, int, int, int, __half const *, __half const *, __half, __half, int)

Time(%)      Time     Calls       Avg       Min       Max  Name
  0.07%  33.101ms       200  165.50us  160.73us  170.26us  void magma_sgemmEx_kernel<__half, __half, bool=1, bool=0, int=6, int=4, int=6, int=3, int=4, bool=0>(int, int, int, void const *, int, void const *, int, void*, int, int, int, float const *, float const *, float, float, int)

My sample code:

#ifdef __FP16_PSEUDO__
	#define CUDA_GEMM  cublasSgemmEx  /* fp16 storage, fp32 compute */
#endif

#ifdef __FP16_NATIVE__
	#define CUDA_GEMM  cublasHgemm    /* fp16 storage and fp16 compute */
#endif

#define FLOAT half
#define INT   short

forward(
){
	int M = ins[0]->N;                             /* batch size          */
	int K = ins[0]->C * ins[0]->H * ins[0]->W;     /* input feature size  */
	int N = outs[0]->C * outs[0]->H * outs[0]->W;  /* output feature size */

	FLOAT *ks = params[0]->data;  /* weights */
	FLOAT *xs = ins[0]->data;     /* input   */
	FLOAT *ys = outs[0]->data;    /* output  */

	/* cublasHgemm takes half alpha/beta; cublasSgemmEx takes float. */
#ifdef __FP16_NATIVE__
	const half ONE  = hone();
	const half ZERO = hzero();
#else
	const float ONE  = 1;
	const float ZERO = 0;
#endif

	/* cublasSgemmEx additionally takes a data-type argument per matrix. */
	ASSERT(0 == CUDA_GEMM(
		CUBLAS_HANDLER,
		CUBLAS_OP_T,
		CUBLAS_OP_N,
		N, M, K,
		&ONE,
		ks,
#ifndef __FP16_NATIVE__
		CUBLAS_FLOAT,  /* Atype */
#endif
		K,
		xs,
#ifndef __FP16_NATIVE__
		CUBLAS_FLOAT,  /* Btype */
#endif
		K,
		&ZERO,
		ys,
#ifndef __FP16_NATIVE__
		CUBLAS_FLOAT,  /* Ctype */
#endif
		N
	));

}