After replacing fp32 sgemm to fp16 hgemm in a forward function, I only have 16% speed gain in the function.
How to program one fp16 hgemm call to perform tasks equivalent to two sgemm call?
I hope this can halve number of calls and double speed gain, as in typical SIMD programming.
nvprof results.
Time(%) Time Calls Avg Min Max Name
0.06% 28.513ms 200 142.57us 139.27us 146.62us void magma_hgemm_kernel<__half, __half, bool=1, bool=0, int=5, int=4, int=5, int=3, int=3, bool=0>(int, int, int, __half const *, int, __half const *, int, __half*, int, int, int, __half const *, __half const *, __half, __half, int)
Time(%) Time Calls Avg Min Max Name
0.07% 33.101ms 200 165.50us 160.73us 170.26us void magma_sgemmEx_kernel<__half, __half, bool=1, bool=0, int=6, int=4, int=6, int=3, int=4, bool=0>(int, int, int, void const *, int, void const *, int, void*, int, int, int, float const *, float const *, float, float, int)
my sample code
#ifdef __FP16_PSEUDO__
#define CUDA_GEMM cublasSgemmEx
#endif
#ifdef __FP16_NATIVE__
#define CUDA_GEMM cublasHgemm
#endif
#define FLOAT half
#define INT short
forward(
){
int M = ins[0]->N;
int K = ins[0]->C*ins[0]->H*ins[0]->W;
int N = outs[0]->C*outs[0]->H*outs[0]->W;
FLOAT *ks = params[0]->data;
FLOAT *xs = ins[0]->data;
FLOAT *ys = outs[0]->data;
#ifdef __FP16_NATIVE__
const half ONE = hone();
const half ZERO = hzero();
#else
const float ONE = 1;
const float ZERO = 0;
#endif
ASSERT(0 == CUDA_GEMM(
CUBLAS_HANDLER,
CUBLAS_OP_T,
CUBLAS_OP_N,
N, M, K,
&ONE,
ks,
#ifndef __FP16_NATIVE__
CUBLAS_FLOAT,
#endif
K,
xs,
#ifndef __FP16_NATIVE__
CUBLAS_FLOAT,
#endif
K,
&ZERO,
ys,
#ifndef __FP16_NATIVE__
CUBLAS_FLOAT,
#endif
N
));
}