Why is cublasHgemm slower than cublasSgemm for small matrices?

float16; size 2 average: 1.74688e-05 s
float16; size 4 average: 1.09478e-05 s
float16; size 8 average: 1.2503e-05 s
float16; size 16 average: 1.40813e-05 s
float16; size 32 average: 2.8359e-05 s
float16; size 64 average: 2.8888e-05 s
float16; size 128 average: 3.22976e-05 s
float16; size 256 average: 3.71114e-05 s
float16; size 512 average: 7.82048e-05 s
float16; size 1024 average: 0.00015296 s
float16; size 2048 average: 0.000850379 s
float16; size 4096 average: 0.00551006 s
float16; size 8192 average: 0.0416439 s
float16; size 16384 average: 0.327726 s
float16; size 32768 average: 2.85781 s

float32; size 2 average: 1.26912e-05 s
float32; size 4 average: 8.088e-06 s
float32; size 8 average: 8.12896e-06 s
float32; size 16 average: 1.2599e-05 s
float32; size 32 average: 1.2537e-05 s
float32; size 64 average: 1.32061e-05 s
float32; size 128 average: 1.02701e-05 s
float32; size 256 average: 1.40413e-05 s
float32; size 512 average: 3.95216e-05 s
float32; size 1024 average: 0.000205253 s
float32; size 2048 average: 0.00137492 s
float32; size 4096 average: 0.012989 s
float32; size 8192 average: 0.0829209 s
float32; size 16384 average: 0.655603 s
float32; size 32768 average: 6.79938 s

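For sizes up to 512, float16 (cublasHgemm) takes longer than float32 (cublasSgemm). Only when the size is larger than 512 does float16 take less time than float32.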
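
To make the comparison easier to read, here is a small sketch that computes the float32/float16 time ratio for each size and finds where float16 starts to win. The numbers are copied from the timings above; the helper names (`speedups`, `crossover`) are made up for illustration and are not part of cuBLAS.

```python
# Average times in seconds, copied from the measurements above.
sizes = [2 ** k for k in range(1, 16)]  # 2, 4, ..., 32768

fp16 = [1.74688e-05, 1.09478e-05, 1.2503e-05, 1.40813e-05, 2.8359e-05,
        2.8888e-05, 3.22976e-05, 3.71114e-05, 7.82048e-05, 0.00015296,
        0.000850379, 0.00551006, 0.0416439, 0.327726, 2.85781]

fp32 = [1.26912e-05, 8.088e-06, 8.12896e-06, 1.2599e-05, 1.2537e-05,
        1.32061e-05, 1.02701e-05, 1.40413e-05, 3.95216e-05, 0.000205253,
        0.00137492, 0.012989, 0.0829209, 0.655603, 6.79938]


def speedups(fp16_times, fp32_times):
    """Ratio float32 time / float16 time; a value above 1 means float16 is faster."""
    return [t32 / t16 for t16, t32 in zip(fp16_times, fp32_times)]


def crossover(sizes, ratios):
    """Smallest size from which float16 is faster at that size and every larger one."""
    for i, n in enumerate(sizes):
        if all(r > 1.0 for r in ratios[i:]):
            return n
    return None


ratios = speedups(fp16, fp32)
for n, r in zip(sizes, ratios):
    winner = "float16" if r > 1.0 else "float32"
    print(f"size {n:>6}: float32/float16 = {r:.2f} ({winner} faster)")

print("float16 is consistently faster from size", crossover(sizes, ratios))
```

With these measurements the crossover is at size 1024: every size from 2 to 512 has a ratio below 1, and every size from 1024 up has a ratio above 1.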