cublasHgemm is not faster than cublasSgemm on 2080Ti

I am testing cublasHgemm on a 2080Ti. According to the product docs, the 2080Ti has a fast FP16 mode that should be 2x faster than FP32, but when I run my benchmark on the 2080Ti, FP16 is not faster. The benchmark app was compiled with CUDA 10.2 on a 1080Ti machine and then run on the 2080Ti. I added the nvcc flag -arch=sm_75.

According to my tests, the matrices need to be very large before FP16 becomes faster than FP32.

Can you provide a reproducer and results?