I am testing cublasHgemm on a 2080 Ti. According to the product docs, the 2080 Ti has a fast fp16 mode that should be 2x faster than fp32, but when I run it on the 2080 Ti, fp16 is not faster. The benchmark app was compiled on a 1080 Ti machine with CUDA 10.2 and then run on the 2080 Ti; I added the nvcc flag -arch=sm_75.
According to my tests, the matrices need to be very large before fp16 becomes faster than fp32.
Can you provide a reproducer and results?