Hi,
I have got some results on an A100 with cuBLAS and cuBLASLt that seem strange to me, and I do not understand why they happen. I would be glad if anyone could help me interpret and analyse them.
I have two dense N×K matrices X and pX, stored in double precision, with N = 2.1M and K = 50, and I compute G = X^T * pX. When I use cublasDgemm on the A100, the execution time is around 28 ms. When I convert the data from double to float and use cublasSgemm to compute G, the execution time is around 40 ms.
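For reference, this is roughly how I issue the double-precision call (simplified, error checking omitted; the float version just swaps in cublasSgemm and float pointers, and the handle/pointer names here are placeholders):

```cpp
#include <cublas_v2.h>

// dX, dpX: device pointers to column-major N x K matrices
// dG:      device pointer to the K x K result
void gemm_double(cublasHandle_t handle,
                 const double* dX, const double* dpX, double* dG,
                 int N, int K)
{
    const double alpha = 1.0, beta = 0.0;
    // G (K x K) = X^T (K x N) * pX (N x K)
    cublasDgemm(handle,
                CUBLAS_OP_T, CUBLAS_OP_N,
                K, K, N,
                &alpha,
                dX,  N,   // lda = N (X stored as N x K)
                dpX, N,   // ldb = N
                &beta,
                dG,  K);  // ldc = K
}
```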
I also tried cublasGemmEx. When the computeType is CUBLAS_COMPUTE_64F and X, pX and G are CUDA_R_64F, the execution time is around 28 ms. When the computeType is CUBLAS_COMPUTE_32F and X, pX and G are CUDA_R_32F, the execution time is around 41 ms. When the computeType is CUBLAS_COMPUTE_32F_FAST_TF32 or CUBLAS_COMPUTE_32F_FAST_16BF and X, pX and G are CUDA_R_32F, the execution time is around 38.6 ms.
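This is roughly the cublasGemmEx call I use for the FP32 cases (simplified; switching the computeType argument is the only difference between the variants I timed, and the 64F variant uses double scalars with CUDA_R_64F types; the names here are placeholders):

```cpp
#include <cublas_v2.h>

void gemm_ex_float(cublasHandle_t handle,
                   const float* dX, const float* dpX, float* dG,
                   int N, int K,
                   cublasComputeType_t computeType) // CUBLAS_COMPUTE_32F, _32F_FAST_TF32, ...
{
    const float alpha = 1.0f, beta = 0.0f;
    // G (K x K) = X^T (K x N) * pX (N x K), all stored as CUDA_R_32F
    cublasGemmEx(handle,
                 CUBLAS_OP_T, CUBLAS_OP_N,
                 K, K, N,
                 &alpha,
                 dX,  CUDA_R_32F, N,
                 dpX, CUDA_R_32F, N,
                 &beta,
                 dG,  CUDA_R_32F, K,
                 computeType,
                 CUBLAS_GEMM_DEFAULT);
}
```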
Similarly, for cublasLtMatmul from cuBLASLt: when I set the computeType to CUBLAS_COMPUTE_64F and X, pX and G are CUDA_R_64F, the execution time is around 4 ms. When I convert the data from double to float and set the computeType to CUBLAS_COMPUTE_32F with X, pX and G as CUDA_R_32F, the execution time becomes around 8 ms. What is even stranger is that when I set the computeType to CUBLAS_COMPUTE_32F_FAST_TF32 with X, pX and G as CUDA_R_32F, the execution time becomes around 16 ms. I use 2 MB for the workspace and use std::chrono::high_resolution_clock::now() to measure the start and end times.
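And this is roughly how I set up and call cublasLtMatmul for the FP32 cases (simplified, error checking omitted; the 64F variant uses CUBLAS_COMPUTE_64F, CUDA_R_64F layouts and double scalars; the names here are placeholders):

```cpp
#include <cublasLt.h>

void lt_matmul_float(cublasLtHandle_t ltHandle,
                     const float* dX, const float* dpX, float* dG,
                     int N, int K,
                     void* workspace, size_t workspaceSize,   // 2 MB in my runs
                     cublasComputeType_t computeType)         // CUBLAS_COMPUTE_32F or _32F_FAST_TF32
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasOperation_t transA = CUBLAS_OP_T, transB = CUBLAS_OP_N;

    // Operation descriptor: compute type, scale type, and the transposes
    cublasLtMatmulDesc_t opDesc;
    cublasLtMatmulDescCreate(&opDesc, computeType, CUDA_R_32F);
    cublasLtMatmulDescSetAttribute(opDesc, CUBLASLT_MATMUL_DESC_TRANSA, &transA, sizeof(transA));
    cublasLtMatmulDescSetAttribute(opDesc, CUBLASLT_MATMUL_DESC_TRANSB, &transB, sizeof(transB));

    // Layouts describe the matrices as stored: X and pX are N x K, G is K x K
    cublasLtMatrixLayout_t aDesc, bDesc, cDesc;
    cublasLtMatrixLayoutCreate(&aDesc, CUDA_R_32F, N, K, N);
    cublasLtMatrixLayoutCreate(&bDesc, CUDA_R_32F, N, K, N);
    cublasLtMatrixLayoutCreate(&cDesc, CUDA_R_32F, K, K, K);

    // algo = nullptr lets cuBLASLt pick an algorithm internally
    cublasLtMatmul(ltHandle, opDesc,
                   &alpha, dX, aDesc, dpX, bDesc,
                   &beta,  dG, cDesc, dG, cDesc,
                   nullptr, workspace, workspaceSize, /*stream=*/0);

    cublasLtMatrixLayoutDestroy(cDesc);
    cublasLtMatrixLayoutDestroy(bDesc);
    cublasLtMatrixLayoutDestroy(aDesc);
    cublasLtMatmulDescDestroy(opDesc);
}
```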
Could you please help me find my mistake and explain why I get these results? Are these results correct? I expected TF32 to be faster than float and float to be faster than double, but the results all go in the opposite direction. Could it be related to the sizes of the matrices? Thank you.
Kind regards,