Hi,
I have got some results on an A100 with cuBLAS and cuBLASLt that seem strange to me, and I do not understand why they happen. I would be glad if anyone could help me interpret and analyse them.
I have two dense N×K matrices X and pX, stored in double precision, with N = 2.1M and K = 50, and I compute G = X^T * pX. When I use cublasDgemm on the A100, the execution time is around 28 ms. When I convert the data from double to float and use cublasSgemm to compute G, the execution time is around 40 ms.
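For reference, this is roughly how I issue the double-precision call (simplified, error checking omitted; the float version just swaps in cublasSgemm and float pointers, and the handle/pointer names here are placeholders):

```cpp
#include <cublas_v2.h>

// dX, dpX: device pointers to column-major N x K matrices
// dG:      device pointer to the K x K result
void gemm_double(cublasHandle_t handle,
                 const double* dX, const double* dpX, double* dG,
                 int N, int K)
{
    const double alpha = 1.0, beta = 0.0;
    // G (K x K) = X^T (K x N) * pX (N x K)
    cublasDgemm(handle,
                CUBLAS_OP_T, CUBLAS_OP_N,
                K, K, N,
                &alpha,
                dX,  N,   // lda = N (X stored as N x K)
                dpX, N,   // ldb = N
                &beta,
                dG,  K);  // ldc = K
}
```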
I also tried cublasGemmEx. When the computeType is CUBLAS_COMPUTE_64F and X, pX and G are CUDA_R_64F, the execution time is around 28 ms. When the computeType is CUBLAS_COMPUTE_32F and X, pX and G are CUDA_R_32F, the execution time is around 41 ms. When the computeType is CUBLAS_COMPUTE_32F_FAST_TF32 or CUBLAS_COMPUTE_32F_FAST_16BF and X, pX and G are CUDA_R_32F, the execution time is around 38.6 ms.
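This is roughly the cublasGemmEx call I use for the FP32 cases (simplified; switching the computeType argument is the only difference between the variants I timed, and the 64F variant uses double scalars with CUDA_R_64F types; the names here are placeholders):

```cpp
#include <cublas_v2.h>

void gemm_ex_float(cublasHandle_t handle,
                   const float* dX, const float* dpX, float* dG,
                   int N, int K,
                   cublasComputeType_t computeType) // CUBLAS_COMPUTE_32F, _32F_FAST_TF32, ...
{
    const float alpha = 1.0f, beta = 0.0f;
    // G (K x K) = X^T (K x N) * pX (N x K), all stored as CUDA_R_32F
    cublasGemmEx(handle,
                 CUBLAS_OP_T, CUBLAS_OP_N,
                 K, K, N,
                 &alpha,
                 dX,  CUDA_R_32F, N,
                 dpX, CUDA_R_32F, N,
                 &beta,
                 dG,  CUDA_R_32F, K,
                 computeType,
                 CUBLAS_GEMM_DEFAULT);
}
```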
Similarly, for cublasLtMatmul from cuBLASLt: when I set the computeType to CUBLAS_COMPUTE_64F and X, pX and G are CUDA_R_64F, the execution time is around 4 ms. When I convert the data from double to float and set the computeType to CUBLAS_COMPUTE_32F with X, pX and G as CUDA_R_32F, the execution time becomes around 8 ms. What is even stranger is that when I set the computeType to CUBLAS_COMPUTE_32F_FAST_TF32 with X, pX and G as CUDA_R_32F, the execution time becomes around 16 ms. I use 2 MB for the workspace and use std::chrono::high_resolution_clock::now() to measure the start and end times.
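And this is roughly how I set up and call cublasLtMatmul for the FP32 cases (simplified, error checking omitted; the 64F variant uses CUBLAS_COMPUTE_64F, CUDA_R_64F layouts and double scalars; the names here are placeholders):

```cpp
#include <cublasLt.h>

void lt_matmul_float(cublasLtHandle_t ltHandle,
                     const float* dX, const float* dpX, float* dG,
                     int N, int K,
                     void* workspace, size_t workspaceSize,   // 2 MB in my runs
                     cublasComputeType_t computeType)         // CUBLAS_COMPUTE_32F or _32F_FAST_TF32
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasOperation_t transA = CUBLAS_OP_T, transB = CUBLAS_OP_N;

    // Operation descriptor: compute type, scale type, and the transposes
    cublasLtMatmulDesc_t opDesc;
    cublasLtMatmulDescCreate(&opDesc, computeType, CUDA_R_32F);
    cublasLtMatmulDescSetAttribute(opDesc, CUBLASLT_MATMUL_DESC_TRANSA, &transA, sizeof(transA));
    cublasLtMatmulDescSetAttribute(opDesc, CUBLASLT_MATMUL_DESC_TRANSB, &transB, sizeof(transB));

    // Layouts describe the matrices as stored: X and pX are N x K, G is K x K
    cublasLtMatrixLayout_t aDesc, bDesc, cDesc;
    cublasLtMatrixLayoutCreate(&aDesc, CUDA_R_32F, N, K, N);
    cublasLtMatrixLayoutCreate(&bDesc, CUDA_R_32F, N, K, N);
    cublasLtMatrixLayoutCreate(&cDesc, CUDA_R_32F, K, K, K);

    // algo = nullptr lets cuBLASLt pick an algorithm internally
    cublasLtMatmul(ltHandle, opDesc,
                   &alpha, dX, aDesc, dpX, bDesc,
                   &beta,  dG, cDesc, dG, cDesc,
                   nullptr, workspace, workspaceSize, /*stream=*/0);

    cublasLtMatrixLayoutDestroy(cDesc);
    cublasLtMatrixLayoutDestroy(bDesc);
    cublasLtMatrixLayoutDestroy(aDesc);
    cublasLtMatmulDescDestroy(opDesc);
}
```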
Could you please help me find my mistake and explain why I get these results? Are these results correct? I expected TF32 to be faster than float and float to be faster than double, but the results all go in the opposite direction. Could it be related to the sizes of the matrices? Thank you.
Kind regards,