Multiplying FP16 large matrices with cublasLtMatmul on RTX 3070 Ti and V100

Hi all,

I have two matrices, A (500 x 5.1M) and B (5.1M x 500), in FP16. I multiply them with cublasLtMatmul from cuBLASLt on an NVIDIA V100. When I put the cublasLtMatmul(A,B) call in a loop, the first iteration takes around 1570 ms and the following iterations take around 102 ms, so the first call is almost 15x slower. If I run it only once, I get no speedup at all. Why does this happen? When I repeat the same experiment with TF32, the first iteration takes around 479 ms and the second and third take around 454 ms.

I have also tried A (50 x 2.1M) and B (2.1M x 50) in FP16, again multiplying with cublasLtMatmul, this time on the NVIDIA RTX 3070 Ti in my laptop. In a loop, the first cublasLtMatmul(A,B) call takes around 56 ms and the following iterations take around 1 ms, so the first call is almost 56x slower. With TF32, the first iteration takes around 4.5 ms and the second and third take around 2.5 ms.
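For reference, the loop I am timing looks roughly like this. It is a minimal sketch of the V100/FP16 case, not my exact code: the dimensions, descriptor setup, and the choice to leave the algorithm and workspace arguments at their defaults are illustrative, and error checking plus data initialization are omitted.

```cuda
#include <cublasLt.h>
#include <cuda_fp16.h>
#include <cstdio>

int main() {
    const int m = 500, n = 500;
    const long long k = 5100000;   // A is m x k, B is k x n

    cublasLtHandle_t lt;
    cublasLtCreate(&lt);

    // FP16 in/out with FP16 compute; the TF32 runs use CUDA_R_32F layouts
    // with CUBLAS_COMPUTE_32F_FAST_TF32 instead.
    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_16F, CUDA_R_16F);

    cublasLtMatrixLayout_t Adesc, Bdesc, Cdesc;                 // column-major
    cublasLtMatrixLayoutCreate(&Adesc, CUDA_R_16F, m, k, m);
    cublasLtMatrixLayoutCreate(&Bdesc, CUDA_R_16F, k, n, k);
    cublasLtMatrixLayoutCreate(&Cdesc, CUDA_R_16F, m, n, m);

    __half *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(__half) * (size_t)m * k);
    cudaMalloc(&dB, sizeof(__half) * (size_t)k * n);
    cudaMalloc(&dC, sizeof(__half) * (size_t)m * n);

    __half alpha = __float2half(1.0f), beta = __float2half(0.0f);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int iter = 0; iter < 3; ++iter) {
        cudaEventRecord(start);
        cublasLtMatmul(lt, op, &alpha, dA, Adesc, dB, Bdesc,
                       &beta, dC, Cdesc, dC, Cdesc,
                       nullptr,      // algo: let the library pick
                       nullptr, 0,   // no workspace
                       0);           // default stream
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // wait for the GPU before reading the time
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("iteration %d: %.1f ms\n", iter, ms);
    }
    return 0;
}
```

The 1570 ms / 102 ms pattern above is the per-iteration output of this loop: only the first pass through cublasLtMatmul shows the large cost.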

Thank you