cublasDgemm vs. cublasSgemm

Hi,

I have an N*K dense matrix X in double precision. I compute G = X^T * X using cublasDgemm on an A100. For N = 5.1M and K = 257K, the execution time is around 34 ms. When I convert X from double to float and use cublasSgemm to compute G, the execution time is around 44 ms. I expected the float version to be faster. Why does this happen? Doesn't cublasSgemm use Tensor Cores?
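For reference, here is a stripped-down version of my double-precision call (placeholder sizes so it runs anywhere, error checking omitted; d_X and d_G are just my names for the device buffers):

#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int N = 4096, K = 256;                 // placeholder sizes, not my real dimensions
    double *d_X, *d_G;
    cudaMalloc(&d_X, sizeof(double) * N * K);    // X is N x K, column-major
    cudaMalloc(&d_G, sizeof(double) * K * K);    // G = X^T * X is K x K

    cublasHandle_t handle;
    cublasCreate(&handle);

    const double alpha = 1.0, beta = 0.0;
    // G = alpha * X^T * X + beta * G
    cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                K, K, N,
                &alpha, d_X, N,
                        d_X, N,
                &beta,  d_G, K);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(d_X);
    cudaFree(d_G);
    return 0;
}

The float version is identical except for float buffers and cublasSgemm.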

Regards,

No, it does not. There are currently no pure-FP32 paths in the Tensor Core units of any GPU.

I don’t know for sure that this is the source of the difference, but note that on A100 the peak FP64 Tensor Core throughput is 19.5 TFLOPS, and the peak FP32 (non-Tensor-Core) throughput is also 19.5 TFLOPS, so on a compute-throughput basis there is no reason to expect the FP32 GEMM to be faster.
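If reduced precision is acceptable, you can opt an FP32 GEMM into the TF32 Tensor Core path (the A100 datasheet lists TF32 peak at 156 TFLOPS dense) by setting the math mode on the cuBLAS handle. A minimal sketch, assuming CUDA 11+ (placeholder sizes, error checking omitted):

#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int N = 4096, K = 256;                // placeholder sizes
    float *d_X, *d_G;
    cudaMalloc(&d_X, sizeof(float) * N * K);    // X is N x K, column-major
    cudaMalloc(&d_G, sizeof(float) * K * K);    // G = X^T * X is K x K

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Opt in to TF32: cuBLAS may then dispatch FP32 GEMMs to Tensor Cores.
    // TF32 keeps the FP32 exponent range but rounds the mantissa to 10 bits,
    // so this is a reduced-precision path, not a pure-FP32 one.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    // G = alpha * X^T * X + beta * G
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                K, K, N,
                &alpha, d_X, N,
                        d_X, N,
                &beta,  d_G, K);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(d_X);
    cudaFree(d_G);
    return 0;
}

Whether cuBLAS actually takes the TF32 path depends on the problem shape, so verify with a profiler.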

Thank you very much for your reply.