Hi,
I have an N×K dense double-precision matrix X, and I compute G = Xᵀ·X using cublasDgemm on an A100. For N = 5.1M and K = 257K, the execution time is around 34 ms. When I convert X from double to float and compute G with cublasSgemm instead, the execution time is around 44 ms. I expected the single-precision version to be faster. Why does this happen? Doesn't cublasSgemm use Tensor Cores?
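For reference, a minimal sketch of the call being described (the function wrapper, variable names, and column-major layout are my assumptions, not taken from the original code):

```cpp
// Sketch: G = X^T * X via cublasDgemm, column-major storage.
// dX is the N x K device matrix, dG is the K x K device result.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gram_matrix_fp64(cublasHandle_t handle,
                      const double* dX, int N, int K, double* dG)
{
    const double alpha = 1.0, beta = 0.0;
    // GEMM computes C = alpha * op(A) * op(B) + beta * C.
    // Here op(A) = X^T (K x N), op(B) = X (N x K), C = G (K x K).
    cublasDgemm(handle,
                CUBLAS_OP_T, CUBLAS_OP_N,
                K, K, N,
                &alpha,
                dX, N,   // A = X, leading dimension N
                dX, N,   // B = X, leading dimension N
                &beta,
                dG, K);  // C = G, leading dimension K
}
```

One detail worth checking for the float version: on Ampere, cublasSgemm runs on the regular FP32 units unless TF32 Tensor Core math is explicitly enabled, e.g. with `cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH)`.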
Regards,