Performance on A100 SXM4 40GB: TF32 vs FP32

I run a matmul on two 10240×10240 matrices.
As far as I know, enabling TF32 mode on an A100 should give a large speedup.
Based on the report, it should be:
FP32: 0.14s
TF32: 0.018s
But I get:
FP32: 0.619
TF32: 1.785

CUDA version 11.7
PyTorch 1.13

What is the problem with this code?
Since the FP32 run happens before the TF32 run, the data should already be resident in GPU memory. But the TF32 performance is still worse.

import torch
import time
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

a_full = torch.randn(10240, 10240, dtype=torch.float, device='cuda')
b_full = torch.randn(10240, 10240, dtype=torch.float, device='cuda')
ab_full = a_full @ b_full

a = a_full.float()
b = b_full.float()

# Do matmul with TF32 disabled.
torch.backends.cudnn.allow_tf32 = False
torch.backends.cuda.matmul.allow_tf32 = False
start = time.time()
ab_fp32 = a @ b # takes 0.14s on GA100
end = time.time()
print("FP32:",(end - start)*1000)

# Do matmul at TF32 mode.
torch.backends.cudnn.allow_tf32 = True
torch.backends.cuda.matmul.allow_tf32 = True
start = time.time()
ab_tf32 = a @ b # takes 0.018s on GA100
end = time.time()
print("TF32:",(end - start)*1000)

Update: I just found how to fix it here: Link
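For reference, the usual explanation for timings like these is that CUDA kernels launch asynchronously, so `time.time()` around `a @ b` measures only the launch (or whatever pending work happens to block), not the matmul itself. A hedged sketch of a corrected benchmark, assuming the fix is to call `torch.cuda.synchronize()` around the timed region and to exclude a warm-up run (the helper name `timed_matmul` and the CPU fallback size are my own additions):

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Use the original 10240×10240 size on GPU; a smaller size keeps the
# sketch runnable on CPU-only machines.
n = 10240 if device.type == "cuda" else 512
a = torch.randn(n, n, dtype=torch.float, device=device)
b = torch.randn(n, n, dtype=torch.float, device=device)

def timed_matmul(a, b):
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain pending work before starting the clock
    start = time.time()
    out = a @ b
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the matmul kernel to actually finish
    return out, time.time() - start

_ = timed_matmul(a, b)  # warm-up run: pays one-time cuBLAS init cost, not timed

torch.backends.cuda.matmul.allow_tf32 = False
_, t_fp32 = timed_matmul(a, b)
print(f"FP32: {t_fp32 * 1000:.1f} ms")

torch.backends.cuda.matmul.allow_tf32 = True
_, t_tf32 = timed_matmul(a, b)
print(f"TF32: {t_tf32 * 1000:.1f} ms")
```

With synchronization in place, the TF32 run should come out faster than FP32 on an A100; `torch.cuda.Event` timing would be an alternative to wall-clock timing here.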