RTX 2060: half-precision matrix multiply slower than float precision

I have an RTX 2060 GPU in a Windows system; my CPU is an i5-9400F and I have 8 GB of memory. When I compute a batched matrix multiply in half precision, it is slower than the same multiply in float precision, so I suspect that my CPU and memory do not match my GPU.

Environment:
CUDA: 10.0.130
Language: Python
Framework: PyTorch 1.0.0

import torch

a = torch.Tensor(128, 64, 64).cuda()        # uninitialized float32 tensor on the GPU
torch.nn.init.normal_(a, mean=0., std=1.0)  # fill with samples from N(0, 1)
b = torch.Tensor(128, 64, 64).cuda()
torch.nn.init.normal_(b, mean=0., std=1.0)

torch.bmm(a, b)  # float32 batched matmul: 0.48713 s
c = a.half()
d = b.half()
torch.bmm(c, d)  # float16 batched matmul: 0.77825 s
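
In case my timing method is the problem: CUDA kernel launches are asynchronous, so timing a single torch.bmm call without synchronizing mostly measures launch overhead rather than the kernel itself. Here is a minimal sketch of how I would benchmark instead (the helper name bench_bmm, the warm-up and iteration counts, and the use of time.perf_counter are my own choices, not from my original measurement):

import time
import torch

def bench_bmm(x, y, iters=100):
    # Warm up so one-time CUDA/cuBLAS initialization is not measured
    for _ in range(10):
        torch.bmm(x, y)
    torch.cuda.synchronize()  # wait for queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        torch.bmm(x, y)
    torch.cuda.synchronize()  # wait for all kernels to finish before stopping
    return (time.perf_counter() - start) / iters

a = torch.randn(128, 64, 64, device='cuda')
b = torch.randn(128, 64, 64, device='cuda')
print('float32 per call:', bench_bmm(a, b))
print('float16 per call:', bench_bmm(a.half(), b.half()))

Even with synchronized timing, I am not sure whether 64x64 matrices are large enough for half precision to show a benefit.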