I recently tested the cgemm function on a GeForce RTX 3090. I was surprised to get only ~60% of the theoretical peak GFLOPS, whereas with older GPUs I used to get between 80% and 90% (see pictures).
Is this normal? Do you get similar results on 3080/3070 cards?
The benchmark was done using CUDA 11.2.
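
For anyone who wants to compare numbers, here is a minimal sketch of how a percent-of-peak figure for cgemm can be computed. It assumes the usual convention of 8 real flops per complex multiply-add (so 8·M·N·K per GEMM) and a peak of 2 flops per CUDA core per clock. The RTX 3090 figures (10496 CUDA cores, 1.695 GHz boost clock) are taken from the published specs as I understand them, and the matrix size and elapsed time are made-up placeholders, not my measurements. All function names are hypothetical.

```python
def cgemm_flops(m, n, k):
    """Real floating-point operations in one complex GEMM (M x K times K x N).

    Each complex multiply-add is counted as 4 real multiplies + 4 real adds.
    """
    return 8 * m * n * k


def achieved_gflops(m, n, k, seconds):
    """Achieved throughput in GFLOPS for one cgemm call that took `seconds`."""
    return cgemm_flops(m, n, k) / seconds / 1e9


def peak_fp32_gflops(cuda_cores, clock_ghz):
    """Theoretical FP32 peak, assuming one FMA (2 flops) per core per clock."""
    return 2 * cuda_cores * clock_ghz


def percent_of_peak(achieved, peak):
    """Achieved throughput as a percentage of the theoretical peak."""
    return 100.0 * achieved / peak


if __name__ == "__main__":
    # Assumed RTX 3090 specs (not measured): 10496 CUDA cores, 1.695 GHz boost.
    peak = peak_fp32_gflops(10496, 1.695)

    # Placeholder problem size and timing; substitute your own benchmark values.
    m = n = k = 8192
    seconds = 0.2

    achieved = achieved_gflops(m, n, k, seconds)
    print(f"peak     : {peak:.1f} GFLOPS")
    print(f"achieved : {achieved:.1f} GFLOPS")
    print(f"ratio    : {percent_of_peak(achieved, peak):.1f} %")
```

Note that the result depends on which clock you plug in: the card may run above or below the advertised boost clock during the benchmark, so the percentage is only as good as that assumption.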