CUDA lib performance on Ampere architecture


I recently tested the cgemm function on a GeForce RTX 3090. I was surprised to get only ~60% of the theoretical peak GFLOPS, whereas with older GPUs I used to get between 80 and 90% (see pictures).

Is this normal? Do you get similar results on 3080/3070 cards?

The benchmark was done using CUDA 11.2.
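
For anyone who wants to compare numbers, here is a minimal sketch of how a "percent of peak" figure like the one above is typically derived. It is not the exact benchmark code. It assumes the common convention of 8 real FLOPs per complex multiply-add, and it takes the peak as CUDA cores x 2 FLOPs per FMA x clock. The function names are hypothetical, and the core count and clock you plug in should come from the spec sheet of your own card.

```cpp
#include <cassert>
#include <cstdint>

// FLOP count of one CGEMM with an M x K matrix A and a K x N matrix B.
// Assumption: one complex multiply-add counts as 8 real FLOPs
// (6 for the complex multiply, 2 for the complex add).
// The alpha/beta scaling of the result is neglected.
double cgemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 8.0 * static_cast<double>(m) * static_cast<double>(n) * static_cast<double>(k);
}

// Theoretical FP32 peak in GFLOPS: cores x 2 FLOPs per FMA x clock in GHz.
double peak_gflops(int cuda_cores, double clock_ghz) {
    return cuda_cores * 2.0 * clock_ghz;
}

// Achieved GFLOPS from a measured kernel time in seconds.
double achieved_gflops(double flops, double seconds) {
    assert(seconds > 0.0);
    return flops / seconds * 1e-9;
}

// Achieved throughput as a percentage of the theoretical peak.
double percent_of_peak(double achieved, double peak) {
    assert(peak > 0.0);
    return 100.0 * achieved / peak;
}
```

As an illustration only: with the published RTX 3090 figures (10496 CUDA cores, 1.695 GHz boost clock) `peak_gflops` gives about 35581 GFLOPS. A made-up timing of 0.2 s for an 8192 x 8192 x 8192 CGEMM would then correspond to roughly 62% of peak. The real boost clock under load can differ from the spec value, which shifts the percentage.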


For ease of reproduction you might want to mention whether these are square matrices, and what transpose modes were used.

Have you tried looking at this with the CUDA profiler? This is just wild speculation, but even tiled GEMM implementations require a lot of memory bandwidth. With FLOPS always growing faster than memory bandwidth, and given the increased memory bandwidth demands of GEMM with complex types, CGEMM may have become partially limited by memory throughput on Ampere GPUs that don’t use HBM2.
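
A rough way to sanity-check this kind of speculation before profiling is a roofline-style estimate. The sketch below is a deliberately simplified model, not a description of how cuBLAS actually tiles its kernels. It assumes a square T x T output tile that loads T elements of A and T elements of B per step along k, and it ignores cache reuse between tiles. All function names are hypothetical.

```cpp
#include <cassert>

// Machine balance: FLOPs the GPU can execute per byte it can fetch from DRAM.
double machine_balance(double peak_gflops, double bandwidth_gbs) {
    assert(bandwidth_gbs > 0.0);
    return peak_gflops / bandwidth_gbs;
}

// Arithmetic intensity (FLOP/byte) of a T x T output tile in the simplified model:
// each step along k performs T*T multiply-adds and loads 2*T input elements.
double tile_intensity(int tile, double flops_per_mac, double bytes_per_elem) {
    assert(tile > 0 && bytes_per_elem > 0.0);
    return (static_cast<double>(tile) * tile * flops_per_mac) /
           (2.0 * tile * bytes_per_elem);
}

// Tile edge at which the tile intensity equals the machine balance.
// Smaller tiles would be bandwidth-bound in this model.
double min_tile_edge(double balance, double flops_per_mac, double bytes_per_elem) {
    assert(flops_per_mac > 0.0);
    return balance * 2.0 * bytes_per_elem / flops_per_mac;
}
```

For SGEMM you would pass 2 FLOPs per multiply-add and 4 bytes per element; for CGEMM, 8 FLOPs and 8 bytes. With the published RTX 3090 figures (about 35581 GFLOPS FP32 peak and 936 GB/s memory bandwidth) the machine balance comes out near 38 FLOP/byte. In this model the intensity is T/4 for SGEMM and T/2 for CGEMM. Whether the real kernels behave like this is exactly what the profiler's memory throughput counters would show.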

Thanks for your answer.

Square matrices indeed, I forgot to mention that. That’s an interesting hypothesis. I will try SGEMM to see if I get similar results. I remember NVIDIA posting SGEMM performance during CUDA presentations some years ago, but I haven’t seen that recently. If someone has numbers for SGEMM, please share them :)

I’ll also take a look at the profiler.