CUDA lib performance on Ampere architecture

jfdeg256 · April 22, 2021, 8:25am

Hello,

I recently tested the cgemm function on a GeForce RTX 3090. I was surprised to get only ~60% of the theoretical peak GFLOPS, whereas with older GPUs I used to get between 80 and 90% (see pictures).

Is this normal? Do you get similar results on 3080/3070 cards?

The benchmark was done using CUDA 11.2.

Jeff

njuffa · April 22, 2021, 8:59am

For ease of reproduction you might want to mention whether these are square matrices, and what transpose modes were used.

Have you tried looking at this with the CUDA profiler? This is just wild speculation, but even tiled GEMM implementations require a lot of memory bandwidth. With FLOPS always growing faster than memory bandwidth, and given the increased memory bandwidth requirements of GEMM with complex types, CGEMM may have become partially limited by memory throughput on Ampere GPUs that don’t use HBM2.

jfdeg256 · April 22, 2021, 9:50am

Thanks for your answer.

Square matrices indeed, I forgot to mention that. That’s an interesting hypothesis. I will try SGEMM to see if I get similar results. I remember NVIDIA posting SGEMM performance numbers during CUDA presentations some years ago, but I haven’t seen any recently. If someone has numbers for SGEMM, please share them :)

I’ll also take a look at the profiler.

Jeff
