Ada GeForce (RTX 4090) FP8 cuBLASLt performance

I was also able to get the following results on an L4 (in Google Cloud):

FP8 with FP32 accumulate: 188 TFLOPS
FP16 with FP32 accumulate: 87 TFLOPS
FP16 with FP16 accumulate: 85 TFLOPS
INT8 with INT32 accumulate: 165 TOPS

So, relatively speaking, FP8 throughput here lands close to INT8 (188 TFLOPS vs. 165 TOPS) rather than FP16, as expected.
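
For reference, here is a minimal (untested) sketch of the kind of harness that produces numbers like these: an FP8 (e4m3) GEMM with FP32 accumulate and FP16 output through cuBLASLt, timed with CUDA events. The 8192³ problem size, 100 iterations, 32 MiB workspace, and FP16 output type are my assumptions, not taken from the post; error checking and the per-tensor scale pointers a real FP8 pipeline would set are omitted for brevity (the defaults of 1.0 apply).

```cpp
// Sketch: time an FP8 GEMM (FP32 accumulate) via cuBLASLt.
// Build with: nvcc fp8_bench.cu -lcublasLt -o fp8_bench
// Assumed: M, N, K, iteration count, and workspace size are arbitrary
// choices for illustration; no error checking.
#include <cublasLt.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int M = 8192, N = 8192, K = 8192;  // assumed problem size

    cublasLtHandle_t lt;
    cublasLtCreate(&lt);

    // FP8 matmuls require the "TN" layout: A transposed, B not.
    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasOperation_t ta = CUBLAS_OP_T, tb = CUBLAS_OP_N;
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSA, &ta, sizeof(ta));
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSB, &tb, sizeof(tb));

    // A and B are FP8 e4m3; accumulation is FP32, output FP16.
    cublasLtMatrixLayout_t aL, bL, cL;
    cublasLtMatrixLayoutCreate(&aL, CUDA_R_8F_E4M3, K, M, K);  // A^T stored KxM
    cublasLtMatrixLayoutCreate(&bL, CUDA_R_8F_E4M3, K, N, K);
    cublasLtMatrixLayoutCreate(&cL, CUDA_R_16F, M, N, M);

    // Contents are left uninitialized: only throughput is measured here.
    void *A, *B, *C, *ws;
    size_t wsSize = 32u * 1024 * 1024;
    cudaMalloc(&A, (size_t)M * K);       // 1 byte per FP8 element
    cudaMalloc(&B, (size_t)K * N);
    cudaMalloc(&C, (size_t)M * N * 2);   // 2 bytes per FP16 element
    cudaMalloc(&ws, wsSize);

    float alpha = 1.0f, beta = 0.0f;
    const int iters = 100;

    // Warm up once, then time `iters` back-to-back calls.
    cublasLtMatmul(lt, op, &alpha, A, aL, B, bL, &beta, C, cL, C, cL,
                   nullptr, ws, wsSize, 0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasLtMatmul(lt, op, &alpha, A, aL, B, bL, &beta, C, cL, C, cL,
                       nullptr, ws, wsSize, 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // One GEMM is 2*M*N*K FLOPs (multiply + add).
    double tflops = 2.0 * M * N * K * iters / (ms * 1e-3) / 1e12;
    printf("FP8 (FP32 accumulate): %.1f TFLOPS\n", tflops);
    return 0;
}
```

Passing `nullptr` for the algorithm lets cuBLASLt pick one internally; a more careful benchmark would query `cublasLtMatmulAlgoGetHeuristic` and sweep the candidates, since the default choice is not always the fastest.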
