Ada GeForce (RTX 4090) FP8 cuBLASLt performance

dpcc · April 20, 2023, 9:23pm

Hello,

I noticed in CUDA 12.1 update 1 that FP8 matrix multiplies are now supported on Ada chips when using cuBLASLt. However, when I tried a benchmark on an RTX 4090 I was only able to achieve half of the rated throughput, roughly 330-340 TFLOPS. My benchmark was a straightforward modification of the cuBLASLt FP8 sample to use larger matrices, run more iterations and use CUDA streams. I primarily tried M = N = K = 8192, but other sizes had similar behavior. I tried this with both FP16 and FP32 output and got the same result, although I was only able to use FP32 for the compute type as this is the only supported mode in cuBLASLt right now.
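
For reference, throughput figures in a benchmark like this are normally derived from the GEMM dimensions and the measured time per multiply. The following is a minimal Python sketch of that arithmetic, assuming the conventional count of 2·M·N·K operations per GEMM. The function names are illustrative and are not taken from the actual benchmark code.

```python
def gemm_tflops(m: int, n: int, k: int, seconds_per_gemm: float) -> float:
    """Achieved throughput in TFLOPS for one M x N x K GEMM.

    Counts one multiply and one add per inner-product term, i.e. 2*M*N*K
    operations, which is the usual convention for tensor throughput.
    """
    if seconds_per_gemm <= 0:
        raise ValueError("seconds_per_gemm must be positive")
    return 2.0 * m * n * k / seconds_per_gemm / 1e12


def seconds_for_target(m: int, n: int, k: int, tflops: float) -> float:
    """Time a single GEMM would take at a given sustained TFLOPS rate."""
    return 2.0 * m * n * k / (tflops * 1e12)
```

At M = N = K = 8192 a single GEMM is about 1.1e12 operations, so `seconds_for_target(8192, 8192, 8192, 660.0)` comes to roughly 1.67 ms, while a rate of 330 TFLOPS corresponds to roughly 3.33 ms per GEMM.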

My result is quite far off from the 660 TFLOPS specified in the Ada whitepaper for FP8 tensor throughput with FP32 accumulate. Is there a mistake in the whitepaper, or is there some incorrect throttling of FP8 → FP32 operations going on (much like how FP16 → FP32 operations are half-rate on GeForce cards)?

As a sanity check I also modified the code to benchmark a few other configurations:

FP16 with FP32 accumulate: ~170 TFLOPS
FP16 with FP16 accumulate: ~270 TFLOPS
INT8 with INT32 accumulate: ~560 TOPS

These other benchmarks all seem to be within 20% of the rated spec, so I don’t think my benchmark is the issue here.
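
The comparison against the rated spec can be sketched as below. The measured values come from this post, with 335 standing in for the 330-340 FP8 range. The rated dense values other than FP8 are an assumption based on a reading of the RTX 4090 table in the Ada whitepaper, so check them against the whitepaper before relying on the ratios.

```python
# Measured throughput from the post (TFLOPS, or TOPS for INT8).
MEASURED = {
    "FP8/FP32": 335.0,   # midpoint of the reported 330-340 range
    "FP16/FP32": 170.0,
    "FP16/FP16": 270.0,
    "INT8/INT32": 560.0,
}

# Rated dense (non-sparse) throughput. FP8 is the 660 quoted in the post;
# the other three are ASSUMED from the Ada whitepaper and should be verified.
RATED = {
    "FP8/FP32": 660.0,
    "FP16/FP32": 165.2,
    "FP16/FP16": 330.3,
    "INT8/INT32": 660.6,
}


def fraction_of_rated(measured: dict, rated: dict) -> dict:
    """Ratio of measured to rated throughput for each configuration."""
    return {name: measured[name] / rated[name] for name in measured}


def within_tolerance(ratio: float, tolerance: float = 0.20) -> bool:
    """True if the measured value is within +/- tolerance of the rated one."""
    return abs(ratio - 1.0) <= tolerance


if __name__ == "__main__":
    for name, ratio in fraction_of_rated(MEASURED, RATED).items():
        print(f"{name}: {ratio:.2f} of rated, within 20%: {within_tolerance(ratio)}")
```

Under those assumed rated values, the three sanity-check configurations land within 20% of spec, and FP8 is the only one at about half.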

Also, should we expect FP16 accumulation to be supported at some point by cuBLAS? Otherwise it seems like right now using FP8 on GeForce Ada cards has relatively little benefit over pure FP16 operations. For context, I’m primarily interested in inference applications anyway, so the higher-precision accumulation is not really necessary.

dpcc · April 20, 2023, 9:44pm

I was also able to get the following results on an L4 (in Google Cloud):

FP8 with FP32 accumulate: 188 TFLOPS
FP16 with FP32 accumulate: 87 TFLOPS
FP16 with FP16 accumulate: 85 TFLOPS
INT8 with INT32 accumulate: 165 TOPS

So relatively speaking the FP8 performance is more like INT8 here, as expected.

Robert_Crovella · April 20, 2023, 10:30pm

I think one possibility might be that it may have to do with clock throttling triggered by power capping.

If you look at the L40 specs in the same whitepaper, based on the same GPU die (AD102), the L40 actually has higher specs in terms of SMs/GPU and Tensor Cores/GPU, but has a lower stated FP8 performance, roughly in line with your number. I also note footnote 1 on the 4090 specifications, which states that performance is based on boost clock. If the GPU is able, from a power and thermal perspective, to maintain boost clock under this workload, then that might be an achievable number. However, it’s possible that is not the case.

I don’t have a 4090 to test, myself. If you wanted to test this theory, I would make the duration of the GEMM operation as long as is feasible, perhaps doing many GEMMs one after another, and then monitor nvidia-smi carefully, particularly the clocks section and the clock throttle reasons, to see if anything is happening there. This is difficult to do, so not observing it is not exactly proof that it is not happening, but by playing around you may be able to make an observation.
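
One way to do this kind of monitoring is to poll nvidia-smi from a second process while the GEMM loop runs. The Python sketch below is one possible approach, not a definitive tool. It assumes the query field names reported by `nvidia-smi --help-query-gpu` on the driver in use (newer drivers may call the last field `clocks_event_reasons.active`), and the bit names follow NVML's `nvmlClocksThrottleReason*` constants. The helper names are hypothetical.

```python
import subprocess
import time

# Query fields; confirm them with `nvidia-smi --help-query-gpu` (assumption:
# this driver still exposes clocks_throttle_reasons.active under that name).
FIELDS = [
    "clocks.sm", "clocks.max.sm", "power.draw", "power.limit",
    "temperature.gpu", "clocks_throttle_reasons.active",
]

# Bitmask meanings, following NVML's nvmlClocksThrottleReason* constants.
THROTTLE_BITS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


def decode_throttle(mask: int) -> list:
    """Names of all throttle reasons whose bit is set in the mask."""
    return [name for bit, name in THROTTLE_BITS.items() if mask & bit]


def parse_sample(csv_line: str) -> dict:
    """Parse one line of `--format=csv,noheader,nounits` output."""
    values = [v.strip() for v in csv_line.split(",")]
    sample = dict(zip(FIELDS, values))
    mask = int(sample["clocks_throttle_reasons.active"], 16)
    sample["reasons"] = decode_throttle(mask)
    return sample


def poll(gpu_index: int = 0, interval_s: float = 0.1, samples: int = 100):
    """Yield parsed samples; run this while the benchmark is executing."""
    cmd = [
        "nvidia-smi", f"--id={gpu_index}",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    for _ in range(samples):
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        yield parse_sample(out.stdout.splitlines()[0])
        time.sleep(interval_s)


if __name__ == "__main__":
    for s in poll():
        print(s["clocks.sm"], "MHz", s["power.draw"], "W", s["reasons"])
```

Each call spawns a new nvidia-smi process, so the sampling is coarse and short throttling events can be missed, which is the caveat raised above. Running nvidia-smi once with its `--loop-ms` option is an alternative that avoids the per-sample process startup.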

dpcc · April 20, 2023, 10:49pm

Thanks for the quick response, Robert.

It’s interesting, the power consumption of the FP8 test actually never goes above 300 watts in nvidia-smi, despite the 450 W default power limit for the 4090. Of my 4 tests, the most power-intensive one is INT8 → INT32, which hits around 380 watts. I’ve also been able to max out the power limit in other applications. Clock speeds also seem reasonably close to the rated boost clock (~2.5 GHz).
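
As a rough check on the throttling theory, one can estimate what clock would be needed to explain the shortfall. The sketch below assumes tensor throughput scales linearly with SM clock and that the rated figure is quoted at boost clock (per the whitepaper footnote mentioned earlier). The function name is illustrative.

```python
def implied_clock_ghz(rated_tflops: float, boost_clock_ghz: float,
                      measured_tflops: float) -> float:
    """Clock at which a part rated at boost clock would deliver the measured
    throughput, ASSUMING throughput is proportional to SM clock."""
    if rated_tflops <= 0:
        raise ValueError("rated_tflops must be positive")
    return boost_clock_ghz * measured_tflops / rated_tflops


if __name__ == "__main__":
    # 660 TFLOPS rated, ~2.5 GHz boost, ~335 TFLOPS measured (from the thread).
    print(f"{implied_clock_ghz(660.0, 2.5, 335.0):.2f} GHz")
```

With the numbers in this thread, the clock would have to average about 1.27 GHz to account for the result, which is far from the ~2.5 GHz observed.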

Any chance you could get some additional clarification from someone within NVIDIA?

germaniga4 · May 6, 2023, 12:15pm

Same problem when training LLMs. 4090 underutilized and wattage rarely goes higher than 250W with utilization around 75%.

794906124 · May 10, 2023, 10:58am

Hello, I’m testing FP8 matmul on an RTX 4090 too. Would you mind sharing your test code? I tried CUDALibrarySamples/cuBLASLt/LtFp8matmul, but the official code runs into an error when I change M, N, K in main.cpp. To be specific, the code hits cuBLAS status 13 and terminates every time once M > 1024 (e.g. setting the TestBench to (2048, 64, 64, 1.0, 0.0, 1ULL * 1024 * 1024 * 1024)). My system is: CUDA 12.1, Driver 530, Ubuntu 22.04.
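
For anyone reading the numeric status, the values of the `cublasStatus_t` enum can be mapped back to their names. The lookup below is a small Python sketch with values taken from the enum as declared in the cuBLAS headers. It only names the status and says nothing about the underlying cause.

```python
# Numeric values of cublasStatus_t as declared in the cuBLAS headers.
CUBLAS_STATUS_NAMES = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}


def cublas_status_name(code: int) -> str:
    """Return the enum name for a numeric cuBLAS status, if known."""
    return CUBLAS_STATUS_NAMES.get(code, f"unknown cuBLAS status {code}")
```

Here `cublas_status_name(13)` gives `CUBLAS_STATUS_EXECUTION_FAILED`, a generic code meaning the GPU-side work failed to run.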

val.zapod.vz · June 1, 2023, 4:40am

“much like how FP16 → FP32 operations are half-rate on GeForce cards”

Maybe there is some way to get around that? Is it a software limitation?

LukeCuda · November 2, 2023, 10:10pm

Any updates here? Nvidia is a hardware marketing company and always lags on the software because it doesn’t sell cards!!
