I noticed in CUDA 12.1 update 1 that FP8 matrix multiplies are now supported on Ada chips when using cuBLASLt. However, when I tried a benchmark on an RTX 4090, I was only able to achieve about half of the rated throughput, around ~330-340 TFLOPS. My benchmark was a straightforward modification of the cuBLASLt FP8 sample to use larger matrices, run more iterations, and use CUDA streams. I primarily tried M = N = K = 8192, but other sizes behaved similarly. I tried this with both FP16 and FP32 output and got the same result, although I was only able to use FP32 as the compute type since that is the only mode cuBLASLt supports right now.
My result is quite far off from the 660 TFLOPS specified in the Ada whitepaper for FP8 Tensor Core throughput with FP32 accumulate. Is there a mistake in the whitepaper, or is there some incorrect throttling of FP8 → FP32 operations going on (much like how FP16 → FP32 operations are half-rate on GeForce cards)?
As a sanity check I also modified the code to benchmark a few other configurations:
- FP16 with FP32 accumulate: ~170 TFLOPS
- FP16 with FP16 accumulate: ~270 TFLOPS
- INT8 with INT32 accumulate: ~560 TOPS
These other benchmarks all seem to be within 20% of the rated spec, so I don’t think my benchmark is the issue here.
Also, should we expect FP16 accumulation to be supported by cuBLAS at some point? Otherwise, it seems like using FP8 on GeForce Ada cards currently has relatively little benefit over pure FP16 operations. For context, I'm primarily interested in inference applications anyway, so higher-precision accumulation is not strictly necessary.
I was also able to get the following results on an L4 (in Google Cloud):
- FP8 with FP32 accumulate: 188 TFLOPS
- FP16 with FP32 accumulate: 87 TFLOPS
- FP16 with FP16 accumulate: 85 TFLOPS
- INT8 with INT32 accumulate: 165 TOPS
So relatively speaking the FP8 performance is more like INT8 here, as expected.
One possibility is that clock throttling triggered by power capping is at play.
If you look at the L40 specs in the same whitepaper, based on the same GPU die (AD102), the L40 actually has higher counts of SMs and Tensor Cores per GPU, yet a lower stated FP8 performance, roughly in line with your number. I also note footnote 1 on the 4090 specifications, which states that performance is based on the boost clock. If the GPU is able, from a power and thermal perspective, to maintain the boost clock under this workload, then that might be an achievable number. However, it's possible that is not the case.
I don't have a 4090 to test, myself. If you wanted to test this theory, I would make the duration of the GEMM workload as long as is feasible, perhaps running many GEMMs back to back, and then monitor nvidia-smi carefully (for example, nvidia-smi -q -d PERFORMANCE, which reports clocks and clock throttle reasons) to see if anything shows up there. This is difficult to observe, so not seeing throttling is not exactly proof that it is not happening, but by experimenting you may be able to make an observation.
Thanks for the quick response, Robert.
It's interesting: the power consumption of the FP8 test never goes above 300 W in nvidia-smi, despite the 450 W default power limit on the 4090. Of my four tests, the most power-intensive one is INT8 → INT32, which hits around 380 W. I've also been able to max out the power limit in other applications. Clock speeds also stay reasonably close to the rated boost clock (~2.5 GHz).
Any chance you could get some additional clarification from someone within NVIDIA?
Same problem when training LLMs: the 4090 is underutilized, and wattage rarely goes above 250 W with utilization around 75%.
Hello, I'm testing FP8 matmul on an RTX 4090 too. Would you mind sharing your test code? I tried CUDALibrarySamples/cuBLASLt/LtFp8matmul, but the official code runs into an error when I change M, N, K in main.cpp. Specifically, it fails with cuBLAS status 13 and terminates every time M > 1024 (e.g. setting the TestBench to (2048, 64, 64, 1.0, 0.0, 1ULL * 1024 * 1024 * 1024)). My system is: CUDA 12.1, driver 530, Ubuntu 22.04.
"much like how FP16 → FP32 operations are half-rate on GeForce cards"

Maybe there is some way to get around that? Is it a software limitation?