Ada GeForce (RTX 4090) FP8 cuBLASLt performance

I was also able to get the following results on an L4 (in Google Cloud):

FP8 with FP32 accumulate: 188 TFLOPS
FP16 with FP32 accumulate: 87 TFLOPS
FP16 with FP16 accumulate: 85 TFLOPS
INT8 with INT32 accumulate: 165 TOPS

So, relatively speaking, FP8 throughput here lands close to INT8 (188 TFLOPS vs. 165 TOPS) rather than FP16, as expected.
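
For reference, here is a minimal (untested) sketch of the kind of harness that produces numbers like these: an FP8 (e4m3) GEMM with FP32 accumulate and FP16 output through cuBLASLt, timed with CUDA events. The 8192³ problem size, 100 iterations, 32 MiB workspace, and FP16 output type are my assumptions, not taken from the post; error checking and the per-tensor scale pointers a real FP8 pipeline would set are omitted for brevity (the defaults of 1.0 apply).

```cpp
// Sketch: time an FP8 GEMM (FP32 accumulate) via cuBLASLt.
// Build with: nvcc fp8_bench.cu -lcublasLt -o fp8_bench
// Assumed: M, N, K, iteration count, and workspace size are arbitrary
// choices for illustration; no error checking.
#include <cublasLt.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int M = 8192, N = 8192, K = 8192;  // assumed problem size

    cublasLtHandle_t lt;
    cublasLtCreate(&lt);

    // FP8 matmuls require the "TN" layout: A transposed, B not.
    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasOperation_t ta = CUBLAS_OP_T, tb = CUBLAS_OP_N;
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSA, &ta, sizeof(ta));
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSB, &tb, sizeof(tb));

    // A and B are FP8 e4m3; accumulation is FP32, output FP16.
    cublasLtMatrixLayout_t aL, bL, cL;
    cublasLtMatrixLayoutCreate(&aL, CUDA_R_8F_E4M3, K, M, K);  // A^T stored KxM
    cublasLtMatrixLayoutCreate(&bL, CUDA_R_8F_E4M3, K, N, K);
    cublasLtMatrixLayoutCreate(&cL, CUDA_R_16F, M, N, M);

    // Contents are left uninitialized: only throughput is measured here.
    void *A, *B, *C, *ws;
    size_t wsSize = 32u * 1024 * 1024;
    cudaMalloc(&A, (size_t)M * K);       // 1 byte per FP8 element
    cudaMalloc(&B, (size_t)K * N);
    cudaMalloc(&C, (size_t)M * N * 2);   // 2 bytes per FP16 element
    cudaMalloc(&ws, wsSize);

    float alpha = 1.0f, beta = 0.0f;
    const int iters = 100;

    // Warm up once, then time `iters` back-to-back calls.
    cublasLtMatmul(lt, op, &alpha, A, aL, B, bL, &beta, C, cL, C, cL,
                   nullptr, ws, wsSize, 0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasLtMatmul(lt, op, &alpha, A, aL, B, bL, &beta, C, cL, C, cL,
                       nullptr, ws, wsSize, 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // One GEMM is 2*M*N*K FLOPs (multiply + add).
    double tflops = 2.0 * M * N * K * iters / (ms * 1e-3) / 1e12;
    printf("FP8 (FP32 accumulate): %.1f TFLOPS\n", tflops);
    return 0;
}
```

Passing `nullptr` for the algorithm lets cuBLASLt pick one internally; a more careful benchmark would query `cublasLtMatmulAlgoGetHeuristic` and sweep the candidates, since the default choice is not always the fastest.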
