I was also able to get the following results on an L4 (in Google Cloud):
FP8 with FP32 accumulate: 188 TFLOPS
FP16 with FP32 accumulate: 87 TFLOPS
FP16 with FP16 accumulate: 85 TFLOPS
INT8 with INT32 accumulate: 165 TOPS
So, relatively speaking, FP8 throughput here lands close to INT8 (about 1.14x) and at roughly 2.2x FP16, as expected.
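For anyone who wants to check the relative numbers, here is a minimal sketch that just does the arithmetic on the figures quoted above. It does not run any benchmark; the dictionary keys and variable names are labels made up for illustration.

```python
# Throughput figures quoted above for the L4 (TFLOPS; TOPS for INT8).
# Keys are illustrative labels of the form "input type/accumulate type".
results = {
    "FP8/FP32": 188.0,
    "FP16/FP32": 87.0,
    "FP16/FP16": 85.0,
    "INT8/INT32": 165.0,
}


def ratio(a: str, b: str) -> float:
    """Return the throughput of configuration a relative to configuration b."""
    return results[a] / results[b]


fp8_vs_fp16 = ratio("FP8/FP32", "FP16/FP32")
fp8_vs_int8 = ratio("FP8/FP32", "INT8/INT32")
int8_vs_fp16 = ratio("INT8/INT32", "FP16/FP32")

print(f"FP8 vs FP16:  {fp8_vs_fp16:.2f}x")   # 2.16x
print(f"FP8 vs INT8:  {fp8_vs_int8:.2f}x")   # 1.14x
print(f"INT8 vs FP16: {int8_vs_fp16:.2f}x")  # 1.90x
```

Both 8-bit paths come out at about twice the FP16 rate, which is what puts FP8 in the same bracket as INT8.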