SOL SM and Roofline seem to contradict each other?

I ran Nsight Compute (from CUDA 11.2) on my CUDA kernel.

It reports SOL SM at 79.44%, which I interpret as being pretty close to the maximum. SOL L1 is at 48.38%.

When I examine the Roofline chart, however, my measured result is very far from peak performance.

Achieved: 4.7 GFLOP/s.
Peak at the roofline: 93 GFLOP/s or so.

I also see ALU pipe utilization at 80+%.

So, if the ALU pipe is that heavily utilized, why is the achieved performance so much lower according to the roofline chart?
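
To put the gap in numbers, here is a quick sketch that uses only the figures quoted above. The variable names are mine, and 93 is the approximate peak read off the chart, so treat the result as a rough estimate.

```python
# Figures quoted in the post above.
achieved_gflops = 4.7        # achieved value shown on the roofline chart
roofline_peak_gflops = 93.0  # "93 GFLOP/s or so", read off the chart
sol_sm_pct = 79.44           # SOL SM as reported by Nsight Compute

# Achieved performance as a percentage of the roofline peak.
roofline_pct = 100.0 * achieved_gflops / roofline_peak_gflops

print(f"Roofline: {roofline_pct:.1f}% of peak")
print(f"SOL SM:   {sol_sm_pct:.2f}%")
```

That is roughly 5% of the roofline peak against almost 80% SOL SM, which is the apparent contradiction.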

OK, I think I solved it…

The roofline charts only show fp32 and fp64 operations. My code uses half-precision floats (fp16), so its fp16 work does not appear in the achieved value, and the roofline chart is of little use in my case.

I guess that could be a feature request: add fp16 data to the roofline chart.
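
A toy model of why that happens, as I understand it: if the chart only sums fp32 and fp64 floating-point operations, a kernel that is mostly fp16 looks nearly idle even though its ALUs are busy. The instruction mix below is entirely made up for illustration, `counted_flops` is a hypothetical helper rather than anything from Nsight Compute, and counting an FMA as two operations is the usual convention I am assuming here.

```python
# FLOPs credited per executed instruction; an FMA counts as two operations.
FLOPS_PER_INST = {"add": 1, "mul": 1, "fma": 2}


def counted_flops(inst_counts, precisions):
    """Sum the FLOPs of instructions whose precision is in `precisions`."""
    total = 0
    for (precision, op), count in inst_counts.items():
        if precision in precisions:
            total += FLOPS_PER_INST[op] * count
    return total


# Hypothetical instruction mix (NOT measured data): a mostly-fp16 kernel.
mix = {
    ("fp16", "fma"): 1_000_000,
    ("fp32", "fma"): 50_000,
    ("fp32", "add"): 10_000,
}

# What a chart restricted to fp32/fp64 would count...
charted = counted_flops(mix, {"fp32", "fp64"})
# ...versus all the floating-point work the kernel actually did.
actual = counted_flops(mix, {"fp16", "fp32", "fp64"})

print(f"charted FLOPs: {charted}")
print(f"actual FLOPs:  {actual}")
print(f"share visible on the chart: {charted / actual:.1%}")
```

With a mix like this, only a small share of the work lands on the chart, which would match a low achieved value next to a high ALU pipe utilization.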

Yes, fp16 support in the roofline chart is already on our list of feature requests.