I profiled my CUDA kernel with Nsight Compute from CUDA 11.2.
It reports SOL SM at 79.44%, which I interpret as being pretty close to the maximum. SOL L1 is at 48.38%.
When I examine the Roofline chart, I see that my measured result is very far away from peak performance.
Achieved: 4.7 GFlop/s.
Peak at roofline is 93 GFlop/s or so.
I also see ALU pipe utilization at 80+%.
So, if the ALU pipe is that heavily utilized, why is the achieved performance so much lower than the peak on the roofline chart?
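
To put a number on the gap, here is a quick Python sketch that uses only the figures quoted above. The variable names are my own and are not Nsight Compute metric names, and the peak is approximate ("93 or so"), so treat the result as a rough estimate.

```python
# Figures copied from the Nsight Compute report described above.
# Variable names are illustrative, not Nsight Compute metric names.
achieved_gflops = 4.7   # "Achieved" value on the roofline chart
peak_gflops = 93.0      # approximate roofline peak ("93 GFlop/s or so")
sm_sol_pct = 79.44      # SOL SM as reported

# Achieved throughput as a percentage of the roofline peak.
fraction_of_peak_pct = 100.0 * achieved_gflops / peak_gflops

# How many times larger the peak is than the achieved value.
gap_factor = peak_gflops / achieved_gflops

print(f"Achieved is {fraction_of_peak_pct:.1f}% of the roofline peak")
print(f"Peak is {gap_factor:.1f}x the achieved value")
print(f"SOL SM reports {sm_sol_pct:.2f}%")
```

So the roofline chart puts the kernel at roughly 5% of peak, while SOL SM and ALU pipe utilization both sit around 80%. That mismatch is what I am trying to understand.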