Throughput drops after saturation with more threads

In theory, as the number of threads on an SM increases, throughput rises until it reaches its peak and then saturates: adding more threads gives no further speedup, so the throughput curve should stay flat.

In this figure, throughput first rises linearly, but just as it is about to flatten out, it drops into a dip. The figure is from page 23 of “Better Performance at Lower Occupancy”: http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf.

My question is why the curve in this figure has a dip instead of staying flat.

P.S. The kernel is as follows:

    #pragma unroll UNROLL
    for( int i = 0; i < N_ITERATIONS; i++ )
    {
        a = a * b + c;
    }

I would be grateful for any comments.

The kernel hardly accesses global memory; it uses mostly (only) local memory, if I am not mistaken.

How many thread blocks are used?

I would think the curve may equally be affected by the change in the number of thread blocks running concurrently, which in turn depends on local memory requirements and on how spilling is optimized as the number of threads increases.

a, b and c in the kernel are in registers. It runs only 1 block.
Yes, I think some optimization or other effect might be shaping the curve.

Thanks for your comment. jimmy.
I wonder if anyone has had the same problem before. Any other answers that explain the dip are welcome.
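
For anyone who wants to reproduce the curve, here is a minimal sketch of how the measurement could be set up. It is not the code behind the figure in the slides: the values of N_ITERATIONS and UNROLL, the kernel name fma_chain, the helper gflops, the 32-thread step and the 1024-thread upper limit are all my assumptions (older GPUs allow only 512 threads per block). It launches a single block, keeps a, b and c in registers, and times each launch with CUDA events.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed values; the slides do not state what was used for the figure.
constexpr int N_ITERATIONS = 1 << 20;
constexpr int UNROLL = 16;

// Each thread runs one chain of dependent multiply-adds held in registers.
__global__ void fma_chain(float *out, float b, float c)
{
    float a = threadIdx.x;
    #pragma unroll UNROLL
    for (int i = 0; i < N_ITERATIONS; i++)
    {
        a = a * b + c;
    }
    out[threadIdx.x] = a;  // store the result so the loop is not optimized away
}

// Counts each multiply-add as 2 flops and converts a time in ms to GFLOP/s.
static double gflops(int threads, float ms)
{
    return 2.0 * threads * (double)N_ITERATIONS / (ms * 1e6);
}

int main()
{
    const int maxThreads = 1024;  // assumed per-block limit; 512 on older GPUs
    float *out;
    cudaMalloc(&out, maxThreads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time initialization is not included in the timings.
    fma_chain<<<1, 32>>>(out, 0.999f, 1.0f);
    cudaDeviceSynchronize();

    // One block only; grow it one warp at a time and time each launch.
    for (int threads = 32; threads <= maxThreads; threads += 32)
    {
        cudaEventRecord(start);
        fma_chain<<<1, threads>>>(out, 0.999f, 1.0f);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4d threads  %8.3f ms  %8.2f GFLOP/s\n",
               threads, ms, gflops(threads, ms));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return 0;
}
```

b and c are passed as kernel arguments so the compiler cannot fold the loop into a constant. Plotting the last column against the thread count should give a curve comparable to the one in the slides, although the exact shape will depend on the GPU and compiler version.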

The number of warps increases as well, and that may also affect execution in the sense of what gets completed when; for one, it should affect scheduling by the warp schedulers.