Roofline Model for NVIDIA GTX 1080

I am profiling cuBLAS single-precision GEMM (SGEMM) on square matrices with sizes from n=256 to n=16K, monitoring floating-point unit utilization, execution time, and DRAM read throughput. The utilization and throughput results disagree: FPU utilization saturates at n=2048, while DRAM read throughput keeps increasing until n=8K. Any ideas why the throughput does not saturate along with FPU utilization?
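For reference, here is a rough sketch of the ideal roofline estimate for square SGEMM. It assumes each matrix crosses DRAM exactly once (A and B read, C written), which is only a lower bound on real cuBLAS traffic, and it uses nominal GTX 1080 figures (about 8.9 TFLOP/s FP32 and 320 GB/s) as assumed defaults rather than measured values. The function name `sgemm_roofline` is made up for illustration.

```python
def sgemm_roofline(n, peak_flops=8.9e12, peak_bw=320e9):
    """Ideal roofline estimate for C = A @ B with square n x n float32 matrices.

    peak_flops and peak_bw are assumed nominal GTX 1080 numbers, not measurements.
    """
    flops = 2.0 * n**3               # one multiply and one add per inner-product term
    bytes_moved = 3 * n * n * 4      # read A and B, write C, 4 bytes each (ideal minimum)
    intensity = flops / bytes_moved  # FLOP per byte, equals n / 6 under this assumption
    attainable = min(peak_flops, intensity * peak_bw)
    return intensity, attainable


ridge = 8.9e12 / 320e9  # intensity above which the model says "compute bound"
print(f"ridge point: {ridge:.1f} FLOP/B")
for n in (256, 1024, 2048, 8192, 16384):
    intensity, attainable = sgemm_roofline(n)
    print(f"n={n:6d}  intensity={intensity:8.1f} FLOP/B  bound={attainable / 1e12:5.2f} TFLOP/s")
```

Under this idealized traffic model the intensity grows linearly with n, so the model alone does not say where measured DRAM throughput should level off. Actual traffic depends on how the kernel tiles the matrices and on cache reuse.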
Profiling is done with nvprof, and the following were used for reporting:
--print-gpu-trace for kernel execution time.
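The kernel time from the trace can be turned into an achieved FLOP rate and a lower bound on DRAM read throughput, which makes it easier to compare against the profiler metrics. The sketch below is only illustrative: the helper names are hypothetical, and the kernel time in the usage line is a placeholder, not one of my measurements.

```python
def achieved_gflops(n, kernel_time_s):
    """Achieved GFLOP/s for an n x n x n SGEMM that took kernel_time_s seconds."""
    return 2.0 * n**3 / kernel_time_s / 1e9


def min_dram_read_gbs(n, kernel_time_s):
    """Lower bound on DRAM read throughput in GB/s.

    Assumes A and B (float32, 4 bytes per element) are each read from DRAM
    exactly once; real kernels may read more because of imperfect reuse.
    """
    return 2 * n * n * 4 / kernel_time_s / 1e9


# Placeholder time for illustration only; substitute the value reported by the trace.
n, t = 2048, 2.5e-3
print(f"n={n}: {achieved_gflops(n, t):.0f} GFLOP/s, "
      f"at least {min_dram_read_gbs(n, t):.1f} GB/s of DRAM reads")
```

If the measured DRAM read throughput is far above this lower bound, the matrices are being re-read from DRAM several times, which would be one thing to check when comparing the two curves.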

I am attaching figures that describe the problem.
P.S. Execution time starts to show a limitation at n=8K.