cuBLAS SGEMM FLOPS measurement using nvprof on Volta

I am measuring FLOPS with nvprof for an 8Kx8Kx8K SGEMM.

Steps:

  1. Run a cuBLAS-based SGEMM (i.e. multiply two random matrices of size 8Kx8K and 8Kx8K) on Volta.
  2. Set the operating frequency of Volta to 1200 MHz.
  3. Measure the kernel time using the nvprof --print-gpu-trace option (87 ms).
  4. Measure the number of single-precision floating-point operations executed using the flop_count_sp metric (nvprof --metrics flop_count_sp).
  5. Compute actual FLOPS = operations / time ==> 12561 GFLOPS.
  6. Theoretical FLOPS of Volta = 5120 (number of CUDA cores) * 1200 MHz (frequency) * 2 (an FMA counts as 2 operations) ==> 12288 GFLOPS.
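
For reference, the arithmetic in steps 5 and 6 can be sketched in Python as below. This is only a rough check, not the exact calculation behind the posted numbers: it assumes "8K" means 8192, and it uses the nominal operation count 2*M*N*K in place of the value reported by flop_count_sp, so the achieved figure it prints will not match 12561 GFLOPS exactly. The function names are made up for illustration.

```python
def peak_gflops(cuda_cores: int, clock_mhz: float, flops_per_core_per_cycle: int = 2) -> float:
    """Theoretical peak: cores * clock * 2 (one FMA per core per cycle), in GFLOPS."""
    return cuda_cores * clock_mhz * flops_per_core_per_cycle / 1e3


def achieved_gflops(flop_count: float, time_s: float) -> float:
    """Achieved throughput: operations / time, in GFLOPS."""
    return flop_count / time_s / 1e9


# Assumption: "8K" is taken as 8192 (not 8000).
M = N = K = 8192

# Assumption: nominal GEMM operation count, standing in for the
# flop_count_sp value that the profiler reports.
nominal_flops = 2 * M * N * K

peak = peak_gflops(cuda_cores=5120, clock_mhz=1200)
measured = achieved_gflops(nominal_flops, time_s=87e-3)

print(f"peak     : {peak:.0f} GFLOPS")
print(f"achieved : {measured:.0f} GFLOPS")
print(f"ratio    : {100 * measured / peak:.1f} %")
```

With these inputs the ratio also comes out above 100%, so the discrepancy does not seem to depend on the exact counter value.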

Problem:
The measured FLOPS are greater than the theoretical FLOPS (102%).
This is seen only for larger matrix sizes; for smaller matrix sizes the measured value appears to stay below 100%.
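
One way to put the 102% figure in perspective is to ask what clock frequency the measured throughput would imply under the same cores * clock * 2 model. The sketch below does only that arithmetic. It assumes the 2 operations per core per cycle model holds and that the measured 12561 GFLOPS is accurate; it does not say what the GPU clock actually was during the run. The function name is hypothetical.

```python
def implied_clock_mhz(measured_gflops: float, cuda_cores: int,
                      flops_per_core_per_cycle: int = 2) -> float:
    """Clock (MHz) at which the peak model would equal the measured throughput."""
    return measured_gflops * 1e3 / (cuda_cores * flops_per_core_per_cycle)


# Measured value and core count taken from the post.
clock = implied_clock_mhz(measured_gflops=12561, cuda_cores=5120)
print(f"implied clock: {clock:.1f} MHz")  # about 1226.7 MHz, versus the 1200 MHz that was set
```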

Any idea what could explain this behaviour?

Regards,