Measurement of FLOPS using nvprof for SGEMM 8Kx8Kx8K
Steps:
- Run cuBLAS-based SGEMM (i.e. multiply two random matrices of size 8Kx8K and 8Kx8K) on Volta.
- Set operating frequency of Volta to 1200 MHz.
- Measure the kernel time using the nvprof --print-gpu-trace option (87 ms).
- Measure the number of single-precision floating-point operations executed using the flop_count_sp metric (--metrics flop_count_sp).
- Compute actual FLOPS = operations / time ==> 12561 GFLOPS
- Theoretical FLOPS of Volta = 5120 (number of CUDA cores) * 1200 MHz (frequency) * 2 (an FMA counts as two operations) ==> 12288 GFLOPS
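
For reference, the arithmetic in the steps above can be sketched as follows. This is only a minimal sketch: the function names are made up for illustration, and the 2*N^3 operation count with N = 8192 is my assumption for what "8K" means, not the value reported by flop_count_sp.

```python
def peak_gflops(cuda_cores, clock_mhz, flops_per_core_per_cycle=2):
    """Theoretical single-precision peak; one FMA per core per cycle = 2 flops."""
    return cuda_cores * clock_mhz * 1e6 * flops_per_core_per_cycle / 1e9


def achieved_gflops(flop_count, elapsed_seconds):
    """Achieved throughput from an operation count and a kernel time."""
    return flop_count / elapsed_seconds / 1e9


# Values taken from the post.
peak = peak_gflops(cuda_cores=5120, clock_mhz=1200)
reported = 12561.0  # GFLOPS, as computed from the nvprof measurements
ratio_percent = reported / peak * 100

# Assumption: "8K" means 8192, and SGEMM performs 2*N^3 operations
# (one multiply and one add per inner-product term).
N = 8192
nominal_flop_count = 2 * N**3
nominal = achieved_gflops(nominal_flop_count, elapsed_seconds=87e-3)

print(f"peak      : {peak:.0f} GFLOPS")
print(f"reported  : {reported:.0f} GFLOPS ({ratio_percent:.1f}% of peak)")
print(f"nominal   : {nominal:.0f} GFLOPS (from 2*N^3 and the rounded 87 ms)")
```

The nominal figure differs slightly from the reported one because 87 ms is a rounded time and the profiler's operation count need not equal 2*N^3 exactly.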
Problem:
The measured FLOPS is greater than the theoretical FLOPS (102% of peak).
This is seen only for larger matrix sizes; for smaller matrix sizes the measured value seems to stay below 100%.
Any idea what could explain this behaviour?