Metrics divergence on sgemm vs matrixMul

Hi everyone,

I’m using nvprof on a K20c to profile two matrix multiplication implementations: cuBLAS sgemm and matrixMul from the NVIDIA samples (samples/0_Simple/matrixMul/).
I’m collecting two metrics, flop_count_sp_fma and inst_compute_ld_st, using the same input for both implementations. The matrices A and B are both 4096x4096.
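
For reference, a quick sanity check on what flop_count_sp_fma should roughly report for this problem size. This is only a sketch, assuming a standard dense N x N multiplication that performs one multiply-add per (i, j, k) triple and that the metric counts each multiply-add once; the variable names are illustrative and not taken from nvprof or the samples.

```python
# Expected FMA count for C = A * B with square N x N matrices.
# Assumption: one fused multiply-add per (i, j, k) triple, i.e. N^3 in total.
N = 4096
expected_fma = N ** 3

# FMA counts reported in this post (flop_count_sp_fma).
measured_fma = {"sgemm": 6.87197e10, "matrixMul": 6.87363e10}

for name, count in measured_fma.items():
    # A ratio close to 1.0 means the kernel does about the textbook amount of work.
    print(f"{name}: measured / expected = {count / expected_fma:.5f}")
```

Both measured counts land within a small fraction of a percent of N^3, which suggests the two kernels perform essentially the same arithmetic.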

The results for sgemm and matrixMul are:

sgemm: 6.87197E+10 FMA instructions
matrixMul: 6.87363E+10 FMA instructions
#matrixMul_fma / #sgemm_fma = 1.0002, almost the same

sgemm: 3.92E+09 load/store instructions
matrixMul: 9.45E+10 load/store instructions
#matrixMul_ld_st / #sgemm_ld_st = 24.1, i.e. about 24x more instructions?
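
To make the comparison easier to reproduce, here is a small sketch that recomputes the ratios from the numbers above and normalizes the load/store counts by the number of output elements (N x N). The per-element normalization is only an illustrative way to look at the data, not something nvprof reports directly.

```python
# Counts copied from the nvprof results in this post.
N = 4096
fma = {"sgemm": 6.87197e10, "matrixMul": 6.87363e10}
ld_st = {"sgemm": 3.92e9, "matrixMul": 9.45e10}

# Ratios of matrixMul to sgemm for each metric.
fma_ratio = fma["matrixMul"] / fma["sgemm"]
ld_st_ratio = ld_st["matrixMul"] / ld_st["sgemm"]

# Load/store instructions per element of the output matrix C.
ld_st_per_element = {name: count / (N * N) for name, count in ld_st.items()}

print(f"FMA ratio (matrixMul / sgemm):   {fma_ratio:.4f}")
print(f"LD/ST ratio (matrixMul / sgemm): {ld_st_ratio:.1f}")
for name, value in ld_st_per_element.items():
    print(f"{name}: {value:.1f} load/store instructions per output element")
```

Seen this way, matrixMul issues roughly 5,633 load/store instructions per output element, while sgemm issues roughly 234.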

I expected a difference in the number of load/store instructions, but 24x seems too large. Is this much LD/ST variation between the two versions of matrix multiplication normal?