Instruction throughput for CUBLAS Level 3 functions

I am calling CUBLAS Level 3 functions (dgemm), and the Visual Profiler reports the following instruction throughput:

Method               Calls  GPU usec     GPU time %  Instruction throughput
dgemm_main_hw_na_nb  730    1.08037e+06  43.88       0.0216612
dgemm_main_hw_ta_nb  365    823865       33.46       0

As CUBLAS requires, I already store the matrices in column-major order. Does anyone have an idea why the reported throughput is so low, and why it is 0 for the second kernel?
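
For reference, a minimal C sketch of the column-major layout that CUBLAS expects. The helper name `row_to_col_major` and the row-major source buffer are assumptions made for illustration and are not taken from the original program; `IDX2C` follows the macro of the same name shown in the CUBLAS documentation.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index of element (row, col), where ld is the leading
   dimension, i.e. the number of allocated rows per column. */
#define IDX2C(row, col, ld) ((size_t)(col) * (size_t)(ld) + (size_t)(row))

/* Hypothetical helper: copy a rows x cols matrix stored row-major in src
   into dst using column-major order with leading dimension ld.
   dst must hold at least ld * cols doubles. */
void row_to_col_major(const double *src, double *dst,
                      size_t rows, size_t cols, size_t ld)
{
    assert(ld >= rows); /* a column must fit inside its leading dimension */
    for (size_t c = 0; c < cols; ++c) {
        for (size_t r = 0; r < rows; ++r) {
            dst[IDX2C(r, c, ld)] = src[r * cols + c];
        }
    }
}
```

A row-major array passed to CUBLAS without such a conversion is interpreted as the transpose of the intended matrix, so the layout affects the correctness of the result. Whether it has any bearing on the profiler's throughput counter is a separate question.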