How exactly do I interpret the profiling information given by running the matrix multiplication sample examples? I mean, how does the MFLOPS result correspond to the time taken for execution in ms?
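MFLOPS (millions of floating-point operations per second) is a throughput metric: the amount of work performed divided by the time it took. For a fixed problem size, it is therefore inversely proportional to the execution time in ms. Since profiling always involves some overhead (some methods more than others), I would expect the reported MFLOPS for MATMUL to be lower when it is run under a profiler.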
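To make the relationship between the two numbers concrete, here is a minimal sketch in Python. It assumes the conventional operation count of 2*m*n*k for an (m x k) by (k x n) product (one multiply and one add per inner-loop step); the sample you are running may count operations differently. The function names and the timing value are hypothetical, chosen only for illustration.

```python
def matmul_flops(m, n, k):
    """Conventional FLOP count for an (m x k) by (k x n) matrix product.

    Assumes one multiply and one add per inner-loop step; a given
    sample may use a different count.
    """
    return 2 * m * n * k


def mflops(flop_count, elapsed_ms):
    """Convert a FLOP count and an elapsed time in ms to MFLOPS."""
    seconds = elapsed_ms / 1000.0
    return flop_count / seconds / 1e6


# Hypothetical example: a 1024 x 1024 x 1024 product measured at 10 ms.
flops = matmul_flops(1024, 1024, 1024)
print(f"{mflops(flops, 10.0):.1f} MFLOPS")
```

With the work held fixed, halving the measured time doubles the MFLOPS figure, so any overhead that inflates the time lowers the reported throughput by the same factor.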
Note that profiling is best used for comparing the relative performance of different parts of your code, not for measuring overall performance.