What is the measured performance of MMM with a lower-level API?

MMM: matrix-matrix multiplication.
As far as I can see, MMM with CUDA peaks at about 200 GFLOPS.
With a lower-level API, does the performance get any better?
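
For anyone comparing numbers: GFLOPS figures for MMM are usually derived by counting 2*n^3 floating-point operations for an n x n multiply and dividing by the wall time. A minimal sketch of that arithmetic, assuming square matrices and that conventional operation count (the function name `mmm_gflops` and the sample timing are illustrative, not measurements):

```python
def mmm_gflops(n: int, seconds: float) -> float:
    """Return GFLOPS for an (n x n) * (n x n) multiply that took `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Conventional count: n^3 multiplies plus n^3 adds.
    flops = 2.0 * n ** 3
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timing, chosen only to illustrate the calculation.
    print(f"{mmm_gflops(4096, 0.687):.1f} GFLOPS")
```

Reporting the matrix size and precision alongside the GFLOPS figure would make results from different APIs easier to compare.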