CUBLAS 3.0 DGEMM performance on Tesla Fermi

Is anyone else seeing a drastic performance difference (greater than 100 GFLOPS) in CUBLAS 3.0 DGEMM between matrix dimensions that are multiples of 48 and matrix dimensions that are not?

This is true regardless of your GPU. CUBLAS’s implementation of DGEMM has a “sweet spot” when the matrix dimensions are multiples of 48.
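
If you want to compare the two cases on your own card, or work around the slow sizes, one possible approach is to round each dimension up to the next multiple of 48, zero-pad the matrices to that size, and time DGEMM on both the original and the padded problem. Below is a minimal sketch of the host-side arithmetic only, in plain C. The names `DGEMM_TILE`, `round_up_to_multiple`, and `dgemm_gflops` are hypothetical helpers made up for this post, not part of CUBLAS, and the value 48 comes from the observation in this thread rather than from any documented CUBLAS constant. The GFLOPS figure uses the conventional 2*m*n*k operation count for GEMM.

```c
#include <stddef.h>

/* Assumed from the observation in this thread; not a documented CUBLAS constant. */
#define DGEMM_TILE 48

/* Round n up to the next multiple of `multiple` (multiple must be > 0).
 * A value that is already a multiple is returned unchanged. */
static size_t round_up_to_multiple(size_t n, size_t multiple)
{
    return ((n + multiple - 1) / multiple) * multiple;
}

/* Achieved GFLOPS for an m x n x k DGEMM that took `seconds` to run,
 * using the conventional 2*m*n*k floating-point operation count. */
static double dgemm_gflops(size_t m, size_t n, size_t k, double seconds)
{
    return 2.0 * (double)m * (double)n * (double)k / seconds / 1e9;
}
```

For example, `round_up_to_multiple(1000, DGEMM_TILE)` gives 1008, so a 1000 x 1000 problem would be padded by 8 rows and columns. When you compute GFLOPS for the padded run, pass the original m, n, k to `dgemm_gflops` so the extra zero work is not counted as useful throughput. Whether padding actually pays off depends on your sizes and on the cost of copying into the padded buffers, so measure before relying on it.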