Is anyone else seeing a drastic performance difference (more than 100 GFLOPS) in CUBLAS 3.0 DGEMM between matrix dimensions that are multiples of 48 and dimensions that are not?
This is true regardless of your GPU. CUBLAS’s DGEMM implementation has a “sweet spot” when the matrix dimensions are multiples of 48.
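For anyone who wants to compare numbers on their own card, here is a rough sketch of the arithmetic involved. It assumes the usual convention of counting 2*m*n*k floating-point operations for a DGEMM, and it makes no CUBLAS calls at all: you time the call yourself and plug in the seconds. The helper names (`round_up`, `dgemm_gflops`, `padding_overhead`) are made up for illustration, and padding to the next multiple of 48 is only a possible workaround I am assuming, not something CUBLAS documents.

```python
def round_up(n, multiple=48):
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def dgemm_gflops(m, n, k, seconds):
    """GFLOPS for C = alpha*A*B + beta*C with A (m x k) and B (k x n).

    Uses the conventional 2*m*n*k flop count (one multiply and one add
    per inner-product term); the alpha/beta scaling is ignored.
    """
    return 2.0 * m * n * k / seconds / 1e9


def padding_overhead(m, n, k, multiple=48):
    """Ratio of flops after padding every dimension up to a multiple
    of `multiple`, relative to the unpadded problem (hypothetical
    workaround: zero-pad the matrices, then discard the extra rows
    and columns of the result)."""
    padded = round_up(m, multiple) * round_up(n, multiple) * round_up(k, multiple)
    return padded / float(m * n * k)


if __name__ == "__main__":
    for size in (960, 1000, 1008):
        print(size, round_up(size), padding_overhead(size, size, size))
```

If the padded run is faster by more than the overhead ratio, padding pays for itself at that size; otherwise it does not.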