Fermi DGEMM

Would it be a better idea to store both input matrices in shared memory on Fermi, given (1) the larger shared memory and (2) the higher ratio of peak double-precision performance to memory bandwidth compared with pre-Fermi GPUs? Any initial findings?
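
To make the two points concrete, here is a rough back-of-the-envelope sketch, not a measurement. It assumes a plain square-tile scheme in which a thread block stages one T x T tile of A and one T x T tile of B in shared memory per step, with no register blocking. The function names are made up for illustration, and the hardware figures are nominal numbers I am assuming for a Tesla C1060 (about 78 GFLOP/s DP, 102 GB/s) and a Tesla C2050 (about 515 GFLOP/s DP, 144 GB/s); substitute the numbers for your own card.

```python
import math

DOUBLE_BYTES = 8  # size of one double-precision element


def tile_intensity(tile: int, elem_bytes: int = DOUBLE_BYTES) -> float:
    """Flops per byte of global-memory traffic for one tile step of C += A*B
    when both the A tile and the B tile are staged in shared memory."""
    flops = 2 * tile ** 3                      # tile^3 multiply-adds
    bytes_loaded = 2 * tile ** 2 * elem_bytes  # one A tile plus one B tile
    return flops / bytes_loaded


def shared_bytes(tile: int, elem_bytes: int = DOUBLE_BYTES) -> int:
    """Shared memory needed to hold one A tile and one B tile."""
    return 2 * tile * tile * elem_bytes


def machine_balance(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Flops the chip can execute per byte it can fetch from global memory."""
    return peak_gflops / bandwidth_gbs


def min_tile_for_compute_bound(peak_gflops: float, bandwidth_gbs: float) -> int:
    """Smallest tile edge whose intensity reaches the machine balance.
    Under this model, intensity is tile / elem_bytes flops per byte."""
    return math.ceil(DOUBLE_BYTES * machine_balance(peak_gflops, bandwidth_gbs))


# ASSUMED nominal figures (peak DP GFLOP/s, GB/s); not taken from the post.
cards = {
    "pre-Fermi (Tesla C1060, assumed)": (78.0, 102.0),
    "Fermi (Tesla C2050, assumed)": (515.0, 144.0),
}

for name, (peak, bw) in cards.items():
    t = min_tile_for_compute_bound(peak, bw)
    print(f"{name}: balance {machine_balance(peak, bw):.2f} flop/byte, "
          f"tile edge >= {t}, shared memory {shared_bytes(t)} bytes")
```

Under these assumptions, the tile edge needed to stay compute-bound grows with the flop-to-bandwidth ratio, and the shared-memory footprint grows with the square of that edge. That is why the larger shared memory on Fermi matters for this question. The model ignores register blocking, caches, and bank conflicts, so treat it only as a first-order estimate.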