Hello all. Can anyone explain why the GTX 680 is about two times slower than the GTX 580 on the MatrixTranspose sample from the SDK? Does partition camping occur on the Kepler family?

The performance reported by MatrixTranspose seems to be quite sensitive to the shape of the block used. With some quick fiddling, I was able to change the reported performance quite a bit on both devices. For the simple copy case, increasing the tile size by a factor of 2 in each dimension (and testing with larger arrays) seemed to help the GTX 680 reach nearly the same performance as the GTX 580, which isn’t surprising given the much larger number of CUDA cores per multiprocessor on the GTX 680 (192 per SMX, versus 32 per SM on the GTX 580).

I have generally found that the GTX 680 is easy to underutilize if you use the same block configuration as with older GPUs.
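
For anyone who wants to repeat the tile-size experiment, here is a minimal sketch of a simple copy kernel with the tile shape as a template parameter. This is not the SDK sample's code: `copyTiled`, `copyBandwidthGBs`, and the `TILE`/`ROWS` parameters are names made up for illustration, and the timing approach (CUDA events around repeated launches) is just one reasonable way to measure it.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical copy kernel: each block handles one TILE x TILE tile of an
// n x n matrix. A block has TILE x ROWS threads, so each thread copies
// TILE / ROWS elements, stepping down the tile ROWS rows at a time.
template <int TILE, int ROWS>
__global__ void copyTiled(float *out, const float *in, int n)
{
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    for (int j = 0; j < TILE; j += ROWS)
        if (x < n && y + j < n)
            out[(y + j) * n + x] = in[(y + j) * n + x];
}

// Runs the copy `reps` times on an n x n matrix and returns the effective
// bandwidth in GB/s, or a negative value on an error or a mismatched result.
template <int TILE, int ROWS>
float copyBandwidthGBs(int n, int reps)
{
    static_assert(TILE % ROWS == 0, "ROWS must divide TILE");
    size_t count = size_t(n) * n;
    size_t bytes = count * sizeof(float);
    std::vector<float> h_in(count), h_out(count, 0.0f);
    for (size_t i = 0; i < count; ++i) h_in[i] = float(i % 1024);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1.0f; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, ROWS);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copyTiled<TILE, ROWS><<<grid, block>>>(d_out, d_in, n);  // warm-up launch
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        copyTiled<TILE, ROWS><<<grid, block>>>(d_out, d_in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    bool ok = (cudaGetLastError() == cudaSuccess) && (h_out == h_in);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);

    // Each launch reads and writes the whole matrix: 2 * bytes of traffic.
    return ok ? float(2.0 * double(bytes) * reps / (double(ms) * 1.0e6)) : -1.0f;
}
```

Calling, say, `copyBandwidthGBs<16, 16>(4096, 100)` and `copyBandwidthGBs<32, 32>(4096, 100)` from `main` on each card should show how much the block shape matters. The specific shapes and matrix size here are guesses on my part, so try a few combinations (for example `<32, 8>`) rather than trusting any single one.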