3x speed-up using GTX 295 over C1060 on the SDK matrix transpose, any reason?

Hello,

I am comparing results from the matrix transpose example in the SDK. Running exactly the same matrix size (2048 x 2048) on a GTX 295 and a Tesla C1060, the GTX 295 gives roughly a 3x speed-up over the C1060. This is a bit surprising to me, to be honest. I checked ./bandwidthTest, and the two cards are quite comparable, except that the GTX 295 has better device-to-device bandwidth, which I think explains only a small part of the speed-up, especially since the C1060 has a faster clock rate per core than the GTX 295. I am not sure if I am heading in the right direction, but could someone give me some pointers and help me understand this?

I am attaching what I got as follows. Thanks!

[codebox]/////// Results using GTX 295

Device Numbers: 3

device 2

Using device 2: GeForce GTX 295

Transposing a 2048 by 2048 matrix of floats…

Optimized transpose average time: 0.629 ms

Time Cost: 0.629120 (ms), Bandwidth Measurement: 49.672558 (GB/s)

Test PASSED

Press ENTER to exit…[/codebox]

[codebox]/////// Results using Tesla C1060

Device Numbers: 3

device 1

Using device 1: Tesla C1060

Transposing a 2048 by 2048 matrix of floats…

Optimized transpose average time: 1.901 ms

Time Cost: 1.900864 (ms), Bandwidth Measurement: 16.439892 (GB/s)

Test PASSED

Press ENTER to exit…[/codebox]

I just ran into this issue the other day with one of my algorithms: partition camping.

Credit to tmurray for pointing this out to me:
http://forums.nvidia.com/index.php?showtopic=96423

The GTX 295 should be a bit faster - it has a higher memory clock than the C1060 - but the difference should be maybe 20%, not a factor of three. It is quite possibly partition camping. The C1060 has a 512-bit memory interface (so 8 partitions), whereas each GPU of the GTX 295 has a 448-bit interface (so 7 partitions). Unequal distribution of global memory accesses across the DRAM partitions can cause what NVIDIA have christened partition camping, which can reduce the achievable peak global memory bandwidth. There is a discussion of it in some of the 2.3 toolkit documentation, probably the Best Practices Guide IIRC.

Hi apangborn, avidday,

Thanks for pointing this reference out, I will go read it.