Hello,

I am comparing results from the matrix transpose example in the SDK. Running it with exactly the same matrix size (2048 x 2048) on a GTX 295 and a Tesla C1060, I found that the GTX 295 is about 3x faster than the C1060. To be honest, this surprised me. I ran ./bandwidthTest on both cards and they are quite comparable, except that the GTX 295 has better device-to-device bandwidth, which I don't think contributes much to the speed-up, not to mention that the C1060 has a faster per-core clock rate than the GTX 295. I am not sure whether I am on the right track, but could someone give me some pointers and help me understand this?

My results are attached below. Thanks!

[codebox]/////// Results using GTX 295

Device Numbers: 3

device 2

Using device 2: GeForce GTX 295

Transposing a 2048 by 2048 matrix of floats…

Optimized transpose average time: 0.629 ms

Time Cost: 0.629120 (ms), Bandwidth Measurement: 49.672558 (GB/s)

Test PASSED

Press ENTER to exit…[/codebox]

[codebox]/////// Results using Tesla C1060

Device Numbers: 3

device 1

Using device 1: Tesla C1060

Transposing a 2048 by 2048 matrix of floats…

Optimized transpose average time: 1.901 ms

Time Cost: 1.900864 (ms), Bandwidth Measurement: 16.439892 (GB/s)

Test PASSED

Press ENTER to exit…[/codebox]
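
As a sanity check on the numbers above, here is a small sketch that recomputes the bandwidth figures from the reported times. It assumes the sample counts one read plus one write of the whole matrix and treats 1 GB as 2^30 bytes. That is my guess at the formula, not something confirmed from the SDK source, and the function name is just for illustration.

```python
# Assumption: bandwidth = (one read + one write of the matrix) / elapsed time,
# with 1 GB taken as 2**30 bytes. This formula is a guess, not taken from the SDK.
def effective_bandwidth_gb_s(n, time_ms, bytes_per_elem=4):
    """Effective bandwidth in GB/s for transposing an n x n matrix."""
    bytes_moved = 2 * n * n * bytes_per_elem   # read every element, write every element
    return bytes_moved / 2**30 / (time_ms / 1000.0)

# Times copied from the two runs above (2048 x 2048 floats).
gtx295 = effective_bandwidth_gb_s(2048, 0.629120)
c1060 = effective_bandwidth_gb_s(2048, 1.900864)
ratio = gtx295 / c1060

print(f"GTX 295: {gtx295:.3f} GB/s")
print(f"C1060:   {c1060:.3f} GB/s")
print(f"ratio:   {ratio:.2f}x")
```

Under that assumption the recomputed values agree with the reported 49.67 GB/s and 16.44 GB/s, so the roughly 3x gap is entirely in the measured kernel time and not an artifact of how the bandwidth is derived.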