Hello,
I am comparing the results of the matrix transpose example from the SDK on two cards. Running exactly the same matrix size (2048 x 2048), the GTX 295 is about 3x faster than the Tesla C1060, which surprised me, to be honest. I ran ./bandwidthTest and the two cards are quite comparable, except that the GTX 295 has better device-to-device bandwidth, which I do not think is enough to explain the speed-up, especially since the C1060 has a faster per-core clock rate than the GTX 295. I am not sure if I am on the right track, but could someone give me some pointers and help me understand this?
I am attaching my results below. Thanks!
[codebox]/////// Results using GTX 295
Device Numbers: 3
device 2
Using device 2: GeForce GTX 295
Transposing a 2048 by 2048 matrix of floats…
Optimized transpose average time: 0.629 ms
Time Cost: 0.629120 (ms), Bandwidth Measurement: 49.672558 (GB/s)
Test PASSED
Press ENTER to exit…[/codebox]
[codebox]/////// Results using Tesla C1060
Device Numbers: 3
device 1
Using device 1: Tesla C1060
Transposing a 2048 by 2048 matrix of floats…
Optimized transpose average time: 1.901 ms
Time Cost: 1.900864 (ms), Bandwidth Measurement: 16.439892 (GB/s)
Test PASSED
Press ENTER to exit…[/codebox]
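As a sanity check on the numbers above, here is a minimal sketch of how the reported bandwidth seems to follow from the measured time. It rests on two assumptions of mine that I have not confirmed against the SDK source: the transpose is counted as one read plus one write of the whole matrix, and "GB/s" means 2^30 bytes per second. The function name `effective_bandwidth_gib_s` is just a hypothetical helper for illustration.

```python
def effective_bandwidth_gib_s(n, elapsed_ms, bytes_per_elem=4):
    """Effective bandwidth for transposing an n x n matrix.

    Assumption: every element is read once and written once,
    and the result is expressed in units of 2**30 bytes per second.
    """
    bytes_moved = 2 * n * n * bytes_per_elem   # one read + one write
    seconds = elapsed_ms * 1e-3
    return bytes_moved / seconds / 2**30


# Times taken from the two runs posted above.
gtx295 = effective_bandwidth_gib_s(2048, 0.629120)
c1060 = effective_bandwidth_gib_s(2048, 1.900864)

print(f"GTX 295: {gtx295:.2f}")            # 49.67
print(f"C1060:   {c1060:.2f}")             # 16.44
print(f"ratio:   {gtx295 / c1060:.2f}")    # 3.02
```

Under those assumptions both figures match the program output, so the 3x gap comes from the measured kernel time itself and not from how the bandwidth is computed.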