Observed performance difference between GTX 275, 285 and 295: different behavior of the GTX 285

We implemented our application in CUDA and run it on a GTX 295. Because of the GTX 295 shortages, we thought about trying our application on a 275 and a 285. However, we found a bizarre behavior: our application runs much slower on the GTX 285 than on the GTX 275. To study this further, we ran the same SDK examples; the results are as follows.

For the Nbody and smokeParticles examples, the ordering is generally 285 > 275 > one GPU of the 295, which is what we expected. We observe a 20% increase in Gflops and fps on the GTX 285 over the GTX 275. A single GPU of the GTX 295 performs about the same as the GTX 275 in Nbody and about 20% slower than the GTX 275 in smokeParticles.

bandwidthTest reports the following bandwidth in MB/s:

                   GTX 285   GTX 275   single GPU of GTX 295
Host to Device        2332      2970      3076
Device to Host        2051      3468      2884
Device to Device    127687    105090     96079

As expected, the GTX 285 shows the fastest device-to-device copy bandwidth. The host-to-device and device-to-host behavior seems a little weird, though.

The difference starts to show up in transpose and transposeNew:

                                   GTX 285    GTX 275    single GPU of GTX 295
Naive transpose average time:      3.367 ms   2.019 ms   1.259 ms
Optimized transpose average time:  0.279 ms   0.167 ms   0.183 ms

Here the GTX 285 is the slowest! I don't know why the naive transpose runs fastest on the GTX 295 and the optimized transpose runs fastest on the GTX 275.

[TransposeNew] prints the same banner on all three devices (Device 0: "GeForce GTX 285" / "GeForce GTX 275" / "GeForce GTX 295"):

> SM Capability 1.3 detected
> CUDA device has 30 Multi-Processors
> SM performance scaling factor = 1.00

Matrix size: 2048x2048 (64x64 tiles), tile size: 32x32, block size: 32x8

Throughput in GB/s, reported as "loop over kernel / loop within kernel":

Kernel                    GTX 285          GTX 275          single GPU of GTX 295
simple copy               118.61 / 121.45   91.54 /  65.88   89.22 / 64.69
shared mem copy            94.67 /  96.55   93.45 /  95.71   83.77 / 87.17
naive transpose             2.97 /   2.99    4.76 /   4.75    4.23 /  4.22
coalesced transpose        21.83 /  22.96   65.87 /  77.58   62.67 / 69.15
no bank conflict trans     22.07 /  23.00   79.94 /  78.92   70.86 / 70.22
coarse-grained             22.05 /  22.98   74.16 /  78.98   70.97 / 70.08
fine-grained               93.81 /  96.77   92.20 /  95.78   83.61 / 86.46
diagonal transpose         88.92 /  98.18   70.38 /  74.65   62.96 / 66.25

The biggest difference is that the GTX 285 performs badly on the coalesced transpose and its variants (no bank conflict and coarse-grained).
Given the gap between those and the diagonal transpose, it looks like a partition camping problem. But then why don't the GTX 275 and GTX 295 have the same issue?

The GTX 275 and each GPU of the GTX 295 have 7 memory partitions instead of 8, which means they are not nearly as susceptible as the GTX 285 to partition camping from data structures with power-of-two dimensions.

Thanks. That explains the results I got. And how do I figure out the number of partitions of a card? By dividing the memory by 128MB?

Divide the memory controller pin/bit count by 64. So a 512-bit card like the GTX 285 or Tesla C1060 has 512/64 = 8 partitions, and a 448-bit card like the GTX 275 has 448/64 = 7 partitions.

It may seem odd that the post that left me quizzical is also the shortest. Would someone mind explaining in a little more detail what this means? At what level of the hierarchy do these partitions sit, speaking in terms of GPC -> SMs -> CUDA cores?

Thanks for the effort :)

You can read more about partition camping in the whitepaper inside the matrixTranspose example in the CUDA SDK (page 16, to be precise).
Partition camping is not a problem on Fermi GPUs.