Observed performance difference between GTX 275, 285, and 295: a different behavior of the GTX 285

We implemented our application in CUDA and run it on a GTX 295. Because of the GTX 295 shortages, we thought about trying the application on the GTX 275 and GTX 285. However, we found a bizarre behavior: our application runs much slower on the GTX 285 than on the GTX 275. To study this further, we ran the same SDK examples on all three cards; the results are as follows.

For the Nbody and smokeParticles examples, the speed is generally 285 > 275 > one GPU of the 295, which is what we expected. We observe a 20% increase in Gflops and fps on the GTX 285 over the GTX 275. A single GPU of the GTX 295 is around the same performance as the GTX 275 in Nbody and 20% slower than the GTX 275 in smokeParticles.

bandwidthTest shows the bandwidth in MB/s as follows:

                    GTX 285   GTX 275   single GPU of GTX 295
Host to Device        2332      2970      3076
Device to Host        2051      3468      2884
Device to Device    127687    105090     96079

The GTX 285 shows the fastest device-to-device copy bandwidth, as expected, but the host-to-device and device-to-host numbers look a little odd.
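For reference, these numbers come from timing a large host/device copy. Below is a minimal sketch of the same kind of measurement (pageable host memory, CUDA event timing); it is not the SDK source, so the absolute numbers will differ, and pinned memory would report higher figures.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time one large host-to-device copy with CUDA events and report MB/s.
int main()
{
    const size_t bytes = 32 << 20;                  // 32 MB transfer
    float *h_buf = (float *)malloc(bytes);          // pageable host buffer
    float *d_buf = NULL;
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H->D: %.0f MB/s\n", (bytes / (1024.0 * 1024.0)) / (ms / 1000.0));

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}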

The differences start to show up in the transpose and transposeNew examples:

                                     GTX 285    GTX 275    single GPU of GTX 295
Naive transpose average time:        3.367 ms   2.019 ms   1.259 ms
Optimized transpose average time:    0.279 ms   0.167 ms   0.183 ms
Here the GTX 285 is the slowest! I don't know why the naive transpose runs fastest on the GTX 295 and why the optimized transpose runs fastest on the GTX 275.
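To make the comparison concrete, here is a minimal sketch of what "naive" versus "optimized" means in this sample. It is not the SDK source: it is simplified to one element per thread with a tile size of 16 (the transposeNew run above uses 32x32 tiles with 32x8 blocks), launched with dim3 block(TILE_DIM, TILE_DIM) and grid(width/TILE_DIM, height/TILE_DIM). The naive kernel writes to global memory with a stride of width elements, so its writes are uncoalesced; the optimized kernel stages a tile in shared memory so both the global reads and the global writes are coalesced.

#define TILE_DIM 16

// Naive transpose: reads are coalesced, but writes stride through global
// memory by 'width' elements, so they are uncoalesced.
__global__ void transposeNaive(float *odata, const float *idata,
                               int width, int height)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        odata[x * height + y] = idata[y * width + x];
}

// Optimized transpose: stage a tile in shared memory, then write it back
// with swapped block indices so the global writes are coalesced too.
// The +1 pad avoids shared-memory bank conflicts when the tile is read
// column-wise.
__global__ void transposeCoalesced(float *odata, const float *idata,
                                   int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = idata[y * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;   // block indices swapped
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
}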

TransposeNew (all three cards report SM capability 1.3, 30 multiprocessors, and an SM performance scaling factor of 1.00):

Matrix size: 2048x2048 (64x64 tiles), tile size: 32x32, block size: 32x8

Bandwidth in GB/s, reported as "loop over kernel / loop within kernel":

Kernel                    GTX 285           GTX 275          single GPU of GTX 295
simple copy               118.61 / 121.45   91.54 / 65.88    89.22 / 64.69
shared mem copy            94.67 /  96.55   93.45 / 95.71    83.77 / 87.17
naive transpose             2.97 /   2.99    4.76 /  4.75     4.23 /  4.22
coalesced transpose        21.83 /  22.96   65.87 / 77.58    62.67 / 69.15
no bank conflict trans     22.07 /  23.00   79.94 / 78.92    70.86 / 70.22
coarse-grained             22.05 /  22.98   74.16 / 78.98    70.97 / 70.08
fine-grained               93.81 /  96.77   92.20 / 95.78    83.61 / 86.46
diagonal transpose         88.92 /  98.18   70.38 / 74.65    62.96 / 66.25

The biggest difference is that the GTX 285 performs badly on the coalesced transpose and its variants (no bank conflict and coarse-grained).
Note the gap between those and the diagonal transpose; it looks like a partition camping problem. But then why don't the GTX 275 and GTX 295 have the same issue?
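For what it's worth, the diagonal kernel in the sample works by remapping block indices along diagonals of the tile grid, so that blocks running concurrently do not all write into the same memory partition. A minimal sketch of the idea, simplified from the SDK kernel to one element per thread and assuming width and height are multiples of the tile size:

#define TILE_DIM 16   // same tile size as the earlier sketch

__global__ void transposeDiagonal(float *odata, const float *idata,
                                  int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    // Reinterpret the flat block index diagonally: consecutive blocks walk
    // down a diagonal of the tile grid instead of along a row, spreading
    // their writes across memory partitions.
    int blockIdx_x, blockIdx_y;
    if (width == height) {
        blockIdx_y = blockIdx.x;
        blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;
    } else {
        int bid = blockIdx.x + gridDim.x * blockIdx.y;
        blockIdx_y = bid % gridDim.y;
        blockIdx_x = ((bid / gridDim.y) + blockIdx_y) % gridDim.x;
    }

    int x = blockIdx_x * TILE_DIM + threadIdx.x;
    int y = blockIdx_y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = idata[y * width + x];

    __syncthreads();

    x = blockIdx_y * TILE_DIM + threadIdx.x;
    y = blockIdx_x * TILE_DIM + threadIdx.y;
    odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
}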

The GTX 275 and GTX 295 each have 7 memory partitions instead of 8, which means they are not nearly as susceptible as the GTX 285 to partition camping from data structures with power-of-two dimensions.
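A rough back-of-the-envelope illustration (the 256-byte partition interleaving granularity comes from the SDK transpose whitepaper for 200-series GPUs; the snippet is only illustrative): a 2048x2048 float matrix has a row pitch of 8192 bytes, an exact multiple of 8 x 256, so with 8 partitions every row starts in the same partition, whereas with 7 partitions the rows rotate through all partitions.

#include <stdio.h>

// Illustrative only: global memory is interleaved across partitions in
// 256-byte chunks (per the SDK transpose whitepaper).  Print which partition
// each row of a 2048x2048 float matrix starts in, for an 8-partition card
// (GTX 285) versus a 7-partition card (GTX 275, or one GPU of a GTX 295).
int main(void)
{
    const long row_bytes = 2048 * sizeof(float);   /* 8192-byte row pitch  */
    const long chunk     = 256;                    /* partition interleave */

    for (int partitions = 8; partitions >= 7; --partitions) {
        printf("%d partitions:", partitions);
        for (long row = 0; row < 8; ++row)
            printf(" %ld", (row * row_bytes / chunk) % partitions);
        printf(" ...\n");
    }
    return 0;
}
// Output:
//   8 partitions: 0 0 0 0 0 0 0 0 ...   (every row starts in partition 0)
//   7 partitions: 0 4 1 5 2 6 3 0 ...   (rows rotate through all partitions)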

Thanks, that explains the results I got. How do I figure out the number of partitions of a card? By dividing the memory size by 128 MB?

Divide the memory interface width in bits by 64. So a 512-bit card like the GTX 285 or Tesla C1060 has 512/64 = 8 partitions, and a 448-bit card like the GTX 275 has 448/64 = 7 partitions.
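For reference, newer CUDA toolkits expose the bus width directly through cudaDeviceProp::memoryBusWidth; a minimal sketch is below. On the toolkits current when these cards shipped that field does not exist, so you have to look the bus width up in the card's specifications instead.

#include <cstdio>
#include <cuda_runtime.h>

// Query each device's memory bus width (in bits) and derive the partition
// count as bus width / 64, following the rule of thumb above.
int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, %d-bit bus -> %d partitions\n",
               dev, prop.name, prop.memoryBusWidth, prop.memoryBusWidth / 64);
    }
    return 0;
}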

It may seem odd, but the post that left me puzzled is also the shortest. Would someone mind explaining in a little more detail what this means? At what level of the hierarchy do these partitions sit, in terms of GPCs -> SMs -> CUDA cores?

Thanks for the effort :)

You can read more about partition camping in the whitepaper inside the matrixTranspose example in the CUDA SDK, on page 16 to be precise.
Partition camping is not a problem on Fermi GPUs.