We implemented our application in CUDA and run it on a GTX 295. Because GTX 295 cards are in short supply, we considered running the application on a GTX 275 or GTX 285 instead. However, we found some bizarre behavior: our application runs much slower on the GTX 285 than on the GTX 275. To study this further, we ran the same SDK examples on each card, with the following results:
For the Nbody and smokeParticles examples, the speed ordering is generally GTX 285 > GTX 275 > one GPU of the GTX 295, which is what we expected. We observe a 20% increase in GFLOPS and fps on the GTX 285 over the GTX 275. A single GPU of the GTX 295 performs about the same as the GTX 275 in Nbody and is 20% slower than the GTX 275 in smokeParticles.
BandwidthTest shows the following bandwidth in MB/s:
         GTX 285    GTX 275    Single GPU in GTX 295
H-D         2332       2970       3076
D-H         2051       3468       2884
D-D       127687     105090      96079
The GTX 285 shows the fastest device-to-device copy bandwidth, as expected. The host-to-device and device-to-host numbers look a little weird, though: the GTX 285 is the slowest in both directions.
The differences start to show up in transpose and transposeNew:
Average time in ms:
                       GTX 285    GTX 275    Single GPU in GTX 295
Naive transpose          3.367      2.019      1.259
Optimized transpose      0.279      0.167      0.183
Here the GTX 285 is the slowest! I don’t know why the naive transpose runs fastest on the GTX 295, or why the optimized transpose runs fastest on the GTX 275.
TransposeNew:
Devices tested (each as Device 0): “GeForce GTX 285”, “GeForce GTX 275”, “GeForce GTX 295”
All three report the same configuration:
> SM Capability 1.3 detected:
> CUDA device has 30 Multi-Processors
> SM performance scaling factor = 1.00
Matrix size: 2048x2048 (64x64 tiles), tile size: 32x32, block size: 32x8

Bandwidth in GB/s ("over" = loop over kernel, "within" = loop within kernel):
                          GTX 285            GTX 275            GTX 295
Kernel                    over     within    over     within    over     within
simple copy               118.61   121.45    91.54    65.88     89.22    64.69
shared mem copy            94.67    96.55    93.45    95.71     83.77    87.17
naive transpose             2.97     2.99     4.76     4.75      4.23     4.22
coalesced transpose        21.83    22.96    65.87    77.58     62.67    69.15
no bank conflict trans     22.07    23.00    79.94    78.92     70.86    70.22
coarse-grained             22.05    22.98    74.16    78.98     70.97    70.08
fine-grained               93.81    96.77    92.20    95.78     83.61    86.46
diagonal transpose         88.92    98.18    70.38    74.65     62.96    66.25
The biggest difference is that the GTX 285 performs badly in the coalesced transpose and its variants (no bank conflict and coarse-grained): roughly 22 to 23 GB/s, versus 63 to 80 GB/s on the other two cards.
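To put those bandwidth figures in terms of time per transpose, here is a rough conversion sketch. It rests on assumptions that the sample output does not confirm: 4-byte float elements, every element read once and written once, and 1 GB taken as 2**30 bytes. The function name is made up for illustration.

```python
# Rough conversion from effective bandwidth to time per transpose.
# Assumptions (not confirmed by the benchmark output): 4-byte elements,
# one read plus one write per element, and 1 GB = 2**30 bytes.

def transpose_time_ms(bandwidth_gb_s, n=2048, elem_bytes=4):
    """Estimated time in ms to transpose an n x n matrix at the given bandwidth."""
    bytes_moved = 2 * n * n * elem_bytes          # read + write
    seconds = bytes_moved / (bandwidth_gb_s * 2**30)
    return seconds * 1e3

# "Loop over kernel" figures for the coalesced transpose, taken from the table.
gtx285_ms = transpose_time_ms(21.83)
gtx275_ms = transpose_time_ms(65.87)
print(f"GTX 285: {gtx285_ms:.2f} ms, GTX 275: {gtx275_ms:.2f} ms, "
      f"ratio: {gtx285_ms / gtx275_ms:.1f}x")
```

Under these assumptions the coalesced transpose comes out about three times slower on the GTX 285 than on the GTX 275, which is simply the ratio of the two bandwidth figures.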
Note the difference between those and the diagonal transpose: it looks like a partition camping problem. But why don’t the GTX 275 and GTX 295 have the same issue?
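One possible explanation, offered as a guess rather than a confirmed diagnosis, is the number of memory partitions. The sketch below assumes the GTX 285 (512-bit bus) has 8 partitions while the GTX 275 and each GPU of the GTX 295 (448-bit bus) have 7, that partitions are 256 bytes wide, and that addresses are interleaved across them round-robin. It then lists which partitions are hit by the output tiles of the first 30 blocks along blockIdx.x, which in the coalesced transpose write down a single column of tiles. The function name and the simple modulo mapping are illustrative assumptions, not the real hardware address mapping.

```python
# Toy model of partition camping for the coalesced transpose.
# ASSUMPTIONS (not taken from the benchmark output): 256-byte partitions,
# simple round-robin interleaving, 8 partitions on the GTX 285,
# 7 partitions on the GTX 275 and on each GPU of the GTX 295.

PARTITION_BYTES = 256

def partitions_hit(num_partitions, width=2048, tile=32, elem_bytes=4, num_blocks=30):
    """Partitions touched by the output tiles of blocks (0,0) .. (num_blocks-1, 0).

    Block (bx, 0) writes the output tile that starts at row bx*tile, column 0,
    so consecutive blocks write tiles stacked down one column of the matrix.
    """
    hit = set()
    for bx in range(num_blocks):
        for row in range(tile):
            # Byte address of the first element of this tile row in the output.
            addr = (bx * tile + row) * width * elem_bytes
            hit.add((addr // PARTITION_BYTES) % num_partitions)
    return sorted(hit)

print("8 partitions:", partitions_hit(8))
print("7 partitions:", partitions_hit(7))
```

In this model, a row of 2048 floats is 8192 bytes, which is an exact multiple of 8 x 256 bytes, so with 8 partitions every tile in the column lands in the same partition. With 7 partitions the row pitch is not a multiple of 7 x 256 bytes, so the same writes spread over all 7. That would fit the diagonal transpose recovering the bandwidth on the GTX 285 only, but it should be checked against the actual hardware before relying on it.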