My GTX 470 arrived last week, and today I decided to run some tests. I used the nbody example from the SDK and compared the results with a GTX 295 (running on a single GPU). My results are below. There are also some interesting results when I run the binaries from the older 2.2 and 2.3 SDKs.
- GTX 470 running the 3.0 sdk binary: nbody -benchmark
Run “nbody -benchmark [-n=]” to measure perfomance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
Windowed mode
Simulation data stored in video memory
Single precision floating point simulation
Compute 2.0 CUDA device: [GeForce GTX 470]
14336 bodies, total time for 10 iterations: 79.549 ms
= 25.836 billion interactions per second
= 516.715 single-precision GFLOP/s at 20 flops per interaction
- GTX 470 running the 2.3 sdk binary: nbody -benchmark
Run “nbody -benchmark [-n=]” to measure perfomance.
14336 bodies, total time for 100 iterations: 677.278 ms
= 30.345 billion interactions per second
= 606.903 GFLOP/s at 20 flops per interaction
- GTX 470 running the 2.2 sdk binary: nbody -benchmark
Run “nbody -benchmark -n=” to measure perfomance.
14336 bodies, total time for 100 iterations: 673.859 ms
= 30.499 billion interactions per second
= 609.982 GFLOP/s at 20 flops per interaction
Interesting: the old binaries run faster than the one from the most recent SDK, 3.0. I remember someone else noting that single-precision performance is better when compiled with sm=1.3 than with sm=2.0. Maybe this is the same behavior?
We have another machine with a GTX 295 installed, so we ran the test there as well to see how the GTX 470 compares with the GTX 295.
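For anyone who wants to check the numbers: the rates in the output can be reproduced from the body count, iteration count and total time. This is a small Python sketch, assuming the benchmark counts bodies x bodies interactions per iteration (inferred from the printed numbers, not taken from the SDK source). The function name `nbody_rates` is mine.

```python
def nbody_rates(bodies, iterations, total_ms, flops_per_interaction=20):
    """Return (billion interactions/s, GFLOP/s) for an all-pairs n-body run.

    ASSUMPTION: the benchmark counts bodies * bodies interactions per
    iteration; this is inferred from the printed output above.
    """
    interactions = bodies * bodies * iterations
    seconds = total_ms / 1000.0
    ginteractions = interactions / seconds / 1e9
    return ginteractions, ginteractions * flops_per_interaction


# (label, bodies, iterations, total time in ms) copied from the GTX 470 runs
runs = [
    ("SDK 3.0", 14336, 10, 79.549),
    ("SDK 2.3", 14336, 100, 677.278),
    ("SDK 2.2", 14336, 100, 673.859),
]

for label, n, iters, ms in runs:
    gi, gflops = nbody_rates(n, iters, ms)
    print(f"{label}: {gi:.3f} billion interactions/s, {gflops:.3f} GFLOP/s")
```

Note that the 3.0 binary runs only 10 iterations where the older ones run 100, so its timing covers a much shorter window.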
- one core of GTX 295 running 3.0 sdk binary: nbody -benchmark
Run “nbody -benchmark [-n=]” to measure perfomance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
Windowed mode
Simulation data stored in video memory
Single precision floating point simulation
Compute 1.3 CUDA device: [GeForce GTX 295]
30720 bodies, total time for 10 iterations: 425.482 ms
= 22.180 billion interactions per second
= 443.600 single-precision GFLOP/s at 20 flops per interaction
- one core of GTX 295 running 2.3 sdk binary: nbody -benchmark
Run “nbody -benchmark [-n=]” to measure perfomance.
30720 bodies, total time for 100 iterations: 4998.113 ms
= 18.881 billion interactions per second
= 377.630 GFLOP/s at 20 flops per interaction
- one core of GTX 295 running 2.2 sdk binary: nbody -benchmark
Run “nbody -benchmark -n=” to measure perfomance.
30720 bodies, total time for 100 iterations: 4304.157 ms
= 21.926 billion interactions per second
= 438.515 GFLOP/s at 20 flops per interaction
Here, things are more interesting. The SDK 3.0 binary is about as fast as the SDK 2.2 binary (slightly faster, in fact), while the SDK 2.3 binary is about 14% slower than either; put the other way, the 3.0 binary is nearly 18% faster than the 2.3 one. I intended to check how the GTX 470 compares with the GTX 295, but I didn't know the answer depends on the binary as well. What changed between 2.2 and 2.3?
I am also disappointed to find that the GTX 470 is not much faster than a single GPU of the GTX 295, although it has more cores (448 vs 240): about 16% faster with the 3.0 binary, and a little under 40% faster when each card runs its fastest binary. So it seems that one GTX 480 can't beat the performance of a GTX 295 with both GPUs running, at least in single precision.
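To make the "slower" and "faster" percentages unambiguous, here is a short Python sketch using the GFLOP/s figures printed for the single GTX 295 GPU. The helper names are mine, and nothing here goes beyond the numbers above.

```python
# Single-precision GFLOP/s for one GPU of the GTX 295, keyed by SDK version.
gflops_295 = {"3.0": 443.600, "2.3": 377.630, "2.2": 438.515}


def percent_slower(slow, fast):
    """How much lower `slow` is, as a percentage of `fast`."""
    return (1.0 - slow / fast) * 100.0


def percent_faster(fast, slow):
    """How much higher `fast` is, as a percentage of `slow`."""
    return (fast / slow - 1.0) * 100.0


print(f"2.3 vs 2.2: {percent_slower(gflops_295['2.3'], gflops_295['2.2']):.1f}% slower")
print(f"2.3 vs 3.0: {percent_slower(gflops_295['2.3'], gflops_295['3.0']):.1f}% slower")
print(f"3.0 vs 2.3: {percent_faster(gflops_295['3.0'], gflops_295['2.3']):.1f}% faster")
```

The two directions give different percentages because the baseline changes, which is why "X% slower" and "X% faster" should not be used interchangeably.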
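The comparison between the two cards can be tabulated per binary with a few lines of Python. The dual-GPU figure is a rough upper bound that assumes ideal 2x scaling across both GPUs of the GTX 295, which I did not measure.

```python
# Single-precision GFLOP/s from the runs above, keyed by SDK version.
gtx470 = {"3.0": 516.715, "2.3": 606.903, "2.2": 609.982}
gtx295_one_gpu = {"3.0": 443.600, "2.3": 377.630, "2.2": 438.515}


def speedup(new, old):
    """Ratio of two throughput figures (values above 1 mean `new` is faster)."""
    return new / old


for sdk in ("3.0", "2.3", "2.2"):
    s = speedup(gtx470[sdk], gtx295_one_gpu[sdk])
    print(f"SDK {sdk}: GTX 470 is {s:.2f}x one GTX 295 GPU")

best_470 = max(gtx470.values())
best_295 = max(gtx295_one_gpu.values())

# ASSUMPTION: both GPUs of the GTX 295 scale ideally (2x); not measured here.
best_295_both = 2 * best_295

print(f"best vs best: {speedup(best_470, best_295):.2f}x")
print(f"GTX 470 best: {best_470:.1f} GFLOP/s, "
      f"GTX 295 both GPUs (ideal): {best_295_both:.1f} GFLOP/s")
```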
I also ran the double-precision benchmark.
on GTX 470:
Windowed mode
Simulation data stored in video memory
Double precision floating point simulation
Compute 2.0 CUDA device: [GeForce GTX 470]
14336 bodies, total time for 10 iterations: 643.409 ms
= 3.194 billion interactions per second
= 95.828 double-precision GFLOP/s at 30 flops per interaction
on a single core of GTX 295:
Windowed mode
Simulation data stored in video memory
Double precision floating point simulation
Compute 1.3 CUDA device: [GeForce GTX 295]
30720 bodies, total time for 10 iterations: 5868.665 ms
= 1.608 billion interactions per second
= 48.242 double-precision GFLOP/s at 30 flops per interaction
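So in DP the GTX 470 is about twice as fast as a single GPU of the GTX 295 (95.8 vs 48.2 GFLOP/s), which means the GTX 295 with both GPUs running (about 2 x 48.2 = 96.5 GFLOP/s) has similar DP performance to the GTX 470.
I also ran the transposeNew example from SDK 3.0. Note that transposeNew in SDK 3.0 uses a different configuration than in the 2.3 or 2.2 SDKs. I only list the few results that interest me.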
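As a rough illustration, the DP results can be put next to the SDK 3.0 single-precision results in Python. Keep in mind that the benchmark counts 30 flops per interaction in DP and 20 in SP, so the GFLOP/s ratio is not the same as the ratio of interaction rates.

```python
# GFLOP/s reported by the SDK 3.0 binary in the runs above.
sp = {"GTX 470": 516.715, "GTX 295 (one GPU)": 443.600}  # single precision
dp = {"GTX 470": 95.828, "GTX 295 (one GPU)": 48.242}    # double precision


def dp_to_sp(card):
    """Fraction of the measured SP GFLOP/s that the card reaches in DP."""
    return dp[card] / sp[card]


# How much faster the GTX 470 is in DP than one GPU of the GTX 295.
per_gpu_ratio = dp["GTX 470"] / dp["GTX 295 (one GPU)"]

for card in sp:
    print(f"{card}: DP is {dp_to_sp(card) * 100:.1f}% of SP")
print(f"GTX 470 vs one GTX 295 GPU in DP: {per_gpu_ratio:.2f}x")
```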
Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16
GTX 295:
transposeNew-Outer-naive transpose , Throughput = 8.4393 GB/s, Time = 0.92573 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transposeNew-Inner-naive transpose , Throughput = 8.5637 GB/s, Time = 0.91228 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
GTX 470:
transposeNew-Outer-naive transpose , Throughput = 46.0231 GB/s, Time = 0.02387 s, Size = 147456 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transposeNew-Inner-naive transpose , Throughput = 71.4125 GB/s, Time = 0.01538 s, Size = 147456 fp32 elements, NumDevsUsed = 1, Workgroup = 256
OK, here we can see that the cache helps a lot with the naive transpose, so you don't pay a big penalty if you don't pay attention to those details.
GTX 295:
transposeNew-Outer-no bank conflict trans, Throughput = 51.2110 GB/s, Time = 0.15256 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transposeNew-Inner-no bank conflict trans, Throughput = 56.4173 GB/s, Time = 0.13848 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
GTX 470:
transposeNew-Outer-no bank conflict trans, Throughput = 73.6681 GB/s, Time = 0.01491 s, Size = 147456 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transposeNew-Inner-no bank conflict trans, Throughput = 202.6101 GB/s, Time = 0.00542 s, Size = 147456 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Here, the no-bank-conflict transpose is the fastest implementation on both GPUs. I don't know whether the inner-loop numbers matter to us. If we only look at the outer-loop case, the throughput is 73.7 GB/s for the GTX 470 vs 51.2 x 2 = 102.4 GB/s for the GTX 295 with both GPUs. I don't think the GTX 480 will surpass the GTX 295 in this test either.
So my gut feeling is that the features GF100 offers, like better DP and the cache, will help some people tackle problems that require better DP performance, and help others get decent performance without worrying too much about optimizing their code. But for an already optimized single-precision CUDA program, the GTX 295 is the best choice.
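To put a number on how much the cache helps, this Python sketch computes what fraction of the optimized (no-bank-conflict) throughput the naive kernel reaches on each card, using the outer-loop figures above. The dual-GPU figure again assumes ideal 2x scaling on the GTX 295, which is an assumption rather than a measurement.

```python
# Outer-loop throughput in GB/s, copied from the transposeNew output above.
outer = {
    "GTX 295": {"naive": 8.4393, "no_bank_conflict": 51.2110},
    "GTX 470": {"naive": 46.0231, "no_bank_conflict": 73.6681},
}


def naive_fraction(card):
    """Naive throughput as a fraction of the no-bank-conflict throughput."""
    return outer[card]["naive"] / outer[card]["no_bank_conflict"]


for card in outer:
    print(f"{card}: naive reaches {naive_fraction(card) * 100:.1f}% of optimized")

# ASSUMPTION: both GPUs of the GTX 295 scale ideally (2x); not measured here.
both_295 = 2 * outer["GTX 295"]["no_bank_conflict"]
print(f"GTX 295 both GPUs (ideal): {both_295:.1f} GB/s "
      f"vs GTX 470: {outer['GTX 470']['no_bank_conflict']:.1f} GB/s")
```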