GTX 470 vs GTX 295 benchmark using SDK examples: comparing the GTX 470 and GTX 295 with the SDK 2.2, 2.3, and 3.0 binaries

My GTX 470 arrived last week, and today I decided to run some tests. I used the nbody example from the SDK and compared the results with a GTX 295 (running on a single GPU). I post my results below. There are also some interesting results when I run the binaries from the older 2.2 and 2.3 SDKs.

  1. GTX 470 running the 3.0 SDK binary: nbody -benchmark
    Run “nbody -benchmark [-n=<numBodies>]” to measure performance.
    -fullscreen (run n-body simulation in fullscreen mode)
    -fp64 (use double precision floating point values for simulation)

Windowed mode
Simulation data stored in video memory
Single precision floating point simulation
Compute 2.0 CUDA device: [GeForce GTX 470]
14336 bodies, total time for 10 iterations: 79.549 ms
= 25.836 billion interactions per second
= 516.715 single-precision GFLOP/s at 20 flops per interaction

  2. GTX 470 running the 2.3 SDK binary: nbody -benchmark
    Run “nbody -benchmark [-n=<numBodies>]” to measure performance.

14336 bodies, total time for 100 iterations: 677.278 ms
= 30.345 billion interactions per second
= 606.903 GFLOP/s at 20 flops per interaction

  3. GTX 470 running the 2.2 SDK binary: nbody -benchmark
    Run “nbody -benchmark -n=<numBodies>” to measure performance.

14336 bodies, total time for 100 iterations: 673.859 ms
= 30.499 billion interactions per second
= 609.982 GFLOP/s at 20 flops per interaction

Interesting, since the old binaries run faster than the most recent SDK 3.0 binary. I remember someone else noted that single-precision performance is better when compiled with sm_13 than with sm_20. Maybe this is the same behavior?
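If anyone wants to check the sm_13 vs sm_20 theory, it comes down to building the same source for both targets and timing each binary, roughly like this (kernel.cu is just a placeholder here; the SDK samples themselves build through the SDK projects/makefiles):

nvcc -O3 -arch=sm_13 -o app_sm13 kernel.cu
nvcc -O3 -arch=sm_20 -o app_sm20 kernel.cu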

We have another machine with a GTX 295 installed, so we ran the tests there as well, to check how the GTX 470 compares with the GTX 295.

  4. one GPU of the GTX 295 running the 3.0 SDK binary: nbody -benchmark
    Run “nbody -benchmark [-n=<numBodies>]” to measure performance.
    -fullscreen (run n-body simulation in fullscreen mode)
    -fp64 (use double precision floating point values for simulation)

Windowed mode
Simulation data stored in video memory
Single precision floating point simulation
Compute 1.3 CUDA device: [GeForce GTX 295]
30720 bodies, total time for 10 iterations: 425.482 ms
= 22.180 billion interactions per second
= 443.600 single-precision GFLOP/s at 20 flops per interaction

  5. one GPU of the GTX 295 running the 2.3 SDK binary: nbody -benchmark
    Run “nbody -benchmark [-n=<numBodies>]” to measure performance.

30720 bodies, total time for 100 iterations: 4998.113 ms
= 18.881 billion interactions per second
= 377.630 GFLOP/s at 20 flops per interaction

  6. one GPU of the GTX 295 running the 2.2 SDK binary: nbody -benchmark
    Run “nbody -benchmark -n=<numBodies>” to measure performance.

30720 bodies, total time for 100 iterations: 4304.157 ms
= 21.926 billion interactions per second
= 438.515 GFLOP/s at 20 flops per interaction

Here things are even more interesting: the SDK 3.0 binary is almost as fast as the SDK 2.2 binary, and the SDK 2.3 binary is about 15% slower than either! I intended to check how the GTX 470 compares with the GTX 295, but I didn’t know the result depends on the binary as well. What changed between 2.2 and 2.3?

I am also disappointed to find that the GTX 470 is not much faster (10-25% at most?) than a single GPU of the GTX 295, although it has more CUDA cores (448 vs 240). So it seems that one GTX 480 can’t beat the performance of a GTX 295 (with both GPUs running), at least in single precision.

I ran the double precision benchmark,

on the GTX 470:

Windowed mode
Simulation data stored in video memory
Double precision floating point simulation
Compute 2.0 CUDA device: [GeForce GTX 470]
14336 bodies, total time for 10 iterations: 643.409 ms
= 3.194 billion interactions per second
= 95.828 double-precision GFLOP/s at 30 flops per interaction

on a single GPU of the GTX 295:

Windowed mode
Simulation data stored in video memory
Double precision floating point simulation
Compute 1.3 CUDA device: [GeForce GTX 295]
30720 bodies, total time for 10 iterations: 5868.665 ms
= 1.608 billion interactions per second
= 48.242 double-precision GFLOP/s at 30 flops per interaction

So the full GTX 295 (both GPUs: 2 x 48.2 ≈ 96.5 GFLOP/s) has DP performance similar to the GTX 470’s 95.8 GFLOP/s.

I also ran the transposeNew example from SDK 3.0. Note that SDK 3.0’s transposeNew uses a different configuration than the 2.3 or 2.2 SDKs. I only list a few results of interest.
Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

GTX 295:
transposeNew-Outer-naive transpose , Throughput = 8.4393 GB/s, Time = 0.92573 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transposeNew-Inner-naive transpose , Throughput = 8.5637 GB/s, Time = 0.91228 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

GTX 470:
transposeNew-Outer-naive transpose , Throughput = 46.0231 GB/s, Time = 0.02387 s, Size = 147456 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transposeNew-Inner-naive transpose , Throughput = 71.4125 GB/s, Time = 0.01538 s, Size = 147456 fp32 elements, NumDevsUsed = 1, Workgroup = 256

OK, here we can see that the cache helps a lot in the naive transpose, so you don’t pay a big penalty if you don’t pay attention to those details.

GTX 295:
transposeNew-Outer-no bank conflict trans, Throughput = 51.2110 GB/s, Time = 0.15256 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transposeNew-Inner-no bank conflict trans, Throughput = 56.4173 GB/s, Time = 0.13848 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

GTX 470:
transposeNew-Outer-no bank conflict trans, Throughput = 73.6681 GB/s, Time = 0.01491 s, Size = 147456 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transposeNew-Inner-no bank conflict trans, Throughput = 202.6101 GB/s, Time = 0.00542 s, Size = 147456 fp32 elements, NumDevsUsed = 1, Workgroup = 256

Here, the no-bank-conflict transpose is the fastest implementation on both GPUs. I don’t know whether the inner-loop numbers matter to us. If we only look at the outer-loop case, the throughput is the GTX 470’s 73.7 GB/s vs the GTX 295’s 51.2x2 = 102.4 GB/s. I don’t think the GTX 480 will surpass the GTX 295 in this test either.
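For anyone who hasn’t looked at the sample, the two variants differ roughly as follows (a simplified sketch in the spirit of the SDK’s transposeNew kernels, using the 16x16 tile/block configuration above; not the exact SDK code):

#define TILE_DIM 16

// naive transpose: reads are coalesced, writes stride through memory,
// which is expensive on GT200 but largely hidden by the L1/L2 cache on Fermi
__global__ void transposeNaive(float *odata, const float *idata,
                               int width, int height)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    odata[x * height + y] = idata[y * width + x];
}

// no-bank-conflict transpose: stage a tile in shared memory so both the
// global read and the global write are coalesced; the +1 padding avoids
// shared-memory bank conflicts when the tile is read back column-wise
__global__ void transposeNoBankConflicts(float *odata, const float *idata,
                                         int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = idata[y * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;   // swap block indices for the output
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
}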

So my gut feeling is that the features GF100 offers, like DP and cache, will help some people tackle problems that require better DP performance, and help others get decent performance without worrying too much about optimizing their code. But for an already-optimized single-precision CUDA program, the GTX 295 is still the best choice.

SDK samples are not real benchmarks. I can pick some benchmarks where the 470 will obliterate both GPUs of a 295 due to the additional shared memory and L2, and I could also pick some other benchmarks where it loses.

Also, it’s very wrong to compare two GPUs versus a single GPU and pretend that doubling the performance of one GPU of the 295 will get you cumulative performance. PCIe matters a lot, avoiding additional PCIe transfers matters a lot, additional synchronization to move data around matters a lot, etc. Multi-GPU programming is not necessarily trivial.

There is no denying that GF100 has lots of innovations that will benefit a lot of applications. I ran those tests because our application gets very poor performance when I test it on the GTX 470: it is about twice as slow as on a single GPU of the GTX 295. Our problem scales pretty well. We use MPI to manage the different GPUs, and we only do host-to-device and device-to-host transfers once. We don’t use shared memory at all, as the problem size is too big. We read from global memory and make sure the accesses are coalesced.
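The per-GPU setup is nothing fancier than one MPI process per device, roughly like this (a stripped-down sketch, not our actual code; it assumes the ranks on a node map directly onto the local device indices):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, deviceCount = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&deviceCount);

    // one process per GPU: rank i on a node uses device i % deviceCount
    cudaSetDevice(rank % deviceCount);

    // copy the input once, run the kernels, copy the result back once:
    // cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    // kernel<<<grid, block>>>(d_in, d_out, n);
    // cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    MPI_Finalize();
    return 0;
}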

One thing is that our application binary is compiled with CUDA 2.1 (or 2.2), since we are using Visual Studio 2003. I am going to install VS 2008 and compile it again. That is not trivial work, since our code base was configured to build under VS 2003, and I am not sure whether recompiling under CUDA 3.0 will help us much. So I tried those SDK examples first to get an idea of (1) whether recompiling makes much of a difference, and (2) what performance I should be expecting.

Actually, we expected that our application wouldn’t run faster on the GTX 480 than on the GTX 295, since its memory bandwidth is only about half of the GTX 295’s and our application is memory bound. We just hope the GTX 480 can offer performance close to the GTX 295, or that we can buy GTX 295s by the thousands from somewhere.

Certainly the GTX 295 (when you can find it) is a great deal for linearly scaling problems, and I think it will be a while before the GTX 400 series can beat it for some problems. Since it sounds like your application scales well with additional GPUs, you want to maximize memory bandwidth, and you are focused on single precision, I have a slightly crazy sounding suggestion:

Have you considered the GT240? It seems to be available in quantity, is 60% better than the GTX 295 in GB/sec/$ and is slightly better in GB/sec/W. In trade, you give up double precision, and have about 15% less GFLOPS per GB/sec. (Although, if you are already memory bound on a GTX 295, maybe it won’t matter.) If you can find motherboards with multiple NF200 PCI-Express switches on them, you might even be able to cram 8 GT240s per node and get the same PCI-E bandwidth characteristics per GPU that the GTX295 does.

Of course, you will need more nodes to achieve the same total memory bandwidth, which might eat up the 60% cost benefit. But it is worth running the numbers, and also factoring in the apparently short supply of GT200 chips.

springc, what is your block size? And are you using textures? Probably not, if you get coalesced access.

Thanks for your suggestions.

Is the GT240 really better in GB/sec/$? From Wikipedia, the GDDR5 version has 54.4 GiB/s and 385 GFLOPS, but it costs ~$100 on Amazon.

Compared with the GTX 295’s 2x111.9 = 224 GiB/s and 1788 GFLOPS at about $500-$550, I don’t see a great improvement.
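Working it out per dollar (taking ~$525 as the mid-point of that GTX 295 price range):

GT240 (GDDR5): 54.4 GiB/s / $100 ≈ 0.54 GiB/s per dollar
GTX 295: 224 GiB/s / $525 ≈ 0.43 GiB/s per dollar

so maybe 25-30% better per dollar, not 60%.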

And our problem does not run entirely on the GPU, so scaling is not perfectly linear overall. We tested it on a machine with 4 GTX 295s: while 2 cards are 50-80% faster than 1 card, 4 cards are 180-250% faster than 1 card. When running on one GTX 295, about 20% of the time is spent on the CPU and 80% on the GPU.

As for a multi-card alternative to the GTX 295, I was thinking of the GTX 275 or GTX 260. They seem to still be available (but I don’t know how long that will last). We tested the GTX 275 and 285 and believe the 275 is the best choice, since we seem to suffer from partition camping in some cases when using the 285.

Any info on whether NVIDIA will ever produce new chips for the GTX 295? And what’s the story with the 275 and 260?

Oops, no, you’re right. I was fooled by the GDDR3 prices, which are more like $85.

All of these cards use the same GPU, just with some of the multiprocessors disabled, so I think the “shortage” of GTX 295s is mostly due to manufacturers shunting their limited supplies to better-selling models, like the GTX 275. I don’t know if GT200 production is permanently stopped or just reduced, but if you are serious about making a large purchase, you should probably start talking to vendors directly and see if they can shed any light on the supply situation.

There are several pieces of work done on the GPU. One kernel has a block size of 128 and a total number of threads on the order of 65536-131072. No texture access: everything is in global memory and coalesced. The computation is very light, about a dozen arithmetic operations compared to a few memory reads and one write. We tried putting some data in texture, but it doesn’t give us better performance. I think that is because our algorithm only accesses each memory location once, so caching doesn’t help (and we make sure accesses are coalesced, so prefetching doesn’t help much either). Looking at the simple copy case in transposeNew:

GTX 470:

transposeNew-Outer-simple copy , Throughput = 75.1862 GB/s,

transposeNew-Outer-shared memory copy , Throughput = 70.8715 GB/s

GTX 295:

transposeNew-Outer-simple copy , Throughput = 87.3728 GB/s,

transposeNew-Outer-shared memory copy , Throughput = 43.0561 GB/s,

A single GPU of the GTX 295 is faster than the GTX 470, and shared memory doesn’t help in this case.
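To give a feel for the main kernel I described above (block size 128, a few coalesced reads, about a dozen arithmetic operations, one coalesced write), it is roughly of this shape; this is a made-up sketch, not our actual code:

// block size 128, grid sized so the total thread count covers all elements
__global__ void updateField(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads hit consecutive addresses (coalesced)
    if (i >= n) return;

    // a few coalesced global reads
    float x = a[i];
    float y = b[i];

    // about a dozen arithmetic operations
    float r = x * y + 0.5f * x - 0.25f * y;
    r = r * r + x;

    // one coalesced global write; each element is touched exactly once,
    // so a cache buys us essentially nothing here
    out[i] = r;
}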

Another kernel could benefit from the new architecture if we rewrite it. It reads from texture for out-of-bounds handling and interpolation. Because we want to avoid write conflicts, we have to break the problem down into small pieces, and I believe the GPU is under-utilized when crunching those small pieces. Hopefully the cache and improved atomic performance will help us in this part. But there is a fair amount of CPU work after the GPU, so we are probably looking at a 25% improvement in this part at most.
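For reference, the kind of pattern where Fermi’s atomics could replace the break-into-pieces scheme looks roughly like this (a made-up sketch, not our kernel; the interpolation and texture details are left out, and note that atomicAdd on float in global memory needs compute capability 2.0):

// scatter each input element's contribution into an output grid;
// conflicting writes are resolved with atomics instead of partitioning the work
__global__ void scatterAtomic(const float2 *pos, const float *val, float *grid,
                              int n, int gridW, int gridH)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // clamp to the grid instead of relying on texture address modes
    int x = min(max((int)pos[i].x, 0), gridW - 1);
    int y = min(max((int)pos[i].y, 0), gridH - 1);

    // requires sm_20: float atomicAdd in global memory
    atomicAdd(&grid[y * gridW + x], val[i]);
}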

I am also wondering about the 480 that arrived yesterday, under CUDA 3.0. We got only about 480 Gflops single precision from nbody, pretty much the same as my old 285, and MonteCarloMultiGPU came up with about 118000 options per second, against just over 100000 for a single 285. So on those first two the improvement is either incremental or non-existent. BlackScholes was actually WORSE!

This was all on default settings - any ideas as to how to tweak to get more?

What number of bodies are you using? Try increasing it.
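For example, something like:

nbody -benchmark -n=100000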

I stuck my 480 in a Mac Pro with an external PSU and got better results. This was under XP32 running under Boot Camp on a 2008 Mac Pro host, with a Mac edition 285 alongside the 480.

With the 3.0 SDK binaries, the 480 came in at about 560 SP Gflops and 117 DP under nbody, vs about 495 and 57 for the 285. I have not tried the 2.3 binaries though. Under MonteCarloMultiGPU with BOTH cards up, the 480 turned in about 118000 options per second and the 285 about 83000.

Interesting observations about the 2.3 binaries seeming to be better!

Fingers crossed for native OS X drivers…

On my 480 I get 800 Gflops in the nbody example, though to do so I had to increase the number of particles to 100k. This, I think, nicely demonstrates the potential power of Fermi. Since 800 Gflops is attainable in nbody, it makes me wonder why the SGEMM performance is so poor, though (even with the improvements brought by CUBLAS 3.1).

One thing to be aware of - we changed some of the compiler defaults for single precision arithmetic on sm_20 (Fermi) targets. It now enables denormal support and IEEE-compliant reciprocal, division and square root functions by default, which give better precision, but can run slower.

Try adding these flags to your NVCC command line for a fairer comparison:

-ftz=true -prec-div=false -prec-sqrt=false

For details see:

NVIDIA CUDA Programming Guide, sections 5.4.1, G.2

The CUDA Compiler Driver NVCC, pg. 14-15
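In other words, something along these lines when rebuilding for sm_20 (mykernel.cu is just a placeholder for your own source):

nvcc -arch=sm_20 -ftz=true -prec-div=false -prec-sqrt=false -o myapp mykernel.cu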

Strange,

are fma.rn.ftz.f32 and rcp.approx.ftz.f32 performed at full speed?
I thought ftz mode was enabled by default to be fully compatible with previous programs. I checked the assembler output for the sm_20 target and saw a lot of ftz.

Thanks to Lev and Simon for their suggestions. It is now clear that the default number of bodies in nbody is misleading about the 480’s capability. I tried various multiples of 20000, and above 100000 bodies the 480 punches through 700 Gflops; I have had it hit 720 now. This is without Simon’s recompilation suggestion as well.

Interesting that at some point above 100000 bodies the 285 stops running the code but the 480 just keeps on pumping it out.

Getting more impressed now! Will look at compiler flags once I have reloaded the compiler - having to do a total rebuild after a virus invasion.


Can somebody with Fermi on Windows run one short cache-intensive benchmark?