Comparing C1060, GTX470, GTX480 and C2050: benchmark results for the Fermi cards and the Tesla generation

Hi

I’ve been testing a number of different GPUs recently with LAMMPScuda, an MD code I am developing (available here: http://code.google.com/p/gpulammps/ and www.tu-ilmenau.de/lammpscuda).

The code makes relatively heavy use of the texture cache and uses the CPU almost not at all. While one of the main focuses of the code is to scale well on GPU clusters, I have used only a single GPU here, since I wanted to compare the performance of the GPUs themselves.
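
To illustrate what "relatively heavy use of the texture cache" means here, below is a minimal sketch of the kind of gather an MD force kernel does. The names (pos_tex, gather_positions) and the float4 position layout are illustrative assumptions, not the actual LAMMPScuda code.

    // Sketch only: positions are read through the texture cache via tex1Dfetch,
    // which helps with the semi-random neighbor-list accesses typical of MD.
    texture<float4, 1, cudaReadModeElementType> pos_tex;

    __global__ void gather_positions(float4* out, const int* neighbors, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(pos_tex, neighbors[i]);   // cached gather
    }

    // Host side, once per timestep (or after reneighboring):
    // cudaBindTexture(0, pos_tex, d_pos, n_atoms * sizeof(float4));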

As a comparison I ran the CPU version of LAMMPS (which uses the exact same algorithms) on a conventional node with two quad-core Nehalems (X5550 @ 2.66 GHz).

I have tested three different systems:

lj-melt: lowest amount of computation per memory access

(for anyone familiar with MD: it's a plain LJ system with a 2.5 cutoff, 0.84 density, 850k atoms)

silicate/long: half of the time is spent on a 3D FFT (using cuFFT; see the sketch after this list), and the rest is much more compute-intensive than lj-melt

(lithium silicate glass, Buckingham potential + long-range Coulomb via PPPM, ~12k atoms)

silicate/cut: also more compute-intensive than lj-melt, but no FFT

(lithium silicate glass, Buckingham potential + cutoff Coulomb (10 Å), ~100k atoms)
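
Since half of the silicate/long time goes into that 3D FFT, here is roughly the cuFFT call involved. This is only a sketch: the mesh dimensions and the function name are illustrative, and the real PPPM code creates its plans once and reuses them every timestep instead of rebuilding them per call as shown here.

    #include <cufft.h>

    // Forward 3D FFT of the charge mesh, in place.
    void forward_fft(cufftComplex* d_mesh, int nx, int ny, int nz)
    {
        cufftHandle plan;
        cufftPlan3d(&plan, nx, ny, nz, CUFFT_C2C);           // complex-to-complex 3D plan
        cufftExecC2C(plan, d_mesh, d_mesh, CUFFT_FORWARD);   // transform the mesh in place
        cufftDestroy(plan);
    }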

Single Precision (all numbers: lower is better)

System          8xCPU    C1060   GTX470   GTX480    C2050    C2050+ECC
lj-melt           293      114      143      116      131          155
silicate/long     212     63.6     37.2     31.7     38.4         41.4
silicate/cut      580      123     84.2     69.8     88.9         91.5

Double Precision

System          8xCPU    C1060   GTX470   GTX480    C2050    C2050+ECC
lj-melt           293      237      183      152      167          206
silicate/long     212      200     80.9     67.4     80.5         94.0
silicate/cut      580      536      285      221      260          353

As you can see, in the first example the Fermi cards are held back by insufficient texture throughput, so the C1060 can actually beat the Fermi GPUs in single precision. In the other examples, which are much less dominated by texture reads, the Fermi GPUs are significantly faster than the C1060. More or less as expected, the GTX470 is about as fast as the C2050, since both have the same number of cores (in the texture-heavy case the C2050 is better; am I remembering correctly that it has one more texture unit than the GTX470?). The GTX480 is roughly 20% faster than a GTX470.

It is interesting to see that, while the Fermi cards are generally much better in double precision than the C1060 (more so than in single precision), the C2050 cannot show off its much higher double-precision throughput compared to the GeForce GPUs.

But anyway, it's nice to see that the Fermi GPUs beat a full node of modern Intel CPUs by a factor of 2-3 even in double precision.

I thought that these numbers might be interesting for you.

Cheers

Ceearem

Why someone would pay over 4x the price for something with relatively negligible performance gains is beyond me. But as long as NVIDIA can sell it, I suppose it's what keeps the desktop cards cheap.

Would you leave a GTX480 number crunching 7 days a week?

Yes, but I can tolerate the downtime from the very occasional hardware failure. I've lost 2 GeForce cards out of nearly a dozen CUDA workhorses over the past 3 years, and they weren't even the most heavily used ones. That failure fraction would have to be reversed for a Tesla to become a cost-effective option in my case.

If CUDA were field-analyzing the telemetry data from a $1M/day test drill operation, then I would be happy to pay for the extra quality assurance. Of course, I might be further ahead spending the cash on a redundant system tolerant of device failure, regardless of how much quality assurance the device got. Everyone is going to do that [risk * cost of failure] calculation differently depending on their situation.

(That said, ECC is a very nice feature of the Tesla, and I think it sells the card more than any burn-in testing NVIDIA does. A smart architecture can cope gracefully with visibly failed devices, but silent corruption is very hard to catch unless you are comparing results from duplicate jobs or doing other consistency checks.)
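
To make the duplicate-job idea concrete, here is a minimal sketch of such a consistency check, assuming a deterministic code whose results are bit-reproducible between runs; the function name and the plain array-of-doubles layout are just illustrative, not any particular code's API.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // FNV-1a hash over the raw bytes of a result array.
    uint64_t checksum(const std::vector<double>& results)
    {
        uint64_t h = 1469598103934665603ULL;
        const uint8_t* bytes = reinterpret_cast<const uint8_t*>(results.data());
        for (size_t i = 0; i < results.size() * sizeof(double); ++i) {
            h ^= bytes[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    // After two identical runs of the same job:
    // if (checksum(run_a) != checksum(run_b))
    //     fprintf(stderr, "results differ: possible silent corruption\n");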

Actually, it’s the teens and 20-somethings playing Call of Duty, Bad Company, and whatever other FPSs are hot this quarter that keep the desktop cards cheap (and drive the market that CUDA cards are a

small offshoot of.)

Regards,

Martin

Actually, it’s the teens and 20-somethings playing Call of Duty, Bad Company, and whatever other FPSs are hot this quarter that keep the desktop cards cheap (and drive the market that CUDA cards are a

small offshoot of.)

Regards,

Martin