Double precision on GTX 280 and the upcoming Tesla S1070

The GTX 280 has a peak of about 933 GFLOPS, but its double-precision performance is only about 90 GFLOPS, roughly 1/10 of peak. These are not my numbers; they are mostly from . I got about 40 GFLOPS on double precision.

Does anybody know the hardware reason behind this? And what about DP support on the upcoming Tesla S1070?

GT200 has 30 multiprocessors, each with 8 SP units, 1 DP unit, and 1 special function unit. Both the SP and DP units can execute a MAD instruction (2 flops).

In single precision, your peak performance is 3 × 240 × clock (for the GTX 280 the shader clock is 1.296 GHz, for a total of 933 GFLOPS), where the 3 is 2 flops from the SP MAD plus 1 flop from the special function unit.

In double precision, your peak is 2 × 30 × clock, since each multiprocessor has a single DP unit issuing one MAD (2 flops) per clock; on the GTX 280 that works out to about 78 GFLOPS.

DGEMM in CUBLAS 2.0 will run very close to peak (almost 74 GFLOPS on a 1.3 GHz card).

Tesla S1070 will have a clock of ~1.45 GHz, for a peak DP performance close to 87 GFLOPS.

Thanks very much! That makes complete sense. And CUBLAS 2.0 now coming with DP support is great!


Can you tell us how this DP performance compares against a CPU's DP performance?


I am not @mfatica, however, I can post some information about 8 core AMD and 8 core Intel performance:

  1. Linear memory copy: AMD 8 GB/s, Intel 3.7 GB/s
  2. Random memory copy: AMD 210 MB/s, Intel 650 MB/s
  3. DGEMM: AMD 65 GFlop/s, Intel 81 GFlop/s
  4. Cholesky: AMD 59 GFlop/s, Intel 77 GFlop/s



Thanks for posting! The 8-core Intel CPU looks to be as good as a GPU as far as DP is concerned!

Hmm… that's bad news!

For certain operations, like DGEMM, you can combine CPU and GPU performances.

This plot shows the results of DGEMM on a quad-core Xeon with Intel MKL 10.0.3, on a Tesla C1060, and on both of them at the same time (I wrote a small library, still in alpha, that will be released later on).

Thanks for the graph! It looks quite interesting! So I presume the DGEMM ran on all 4 cores (OpenMP or something similar). Is that right?

Now, as I understand from your earlier post, the DP hardware in the GPU is very limited, and hence its performance is almost equal to that of a CPU. Is that right?


May I ask what's included in those measurements? For the Xeon it's clearly plain execution time. For the Tesla I guess it's plain execution time too, as in my measurements it stays a little below 70 GFLOPS including the copies. However, for the combined method only a measurement including the copies makes sense. But which way? Was all the data originally on the CPU or on the GPU?

The data is all in CPU memory and the execution time includes the overhead of copying the data to/from the GPU.
With this library, you just make a regular DGEMM call.

So, I presume the CPU code runs on all 4 cores (OpenMP or similar)… Is that right?

The CPU portion is still handled by the host library (in this case MKL), and it uses all the available cores.