GFLOPS of Intel Xeon E5-2609

I am working on a machine with eight Intel Xeon E5-2609 CPUs. How can I find out the GFLOPS of my CPUs? I want to compare it with the GFLOPS of a Tesla C2070.

Thanks a lot

The Xeon E5-2609 is a Sandy Bridge-class CPU in which each core can execute one AVX multiply plus one AVX add per cycle. An AVX SIMD operation comprises four double-precision or eight single-precision lanes. Therefore, when running AVX code, the theoretical maximum floating-point throughput per CPU is 4 [cores] * 2.4e9 [Hz] * (4+4) [floating-point ops per cycle per core] = 76.8 double-precision GFLOPS. Single-precision performance is twice that, namely 153.6 GFLOPS.

Since you have an eight-processor machine, the combined theoretical throughput of all eight CPUs is 614.4 double-precision GFLOPS, or 1.229 single-precision TFLOPS. By comparison, the theoretical throughput of the C2070 is 515 double-precision GFLOPS, or 1.03 single-precision TFLOPS, which is about 84% of the combined CPU throughput.
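The arithmetic above can be reproduced with a short script. This is only a sketch of the peak-throughput formula used in this post; the function name `peak_gflops` and the constant names are made up for illustration, and the result is a theoretical ceiling, not a measurement.

```python
def peak_gflops(cores, clock_hz, simd_lanes, ops_per_lane_per_cycle=2):
    """Theoretical peak in GFLOPS.

    ops_per_lane_per_cycle=2 models one multiply plus one add per cycle
    in every SIMD lane, as assumed in the post above.
    """
    return cores * clock_hz * simd_lanes * ops_per_lane_per_cycle / 1e9


CORES_PER_CPU = 4    # Xeon E5-2609
CLOCK_HZ = 2.4e9     # 2.4 GHz
NUM_CPUS = 8         # CPUs in the machine, as stated in the question

dp_per_cpu = peak_gflops(CORES_PER_CPU, CLOCK_HZ, simd_lanes=4)  # double precision
sp_per_cpu = peak_gflops(CORES_PER_CPU, CLOCK_HZ, simd_lanes=8)  # single precision
dp_system = NUM_CPUS * dp_per_cpu
sp_system = NUM_CPUS * sp_per_cpu

C2070_DP_GFLOPS = 515.0  # theoretical figure quoted in the post above
ratio = C2070_DP_GFLOPS / dp_system

print(f"per CPU: {dp_per_cpu:.1f} DP GFLOPS, {sp_per_cpu:.1f} SP GFLOPS")
print(f"system : {dp_system:.1f} DP GFLOPS, {sp_system:.1f} SP GFLOPS")
print(f"C2070 / system (DP): {ratio:.0%}")
```

Changing `NUM_CPUS`, the clock, or the lane count lets you redo the estimate for a different machine.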

Eight Xeon E5-2609 CPUs at $300 per CPU = $2,400 for 1,228 single-precision GFLOPS

A single $550 EVGA GTX 980 clocks in at about 5,400 single-precision GFLOPS.

cost per GFLOP for CPU set = $1.95 ($2,400 / 1,228 GFLOPS)

cost per GFLOP for GPU = $0.102
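A quick way to check the cost comparison is to compute both directions of the ratio, since dollars per GFLOP and GFLOPS per dollar are easy to mix up. The sketch below uses only the prices and throughput figures quoted in this thread; the helper name `dollars_per_gflop` is made up for illustration.

```python
def dollars_per_gflop(price_usd, gflops):
    """Cost in dollars of one GFLOPS of theoretical peak throughput."""
    return price_usd / gflops


# Figures quoted in the thread (single precision, theoretical peak).
cpu_price = 8 * 300      # eight CPUs at $300 each
cpu_gflops = 1228.8      # 8 * 153.6 GFLOPS
gpu_price = 550          # one GTX 980
gpu_gflops = 5400.0

cpu_cost = dollars_per_gflop(cpu_price, cpu_gflops)
gpu_cost = dollars_per_gflop(gpu_price, gpu_gflops)

print(f"CPU set: ${cpu_cost:.2f} per GFLOPS ({cpu_gflops / cpu_price:.2f} GFLOPS per dollar)")
print(f"GPU    : ${gpu_cost:.3f} per GFLOPS ({gpu_gflops / gpu_price:.2f} GFLOPS per dollar)")
print(f"CPU set costs about {cpu_cost / gpu_cost:.0f}x more per GFLOPS")
```

With these inputs the CPU set comes out near $1.95 per GFLOPS, or about 0.51 GFLOPS per dollar, against roughly $0.102 per GFLOPS for the GPU.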

Also, the memory bandwidth of a current GPU (the Tesla C2070 is at least 4 years old) is higher than that of a CPU by a factor of 5 to 15, so that should be considered in any valid comparison.

Does this calculation of CPU GFLOPS apply to a single-threaded program? Or does AVX have to be activated somehow?

Since my computation includes the contributions from all CPUs and all cores within each CPU, it quite obviously does not pertain to single-thread execution. Since I assumed usage of all AVX lanes, it obviously also does not pertain to scalar execution. Utilizing the full floating-point performance of your system requires multi-threaded, SIMDized computation: 32 threads, each using 4-way or 8-way SIMD computation.

Note that threading and SIMD parallelization are two forms of parallelism (namely thread parallelism and data parallelism) that are orthogonal to each other: One can write an application that is multi-threaded, but where each thread only performs scalar computation using a single AVX lane. Likewise, one can write a single-threaded application that uses SIMDized code using all AVX lanes.
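To put rough numbers on the four combinations, here is a small sketch of the theoretical double-precision ceiling for each one. It assumes one thread per core, one multiply plus one add per cycle per lane, and the 2.4 GHz clock from the calculation earlier in the thread; the names are made up for illustration and real code will land below these ceilings.

```python
CLOCK_HZ = 2.4e9     # 2.4 GHz
OPS_PER_CYCLE = 2    # one multiply + one add per cycle, per lane
TOTAL_CORES = 32     # 8 CPUs * 4 cores, one thread per core (assumption)
DP_LANES = 4         # double-precision lanes in one AVX operation


def peak_gflops(threads, lanes):
    """Theoretical peak for a given thread count and SIMD lanes used per thread."""
    return threads * CLOCK_HZ * lanes * OPS_PER_CYCLE / 1e9


# Thread parallelism and data parallelism vary independently of each other.
cases = {
    "single-threaded, scalar": peak_gflops(1, 1),
    "single-threaded, AVX": peak_gflops(1, DP_LANES),
    "multi-threaded, scalar": peak_gflops(TOTAL_CORES, 1),
    "multi-threaded, AVX": peak_gflops(TOTAL_CORES, DP_LANES),
}

for name, gflops in cases.items():
    print(f"{name:<24} {gflops:7.1f} DP GFLOPS")
```

Under these assumptions a single-threaded scalar program can reach at most 1/128 of the system's double-precision peak, which is why both forms of parallelism are needed to get near the 614.4 GFLOPS figure.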