GPUs vs FPGAs/DSPs: GFLOPS/watt & Latency

(A starting comment: I am talking about floating-point operations only, not fixed point!)

So we have GPUs running at 11-12 GFLOPS/watt (e.g. the GTX 560M), which is really quite efficient. The next generation could be expected to have some low/mid-end cards reaching 20 GFLOPS/watt. These are very impressive numbers, likely on the same level as modern FPGAs. Any expert comments on this?
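
For reference, the back-of-envelope math behind such a theoretical number (all part parameters below are made-up placeholders, not the specs of any particular card):

    // Back-of-envelope theoretical GFLOPS/watt. The part parameters below are
    // illustrative placeholders, not datasheet values for any specific GPU.
    #include <stdio.h>

    int main(void)
    {
        double cores = 192.0;  // assumed shader/CUDA core count
        double ghz   = 1.5;    // assumed shader clock in GHz
        double tdp_w = 75.0;   // assumed board TDP in watts

        double peak_gflops = 2.0 * cores * ghz;  // FMA counts as 2 flops/cycle/core
        printf("peak %.0f GFLOPS -> %.1f GFLOPS/watt\n",
               peak_gflops, peak_gflops / tdp_w);
        return 0;
    }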

Another interesting question is that of latency, i.e. how long we have to wait until a computation is finished. My guess here is that the GPU would prevail for the larger, meatier problems, while the FPGA excels at the tiniest of problems.

I would love to have an expert-level discussion on this topic, because the embedded world doesn't seem too keen on releasing GFLOPS/watt numbers for FPGAs solving different kinds of problems (FFTs, matrix multiplication, etc.) at sizes that are even relevant for the GPU. Example: Intel® FPGAs and Programmable Devices-Intel® FPGA is good, but not informative enough to make good comparisons.

Here’s an interesting comparison.
For a limited problem of straightforward n-body computation, a 10-FPGA system was faster than a single GPU, about 2:1. It was also more energy efficient than the GPU (stated in the conclusion, with no further details). However, for such a simple problem, the GPU code was developed and deployed in one day; the FPGA version took over a month.
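
For context, "straightforward n-body" here means the all-pairs O(N^2) force computation. A minimal CUDA kernel in that spirit (a simplified sketch only; the actual SDK sample adds shared-memory tiling and unrolling) would look roughly like this:

    // Minimal all-pairs O(N^2) n-body acceleration kernel in single precision,
    // in the spirit of the well-known CUDA SDK n-body sample. Simplified sketch:
    // no shared-memory tiling, no loop unrolling.
    #include <cuda_runtime.h>

    __global__ void nbody_accel(const float4 *pos,   // xyz = position, w = mass
                                float3 *acc, int n, float softening2)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float4 pi = pos[i];
        float3 ai = make_float3(0.0f, 0.0f, 0.0f);

        for (int j = 0; j < n; ++j) {
            float4 pj = pos[j];
            float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
            float d2 = dx * dx + dy * dy + dz * dz + softening2;
            float inv_d = rsqrtf(d2);
            float s = pj.w * inv_d * inv_d * inv_d;  // m_j / d^3
            ai.x += dx * s;
            ai.y += dy * s;
            ai.z += dz * s;
        }
        acc[i] = ai;  // acceleration; integrate velocities/positions elsewhere
    }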

Another data point. Tsubame, the biggest Japanese supercomputer, was built in 2006 to use FPGA coprocessors. It was #29 in the world.
In 2008 they removed the FPGAs and installed Tesla GPUs. It uses the same amount of power as its predecessor, but has a 15x throughput improvement for the problems it runs. It’s now #7 in the world.

Tsubame never used FPGAs. They had ClearSpeed cards.

I posted a report on the talk in 2008 at the NVISION conference when they were about to switch over from ClearSpeed boards to Teslas.

This 2007 survey article in Parallel Computing 33 mentions the older Tsubame had 360 ClearSpeed boards, but there were FPGAs used as well, “improving performance by 25%”. Not much detail on how they were used, but likely they were extremely app-dependent (as are all uses of FPGAs).

I think M. Fatica might have been involved in both the ClearSpeed and Tesla phases of Tsubame, so if he says there were no FPGAs, I would believe him.

“The performance of this design is not as good as the highly optimized version of N-body simulation in the CUDA SDK.”

Also interesting (as you mentioned):

"The largest drawback

of FPGA design is the difficulty of implementation. In con-

trast to the seconds of compile time for CUDA and the less

than 1 day development time of the GPU program, the 5

hours place and route time and over a month development

time of the FPGA seems unattractive."

I think it is safe to conclude that on COTS performance/$ and development cost/$, the GPU is a clear winner. I’m also quite convinced that it is a winner in raw performance.

I think this restriction helps the GPU case quite a bit. Implementing vaguely IEEE-compliant single or double precision floating point tends to use quite a lot of gates on an FPGA. The FPGA regains some advantage if you are willing to think in terms of variable precision and specify the number of bits required for each operation (which in general can be less than the standard 32 or 64 bits used in IEEE floating point). The VFLOAT library provides tools to generate these operations:

http://www.ece.neu.edu/groups/rcl/projects/floatingpoint/index.html

Of course, getting maximum benefit from this approach requires rigorous, quantitative error analysis of your calculation, which most people are unwilling to do.
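
If you want a cheap first estimate of how many bits your algorithm really needs, you can emulate reduced mantissa widths in ordinary software before touching the FPGA tools. A rough sketch of the idea (just an illustration, not VFLOAT itself):

    // Illustrative sketch only (not VFLOAT): round results to a chosen number of
    // mantissa bits to see how much precision an algorithm actually needs before
    // committing to a custom FPGA operator.
    #include <math.h>
    #include <stdio.h>

    // Round x to 'bits' mantissa bits (bits <= 24 for single precision).
    static float round_to_bits(float x, int bits)
    {
        int e;
        float m = frexpf(x, &e);          // x = m * 2^e with 0.5 <= |m| < 1
        float scale = ldexpf(1.0f, bits);
        m = roundf(m * scale) / scale;    // keep only 'bits' bits of the mantissa
        return ldexpf(m, e);
    }

    int main(void)
    {
        float a = 3.14159265f, b = 2.71828183f;
        for (int bits = 8; bits <= 24; bits += 4)
            printf("%2d-bit mantissa: a*b = %.7f\n", bits, round_to_bits(a * b, bits));
        return 0;
    }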

I’m not an expert here at all, but from my own experience, floating-point operations in FPGAs not only need a lot of gates, they also run at much lower throughput than the custom-designed FPUs on GPUs. So you can have a few slow (but maybe flexible) FPUs in an FPGA versus loads of fast FPUs in a GPU. I wonder how FPGAs can even remotely be competitive?
Unless maybe you have really nonstandard operations that you can implement directly in the FPGA.
Or you have a problem that does not need any off-chip memory, so you can avoid all the power for DRAM and the memory interface.

I definitely agree… except that our library is better:

http://flopoco.gforge.inria.fr/

Custom-precision floating-point is the first step. Then you can use fixed-point long accumulators, fused operators, transcendental functions, or even exotic number systems to get a much more efficient implementation.

Floating point is a one-size-fits-all system that is well suited for CPUs, when the same hardware has to support lots of different applications. But there are really few applications, if any, that actually need IEEE-754 floating point.

In the long term, what we need are more tools to assist the programmer in doing the error analysis you describe. We are not there yet…
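
To make the long-accumulator idea concrete, here is a toy software illustration. A real long accumulator of the kind mentioned above covers the whole exponent range with a much wider register; this sketch assumes the input range is known in advance, so a 64-bit accumulator suffices:

    // Toy illustration of the fixed-point long-accumulator idea. Assumes the
    // inputs are known to satisfy |x| < 2^31, so 32 fraction bits in an int64
    // are enough; a real wide accumulator needs no such assumption.
    #include <stdint.h>
    #include <stdio.h>

    #define FRAC_BITS 32

    int main(void)
    {
        float data[] = { 1.0e8f, 3.14f, -1.0e8f, 2.71f };
        float   fsum = 0.0f;
        int64_t acc  = 0;

        for (int i = 0; i < 4; ++i) {
            fsum += data[i];                                          // plain float sum
            acc  += (int64_t)((double)data[i] * (1ull << FRAC_BITS)); // fixed-point accumulate (exact here)
        }
        // The float sum absorbs the small terms next to 1e8; the accumulator keeps them.
        printf("float sum:       %f\n", fsum);
        printf("fixed-point sum: %f\n", (double)acc / (1ull << FRAC_BITS));
        return 0;
    }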

DRAM is a good point: GPUs have a huge advantage here.

Floating-point itself is not that hugely expensive to implement on an FPGA. All modern FPGAs have hardwired integer/fixed-point multipliers, which run just as fast as in any other comparable VLSI multiplier.

To do a floating-point multiplication, you basically just have to multiply mantissas and sum exponents together, so it adds little overhead. (Actually, there is some overhead, because the size of the hardware multipliers was chosen with fixed-point signal processing applications in mind, so it is typically 18x18 rather than the 24x24 or 53x53 you need.)
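
To make the decomposition concrete, here is a quick software sketch of a single-precision multiply (positive normal inputs only, truncation instead of proper rounding, purely to show where the work goes):

    // Sketch of the decomposition described above: an FP multiply is an integer
    // multiply of the (hidden-bit) mantissas plus an add of the exponents, then
    // renormalization. Positive normal inputs only; no proper rounding.
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        float a = 1.5f, b = 2.5f;
        uint32_t ua, ub;
        memcpy(&ua, &a, sizeof ua);
        memcpy(&ub, &b, sizeof ub);

        uint64_t ma = (ua & 0x7FFFFF) | 0x800000;   // 24-bit mantissa, hidden 1 restored
        uint64_t mb = (ub & 0x7FFFFF) | 0x800000;
        int      ea = (int)((ua >> 23) & 0xFF) - 127;
        int      eb = (int)((ub >> 23) & 0xFF) - 127;

        uint64_t m = ma * mb;                        // the 24x24 multiply (the costly part)
        int      e = ea + eb;                        // exponents simply add
        if (m >> 47) { m >>= 1; e += 1; }            // renormalize back into [1,2)

        double result = ldexp((double)m / (double)(1ull << 46), e);
        printf("%g * %g = %g (hardware says %g)\n", a, b, result, (double)(a * b));
        return 0;
    }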

FP addition is more expensive than FP multiplication, as there are no hardwired shifters or leading-one detectors in today’s FPGAs.

Also, when you take only power into account, FPGAs have an advantage in dynamic power. The routing matrix is configured statically and does not switch during the computation, whereas a CPU/GPU has to fetch, decode and process instructions, and move data back and forth between the register file and the processing units.

When it comes to power utilization on GPUs, I’ve noticed that many papers have used inefficient high-end GPUs with lower theoretical GFLOPS/watt numbers. These can vary anywhere from 3 to 12 GFLOPS/watt.

Another important issue is that people don’t measure actual power consumption under load, which, as SPWorley showed us (and as Sylvain’s paper displayed), is often far from peak, more like 50-65%. Hence GFLOPS_effective / watt_effective or bandwidth_eff / watt_eff is a very interesting number to see.
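
A minimal sketch of how such an effective figure would be computed, assuming you have your own flop count, timing, and a wall-power reading (all numbers below are placeholders, not published measurements):

    // Effective GFLOPS/watt from measured quantities. All values are placeholders
    // to be replaced with your own flop count, timing and power readings.
    #include <stdio.h>

    int main(void)
    {
        double flops_executed = 2.0e12;  // flops actually executed by the run
        double seconds        = 10.0;    // measured wall-clock time
        double watts_load     = 140.0;   // measured power draw under load

        double gflops_eff = flops_executed / seconds / 1e9;
        printf("%.1f GFLOPS_eff, %.2f GFLOPS_eff/watt_eff\n",
               gflops_eff, gflops_eff / watts_load);
        return 0;
    }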

So would you say that as soon as the problem size goes outside of what fits on-chip, the FPGAs lose their advantage drastically? What type of RAM do they use, and what kind of bandwidth? I’m guessing it’s nowhere near that of the GPU, and I also don’t think there is the same type of context switching to hide off-chip latencies?

High-end FPGA dev boards typically have one or two 64-bit SO-DIMM sockets accepting DDR2-800 memory. This achieves at best 12.8 GB/s (two sockets × 800 MT/s × 8 bytes).

With a custom board and one of the biggest FPGAs, you might be able to double or triple that bandwidth. That would still be tiny compared to the 190 GB/s a GeForce GTX 580 can give you for 1/20th of the price…

However, that changes a bit when you scale things up. You can plug many FPGAs together with a fast interconnect (tens of GB/s). On the other hand, GPUs only have one PCIe interface and an anecdotal SLI link for such communication.

Latency hiding is not a problem. You just have to manage it manually. For regular applications that just stream data, this is trivially done with deep pipelining.

Aha, so FPGAs might have a slight advantage when it comes to communication in a multi-FPGA setup, though that’s not a common bottleneck.

PCIe 3.0 promises 16 GB/s which should alleviate things.

You’ll need multiple FPGAs to achieve the same floating-point throughput as a single GPU though, so I think it is important that they can be closely coupled.

I’m very annoyed by the FPGA manufacturers. I can find figures for peak floating-point performance on the top-of-the-line Xilinx Virtex-5 stated at 192 GFLOPS, but finding the peak power consumption seems impossible. They just write “we have great performance per watt”. Show me the numbers!! :D

Actually, they do. They specify it in something like microwatts per megahertz per DSP48. Since different implementations may use different numbers of DSP48s per FPU, they leave it to the designer to calculate it for their specific implementation.
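
So the estimate a designer would do from such a datasheet coefficient looks roughly like this (every number below is a hypothetical placeholder, and it only covers DSP48 dynamic power, not logic, routing, I/O or static power):

    // Rough FPGA power estimate from a "uW per MHz per DSP48"-style coefficient.
    // All numbers are hypothetical placeholders, not taken from any datasheet,
    // and only DSP48 dynamic power is counted.
    #include <stdio.h>

    int main(void)
    {
        double uw_per_mhz_per_dsp = 20.0;   // assumed datasheet coefficient
        double clock_mhz          = 400.0;  // assumed design clock
        double dsp48_used         = 512.0;  // assumed DSP48s in the design
        double dsp48_per_fpu      = 4.0;    // assumed DSP48s per floating-point unit

        double watts  = uw_per_mhz_per_dsp * clock_mhz * dsp48_used * 1e-6;
        double fpus   = dsp48_used / dsp48_per_fpu;
        double gflops = fpus * clock_mhz * 1e-3;   // one result per cycle per FPU
        printf("~%.1f W dynamic (DSP only), ~%.0f GFLOPS, ~%.1f GFLOPS/watt\n",
               watts, gflops, gflops / watts);
        return 0;
    }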

Old discussion, but a really interesting subject. What kind of tools can be used to measure power consumption on a GPU/FPGA? I know it is possible to use a wattmeter plugged into the PSU, but maybe there are some more fine-grained tools?