(A note up front: I am talking about floating-point operations only, not fixed point!)
So we have GPUs running at 11-12 GFLOPS/watt (e.g. the GTX 560M), which is really quite efficient. The next generation could be expected to have some low/mid-end cards reaching 20 GFLOPS/watt. These are very impressive numbers, likely on the same level as modern FPGAs. Any expert comments on this?
Another interesting question is that of latency, i.e. how long we must wait until a computation is finished. My guess here is that the GPU would prevail on the larger, meatier problems, while the FPGA excels at being effective on the tiniest ones.
I would love to have an expert-level discussion on this topic, because the embedded world doesn't seem to be too keen on releasing GFLOPS/watt numbers for FPGAs solving different kinds of problems (FFTs, matrix multiplication, etc.) at sizes that are even relevant for the GPU. E.g. http://www.altera.com/products/ip/dsp/arithmetic/m-alt-float-point.html is good, but not informative enough to make solid comparisons.
Here’s an interesting comparison.
For a limited problem of straightforward n-body computation, a 10-FPGA system was faster than a single GPU, by about 2:1. It was also more energy efficient than the GPU (stated in the conclusion, with no further details). However, for such a simple problem, the GPU code was developed and deployed in one day; the FPGA version took over a month.
This 2007 survey article in Parallel Computing 33 mentions that the older Tsubame had 360 ClearSpeed boards, but FPGAs were used as well, "improving performance by 25%". There is not much detail on how they were used, but they were likely extremely app-dependent (as are all uses of FPGAs).
I think this restriction helps the GPU case quite a bit. Implementing even vaguely IEEE-compliant single- or double-precision floating point tends to use quite a lot of gates on an FPGA. The FPGA regains some advantage if you are willing to think in terms of variable precision and specify the number of bits required for each operation (which in general can be less than the standard 32 or 64 bits used in IEEE floating point). The VFLOAT library provides tools to generate these operations.
I’m not an expert here at all, but in my own experience floating-point operations in FPGAs not only need a lot of gates, but also run at much lower throughput than the custom-designed FPUs on GPUs. So you get a few slow (but maybe flexible) FPUs in an FPGA versus loads of fast FPUs in a GPU. I wonder how FPGAs can even remotely be competitive?
Unless maybe you have really nonstandard operations that you can implement directly in the FPGA.
Or you have a problem that does not need any off-chip memory, so you can avoid all power for DRAMs and memory interface.
Custom-precision floating-point is the first step. Then you can use fixed-point long accumulators, fused operators, transcendental functions, or even exotic number systems to get a much more efficient implementation.
Floating point is a one-size-fits-all system that is well suited for CPUs, when the same hardware has to support lots of different applications. But there are really few applications, if any, that actually need IEEE-754 floating point.
In the long term, what we need are more tools to assist the programmer in doing the error analysis you describe. We are not there yet…
DRAM is a good point: GPUs have a huge advantage here.
Floating-point itself is not that hugely expensive to implement on an FPGA. All modern FPGAs have hardwired integer/fixed-point multipliers, which run just as fast as in any other comparable VLSI multiplier.
To do a floating-point multiplication, you basically just have to multiply the mantissas and add the exponents, so it adds little overhead. (Actually, there is some overhead, because the size of the hardware multipliers was chosen with fixed-point signal-processing applications in mind, so it is typically 18x18 rather than the 24x24 or 53x53 you need.)
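As a rough sketch of why FP multiplication maps so well onto hardwired integer multipliers, here is the mantissa-multiply / exponent-add datapath written out in Python. It only handles normal (non-zero, non-denormal, finite) single-precision inputs and truncates instead of rounding, so it is an illustration of the structure, not an IEEE-compliant implementation:

```python
import struct

def decompose(x):
    """Split an IEEE-754 single into sign, biased exponent, 24-bit mantissa."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    mant = (bits & 0x7FFFFF) | (1 << 23)  # restore implicit leading 1
    return sign, exp, mant

def fp_mul(a, b):
    """Multiply two normal floats the way an FPGA datapath would:
    one integer multiply for the mantissas, one integer add for the exponents."""
    sa, ea, ma = decompose(a)
    sb, eb, mb = decompose(b)
    m = ma * mb              # the 24x24 -> 48-bit integer multiply
    e = ea + eb - 127        # add biased exponents, remove one bias
    if m & (1 << 47):        # renormalize: the product lies in [1, 4)
        m >>= 24
        e += 1
    else:
        m >>= 23
    # reassemble, truncating the low bits (a real unit would round here)
    bits = ((sa ^ sb) << 31) | (e << 23) | (m & 0x7FFFFF)
    return struct.unpack('<f', struct.pack('<I', bits))[0]

print(fp_mul(3.0, 7.0))  # -> 21.0
```

The renormalization shift and the rounding logic (omitted here) are exactly the parts that eat LUTs on an FPGA, while the big integer multiply in the middle lands on the hardwired DSP blocks.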
FP addition is more expensive than FP multiplication, as there are no hardwired shifters or leading-one detectors in today’s FPGAs.
Also, when you take only power into account, FPGAs have an advantage in dynamic power. The routing matrix is configured statically and does not switch during the computation, whereas a CPU/GPU has to fetch, decode and process instructions, and move data back and forth between the register file and the processing units.
When it comes to power utilization on GPUs, I’ve noticed that many papers have used inefficient high-end GPUs with lower theoretical GFLOPS/watt numbers. These can vary anywhere from 3 to 12 GFLOPS/watt.
Another important issue is that people don’t measure actual power consumption under load, which, as SPWorley showed us (and as Sylvain’s paper displayed), is often far from peak, more like 50-65%. Hence GFLOPS_effective / watt_effective or bandwidth_eff / watt_eff is a very interesting number to see.
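The effective-efficiency metric is just a ratio of measured (not peak) numbers. As a trivial sketch, with made-up figures since the thread gives none:

```python
def effective_efficiency(peak_gflops, achieved_fraction, tdp_watts, load_fraction):
    """GFLOPS_effective / watt_effective.

    achieved_fraction: measured fraction of peak FLOPS your kernel sustains.
    load_fraction:     measured fraction of TDP actually drawn under load.
    Both must be measured for your own card and kernel; the values below
    are purely illustrative.
    """
    return (peak_gflops * achieved_fraction) / (tdp_watts * load_fraction)

# A hypothetical 1000-GFLOPS-peak card sustaining 60% of peak while
# drawing 65% of its 200 W TDP under load:
print(effective_efficiency(1000, 0.60, 200, 0.65))  # ~4.6 GFLOPS/W
```

Note how different this can be from the marketing number (1000 / 200 = 5 GFLOPS/W peak), and the gap widens quickly for kernels that sustain less of peak.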
So would you say that as soon as the problem size grows beyond what fits on-chip, FPGAs lose their advantage drastically? What type of RAM do they use, and what kind of bandwidth? I’m guessing it’s nowhere near that of the GPU, and I also don’t think there is the same type of context switching to hide off-chip latencies?
High-end FPGA dev boards typically have one or two 64-bit SO-DIMM sockets accepting DDR2-800 memory. This achieves at best 12.8 GB/s.
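That 12.8 GB/s is just the channel arithmetic. A quick sketch of the calculation, assuming the two SO-DIMM sockets are independent 64-bit channels:

```python
def channel_bandwidth_gbs(megatransfers_per_sec, bus_width_bits):
    """Peak bandwidth of one memory channel in GB/s (decimal GB).

    DDR2-800 means 800 MT/s on the data bus; a 64-bit bus moves
    8 bytes per transfer.
    """
    return megatransfers_per_sec * 1e6 * (bus_width_bits // 8) / 1e9

per_dimm = channel_bandwidth_gbs(800, 64)  # 6.4 GB/s per socket
print(2 * per_dimm)                        # 12.8 GB/s for two sockets
```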
With a custom board and one of the biggest FPGAs, you might be able to double or triple that bandwidth. That would still be tiny compared to the 190 GB/s a GeForce 580 can give you for 1/20th of the price…
However, that changes a bit when you scale things up. You can plug many FPGAs together with a fast interconnect (tens of GB/s). On the other hand, GPUs only have one PCIe interface and an anecdotal SLI link for such communication.
Latency hiding is not a problem. You just have to manage it manually. For regular applications that just stream data, this is trivially done with deep pipelining.
I’m very annoyed by the FPGA manufacturers. I can find figures for peak floating-point performance on the top-of-the-line Xilinx Virtex-5 stated at 192 GFLOPS, but finding the peak power consumption seems impossible. They just write “we have great performance per watt”. Show me the numbers!! :D
Actually, they do. They specify it in something like microwatts per megahertz per DSP48. Since different implementations may use different numbers of DSP48s per FPU, they leave it to the designer to calculate it for their specific implementation.
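For illustration, here is how such a coefficient scales up to a design-level estimate. Every number below is a placeholder, not vendor data; you would substitute the coefficient from the vendor’s power estimator for your actual part:

```python
def dsp_power_estimate_watts(uw_per_mhz_per_dsp, clock_mhz, dsps_per_fpu, num_fpus):
    """Scale a per-DSP dynamic-power coefficient up to a whole design.

    uw_per_mhz_per_dsp: vendor coefficient in microwatts/MHz/DSP48
                        (hypothetical here -- take it from the datasheet).
    This covers only the DSP blocks, not logic fabric, routing, or I/O.
    """
    return uw_per_mhz_per_dsp * 1e-6 * clock_mhz * dsps_per_fpu * num_fpus

# Hypothetical example: 20 uW/MHz/DSP48, a 400 MHz clock,
# 4 DSP48s per multiplier, 256 multipliers in the design:
print(dsp_power_estimate_watts(20, 400, 4, 256))  # -> 8.192 W for the DSP blocks
```

This is exactly the per-implementation calculation the manufacturers leave to the designer, which is also why they can’t print a single headline watt figure for the whole chip.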
Old discussion, but a really interesting subject. What kind of tools can be used to measure power consumption on a GPU/FPGA? I know it is possible to use a wattmeter plugged into the PSU, but maybe there are some more fine-grained tools?