Hi,
Someone in my company suggested dropping the GPUs and moving to FPGAs (I know only very little about these)…
but I thought I’d ask here for opinions…
What do people here think about this? I know there are a few posters here who probably have a big
GPU cluster and have probably played a bit with/checked out FPGAs as well - what do you think?
Without going into much detail, it is my opinion that programming environments for FPGAs are very difficult to work with compared to CUDA. I would wait for something like this to get picked up and released as a product: http://impact.crhc.illinois.edu/ftp/conference/sasp_09.pdf
The rule of thumb for FPGAs is that, compared to a custom (ASIC) chip, they typically let you fit about 1/4 the hardware in the same area running at about 1/4 the speed - so very roughly 1/16 the throughput per unit area of a full-custom design. The advantages are:
you don’t have to spend millions of dollars to fab your own chip
you can change the hardware any time that you want
The disadvantage is that you still have to design the hardware that goes on the FPGA. Peter Hofstee (lead architect for Cell) gave a talk a few years ago where he mentioned that the total design cost for the Cell processor was about 500 engineers working for 4-6 years, for a total NRE of about $500 million. A significant portion of that type of design (physical layout) is not required when doing the equivalent on an FPGA, but you are still looking at around two-thirds of the work and cost going into everything above VLSI layout.
That type of effort is impossible for the target users of FPGAs, so FPGA companies usually bundle in a compiler that starts from a high-level representation of a program and automatically generates hardware. The more abstract the compiler, the less efficient the generated hardware is on average. The advantage is that the hardware is designed to run only a single application, so it can be optimized for exactly that application. For example, if your application doesn’t use textures, the compiler can throw away the texture caches and interpolators and use the area for more cores.
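To make that concrete, here is a sketch of my own (not taken from any particular vendor’s toolchain) of the kind of C such a compiler digests well - fixed loop bounds and static access patterns, so the loop can be unrolled into a literal array of multipliers feeding an adder tree:

    /* Hypothetical input to a C-to-gates compiler: a 64-tap FIR-style
     * dot product.  The trip count is a compile-time constant and every
     * access is static, so the tool can fully unroll the loop into 64
     * parallel multipliers plus an adder tree. */
    #define TAPS 64

    int fir(const short sample[TAPS], const short coeff[TAPS])
    {
        int acc = 0;
        for (int i = 0; i < TAPS; ++i)   /* static bound: unrollable */
            acc += sample[i] * coeff[i];
        return acc;
    }

By contrast, a loop whose trip count depends on runtime data, or that chases pointers through a linked structure, gives the compiler nothing to unroll, and that is exactly where you end up writing HDL by hand.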
The hardware has the potential to be faster than a GPU. The highest-end FPGA that I could find (a Xilinx XC6VSX475T) running at 600 MHz could hit over 1.2 TOP/s, where each operation is a 25x18-bit integer add/multiply, if it were configured as an array of multipliers. The problem lies in taking a high-level application (say in C) and converting it into hardware. Some simple applications will map easily, but a lot of the time someone will be stuck doing manual hardware design to fill in the gaps where the compiler couldn’t generate efficient hardware.
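For what it’s worth, the back-of-the-envelope math behind that figure, assuming I have the part’s DSP count right (the XC6VSX475T should have about 2016 DSP48E1 slices, each retiring one multiply-accumulate per cycle):

    2016 DSP slices x 600 MHz x 1 MAC/cycle ≈ 1.21 x 10^12 ops/s ≈ 1.2 TOP/s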
I personally think hardware design is fun, but if I spent my time designing application-specific processors for FPGAs, it would take me years to do anything more complicated than matrix multiply.
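For contrast, here is roughly what the entire “design” for that same matrix multiply looks like on the GPU side - a naive, untuned CUDA kernel of my own, just to make the point about relative effort:

    // Naive CUDA matrix multiply: C = A * B for square N x N matrices.
    // Deliberately unoptimized - the point is that this is the whole
    // "hardware design" step on the GPU side.
    __global__ void matmul(const float* A, const float* B, float* C, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k)
                acc += A[row * N + k] * B[k * N + col];
            C[row * N + col] = acc;
        }
    }

    // Launched with a 16x16 thread block per output tile, e.g.:
    //   dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
    //   matmul<<<grid, block>>>(dA, dB, dC, N);

Getting a competitive version of the same thing out of an FPGA toolchain is where the months (or years) go.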