Please leave your replies below. Please also indicate whether using CUDA has been a success in terms of performance or ease of development compared with CPU development.
Personally, I worked on a problem that explores properties of division rings, which obtained an 8x speedup with an 8800 GTX over an OpenMP version (linear speedup) on a 3.2GHz Q6600 (overclocked quad core). This was my first CUDA program. The coding took much longer with CUDA: getting the code to run on the GPU did not take long, but getting it to run fast took quite a while.
I also worked on an image analysis program that calculates Gaussian properties for regions of an image, which achieved only a little over a 2x speedup compared with a single core of the same processor (it could easily be sped up with OpenMP, though, leaving the GPU behind). The number of individual items to compute was low (~5000), and examining the PTX, it looked like the number of memory operations was high compared to arithmetic operations. Also, the kernel was big, requiring 43 registers if I remember correctly, which greatly limited the occupancy. Obtaining this small speedup took much too long.
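For reference, the per-region computation described above amounts to only a handful of arithmetic operations per pixel. Here is a minimal CPU sketch (illustrative, not the poster's code) of computing Gaussian statistics for one region, which shows why the kernel ends up memory-bound rather than compute-bound:

```python
# Sketch of per-region Gaussian statistics: the mean and variance of the
# pixel intensities in a region. Each pixel contributes only a few
# arithmetic ops per memory read, so a GPU kernel built around this is
# dominated by memory traffic.
def region_gaussian_stats(pixels):
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    return mean, var
```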
For fun, I also implemented a fractal program, which achieves roughly a 20x speedup over an OpenMP version on a quad core. Taking the drawing overhead of the CPU version into account makes the GPU version look even better (the GPU version did not transfer the result back to the CPU). The coding was easier on the GPU than on the CPU, since the tricks necessary to improve the performance of a CPU version were not needed; one could focus on the fractal function. The speedup was trivial to obtain.
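The reason a fractal maps so cleanly to the GPU is that every pixel is independent. A minimal CPU sketch of the escape-time computation (illustrative, assuming a Mandelbrot-style fractal; not the poster's code) makes this clear, since on the GPU the double loop simply becomes one thread per pixel:

```python
# Minimal escape-time Mandelbrot sketch: each pixel is computed
# independently, which is why it parallelizes trivially on a GPU.
def mandelbrot_pixel(cx, cy, max_iter=256):
    zx = zy = 0.0
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:   # escaped the radius-2 disc
            return i                  # iteration count -> color
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter                   # assumed to be in the set

def render(width, height, x0=-2.0, x1=1.0, y0=-1.5, y1=1.5):
    # On the GPU, this double loop is replaced by the thread grid.
    return [[mandelbrot_pixel(x0 + (x1 - x0) * px / width,
                              y0 + (y1 - y0) * py / height)
             for px in range(width)] for py in range(height)]
```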
At my day job I work in the field of bioinformatics. To this point I have not found any great application of GPU to that field but am still looking.
We have been using CUDA for bioinformatics applications. Although currently only in a research setting, we have developed some code for, for example, sequence alignment.
The speedup for microarray data analysis with CUDA is enormous! This is to be expected, because most of the calculations are independent and can therefore be executed in parallel.
We are implementing a graphics processing toolbox with CUDA, ranging from image rotation to classification algorithms. Everything that can be parallelized is worth implementing with CUDA. The speedup is enormous.
We have implemented a Molecular Dynamics (MD) simulation which runs entirely on the GPU. It is intended for (coarse-grained) soft-matter research. We focused on three algorithms:
A) MD for long-ranged interactions, where each particle interacts with all other particles.
=> speedup: up to 80 times
B) MD for short-ranged interactions, where each particle interacts only with particles in its neighborhood (via cell lists).
=> speedup: up to 40 times
C) A random number generator according to the GNU rand48 algorithm (used for a Langevin thermostat).
=> speedup: up to 150 times (compared to the GNU stdlib implementation).
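For readers unfamiliar with rand48: it is a simple 48-bit linear congruential generator, which is why it parallelizes so well (each GPU thread can carry its own small state). A minimal CPU sketch of the recurrence (illustrative; the class name and seeding convention here follow the POSIX srand48 scheme, not the posters' code):

```python
# rand48 linear congruential generator (48-bit state):
#   X_{n+1} = (a * X_n + c) mod 2^48,  a = 0x5DEECE66D, c = 0xB
# On the GPU, each thread can run its own independent stream by giving
# every thread a distinct starting state.
A, C, M = 0x5DEECE66D, 0xB, 1 << 48

class Rand48:
    def __init__(self, seed):
        # srand48-style seeding: high 32 bits from the seed, low 16 fixed.
        self.x = ((seed & 0xFFFFFFFF) << 16) | 0x330E

    def drand(self):
        # Advance the state and map it to a double in [0, 1).
        self.x = (A * self.x + C) % M
        return self.x / M
```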
More details can (soon) be found at our MDGPU homepage. The site is still under construction, but the source code for the RNG is already available.
I implemented MD on the GPU too, though using a neighbor-list approach that is a little different from lord_jake’s.
Speedups vs. the CPU range from 30x to 60x depending on the benchmark. But benchmarks are only good at telling you how fast the benchmark is, and comparisons to CPU implementations depend heavily on how optimal the CPU code is. So I do not put much weight on those numbers. It may very well be that I could make the CPU version 1.5x faster with the right tweaks, though I have already done my best to make it optimal.
For a “practical” comparison I tested real simulations I’m doing currently for research on my GPU code vs. LAMMPS (http://lammps.sandia.gov/) running on the fastest computer cluster on campus (http://andrew.ait.iastate.edu/HPC/lightning/description.html). LAMMPS is a widely used and very fast software package. A single 8800 GTX GPU performs at the level of 24 processor cores on the cluster. I plan on developing a multi-gpu version as soon as I can get my hands on the hardware.
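For readers who have not seen the cell-list technique mentioned above: it reduces the short-ranged interaction search from O(N²) to roughly O(N) by binning particles into cells no smaller than the cutoff, so each particle only checks the 27 surrounding cells. A minimal CPU sketch (illustrative only; neither poster's actual code, and it assumes a cubic periodic box with at least 3 cells per side):

```python
# Cell-list neighbor search for short-ranged MD in a periodic cubic box.
from collections import defaultdict
from itertools import product

def build_cells(positions, box, cutoff):
    ncell = max(1, int(box // cutoff))   # cells per dimension (side >= cutoff)
    size = box / ncell
    cells = defaultdict(list)
    for i, (x, y, z) in enumerate(positions):
        cells[(int(x / size) % ncell, int(y / size) % ncell,
               int(z / size) % ncell)].append(i)
    return cells, ncell

def neighbors(i, positions, cells, ncell, box, cutoff):
    size = box / ncell
    cx, cy, cz = (int(c / size) % ncell for c in positions[i])
    out = []
    # Only the 27 surrounding cells are searched, not all N particles.
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        key = ((cx + dx) % ncell, (cy + dy) % ncell, (cz + dz) % ncell)
        for j in cells.get(key, []):
            if j == i:
                continue
            d2 = 0.0
            for a, b in zip(positions[i], positions[j]):
                h = abs(a - b)
                h = min(h, box - h)      # minimum-image convention
                d2 += h * h
            if d2 <= cutoff * cutoff:
                out.append(j)
    return out
```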
A paper discussing the algorithms has been submitted to the Journal of Computational Physics and is currently under review. The source code will also be released under an open-source license once a few more features have been added to make it more useful for performing everyday simulations.
My web page will be updated when the source is released: http://joaander.public.iastate.edu
We are using GPUs to make password recovery faster (link).
GPUs perform very well on cryptographic hash algorithms such as MD4, MD5 and SHA-1. We’ve got a 10x-20x speedup (depending on the CPU you’re comparing with), and I’m sure better performance is still possible.
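The structure of the problem is embarrassingly parallel: hash each candidate and compare against the target digest. A toy CPU sketch (illustrative only, with a deliberately tiny alphabet; on the GPU each thread would test a different candidate):

```python
# Toy brute-force password recovery: hash every candidate and compare
# with the target digest. Each candidate is independent, so on a GPU
# every thread simply tests a different candidate in parallel.
import hashlib
from itertools import product

def crack_md5(target_hex, alphabet="abc", max_len=4):
    for length in range(1, max_len + 1):
        for tup in product(alphabet, repeat=length):
            cand = "".join(tup)
            if hashlib.md5(cand.encode()).hexdigest() == target_hex:
                return cand
    return None   # not found in the search space
```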
We will be using GPUs in our ultrasound breast imaging system. The scanner creates volumetric breast images from scattered ultrasound which will hopefully aid in the detection, diagnosis, and treatment of breast cancer. (The device is not currently approved by the FDA or other regulatory body for sale or use - we are working on gaining those approvals now.) Here is a link to our website.
I have ported the inverse scattering algorithm that generates the volumetric images from a Fortran/MPI implementation to CUDA. A single 8800GTX can run the algorithm 3x faster than the 7 node Intel/Kontron Pentium M cluster currently in our clinical systems - a speedup of about 20x over one CPU. A single 8800GTX is about 5x faster than a single Core 2 Duo for this “naive” implementation of the algorithm. Performance will increase as I optimize the code, spread the computation over multiple GPUs, and switch to GPUs with more memory.
Performance isn’t the only advantage the GPU has over the cluster. A GPU-based computation node (including a CPU, motherboard, disk drive, etc) requires less space in the device, uses less electricity, generates less heat, and is less than 1/10th the cost. The GPU is more reliable than the cluster because it doesn’t need a shared filesystem on our RAID, cluster management software, multiple operating systems, or the FibreChannel fabric. A multi-GPU design scales better than the cluster in performance, power requirements, heat dissipation, and especially cost.
(Sorry I sound like a CUDA salesman… I am pretty excited about how the GPU compares to the technology we’ve been using.)
Implementation hasn’t been difficult - at least not once I caught on to some of the basic ideas of CUDA and GPU computing. I hadn’t done GPU computing until now, so it took me about a month of “messing around” to get the hang of things.
I agree with MisterAnderson42 that posting speedups in comparison to a single-CPU implementation you wrote yourself does not make much sense, because the numbers you get depend on how fast the CPU code you can write is. Did you try to compare it to widely used codes like NAMD/AMBER/LAMMPS/etc.?
People have reported that GPU-accelerated NAMD runs just 5 times faster than on the CPU. From this point of view, an 80x speedup is really something very different…
Did you use both cores of the Core2 for this comparison?
Oh, I totally agree that speedups (as well as “measurements” of GFLOPS, etc.) have to be taken with a grain of salt. Like MisterAnderson42, I wrote my CPU code as optimally as I could. But every benchmark depends on implementation, compiler, hardware, and the moon tides. I just added the numbers here because wildcat4096 asked for a statement of success. Our work was not about “who has the faster code or benchmark”, but about the fact that MD CAN be implemented to run EFFICIENTLY and ENTIRELY on a GPU. And whether the speedup is 20-fold or 80-fold, the message is the same.
The reason NAMD is “only” 5 times faster right now is that they are at an early development stage, and due to the complexity of their code they ported only parts of their program to the GPU. I expect them to perform better in the near future.
Yes. However, there are some things to keep in mind:
The performance improvement is for the entire run of the algorithm, of which only some steps run efficiently on the GPU.
Some steps run more efficiently on the GPU than others. In convolution, for example, our FFT sizes are too small to show an “eye-popping” improvement over the CPU (maybe 5x to 10x), but performing pointwise multiplies in large batches yields performance approaching 100x over the CPU.
The algorithm uses large datasets - upwards of 30GB - so many CPU<->GPU transfers are necessary, which hinders performance.
The Fortran/MPI implementation has been worked, re-worked, and optimized over many years by several different people. The CUDA implementation is only a few months old and so far only has my eyes looking at it. And I’m certainly no CUDA optimization expert.
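To make the convolution point above concrete: by the convolution theorem, convolution becomes a pointwise multiply between two transformed arrays, and that multiply is the part that batches so well on the GPU. A minimal pure-Python sketch (illustrative only, assuming power-of-two sizes; not the poster's code):

```python
# FFT-based circular convolution: conv(x, y) = IFFT(FFT(x) * FFT(y)).
# The pointwise multiply in the frequency domain is the step that
# batches extremely well on a GPU.
import cmath

def fft(a, invert=False):
    n = len(a)                        # n must be a power of two
    if n == 1:
        return a[:]
    sign = 1 if invert else -1
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def circular_convolve(x, y):
    X = fft([complex(v) for v in x])
    Y = fft([complex(v) for v in y])
    Z = [a * b for a, b in zip(X, Y)]           # the pointwise multiply
    n = len(x)
    return [v.real / n for v in fft(Z, invert=True)]
```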
I’m working on accelerating video codecs using CUDA at the moment. In my current project I get about a 12 times speedup (of the accelerated parts) compared to the quite heavily optimized CPU implementation.
I was doing some traditional graphics using CUDA. CUDA is surprisingly easy to use and sometimes even faster than the graphics pipeline. I once saw a 6x speedup after switching from DX10 to CUDA.
Porting SETI to the GPU… (batch mode for 1D FFT is 10x faster than the implementation in IPP/APL on C2D/A64X2/Barcelona).
I’m doing image undistortion for real-time tracking on HD video frames.
Kernel density estimation of 3D distributions of points for use in maximum likelihood fitting. (If all goes well, CUDA will be helping to measure the flux of neutrinos coming from the Sun.)
[Edit] Oh, and I forgot to mention: it’s about 8x faster than the best CPU algorithm we could come up with, and 40x faster if you compare equal work. Some CPU algorithm optimizations involve too much branching and can’t be efficiently ported to the GPU, which is why the final code was “only” 8x faster. Considering that a $550 8800 GTX is much cheaper than 7 additional computers, this has been a huge win for us.
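Kernel density estimation has the same one-thread-per-evaluation-point structure as the other success stories in this thread: the density at each point is a sum over all samples, independent of every other evaluation point. A minimal CPU sketch of the 3D Gaussian case (illustrative; function name and normalization are this sketch's, not the poster's):

```python
# 3D Gaussian kernel density estimate at a single evaluation point:
# an independent sum over all sample points, so on the GPU each
# evaluation point gets its own thread.
import math

def kde3(point, samples, h):
    # Isotropic Gaussian kernel with bandwidth h in each dimension.
    norm = (2.0 * math.pi) ** 1.5 * h ** 3 * len(samples)
    s = 0.0
    for sx, sy, sz in samples:
        d2 = ((point[0] - sx) ** 2 + (point[1] - sy) ** 2
              + (point[2] - sz) ** 2) / (h * h)
        s += math.exp(-0.5 * d2)
    return s / norm
```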
We’ve had some luck with voice recognition enhanced by CUDA.
I am trying to do something in search engines with CUDA, using its advantage in computing on massive amounts of data :)
Exploring the capabilities of CUDA in financial markets applications.
I started with some random number generators and quickly moved to Monte Carlo simulations of geometric Brownian motion and local volatility processes to price exotic financial derivatives. I will try to move to more complex models as time allows.
After MC, I would like to see if I can get enhancements in solving PDEs. Either a speedup or an increase in the number of dimensions would be welcome :)
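For readers curious why Monte Carlo under geometric Brownian motion is such a natural first CUDA target: each simulated path is independent, so every GPU thread can draw its own normal variate and terminal price. A minimal CPU sketch for a European call (illustrative parameters and function names, not the poster's code), with the closed-form Black-Scholes price as a sanity check:

```python
# Monte Carlo pricing of a European call under geometric Brownian motion:
#   S_T = S0 * exp((r - sigma^2/2) * T + sigma * sqrt(T) * Z),  Z ~ N(0,1)
# Each path is independent, so on the GPU every thread simulates its own
# path (or a batch of paths) in parallel.
import math
import random

def mc_call_price(s0, k, r, sigma, t, n_paths, seed=1):
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma * sigma) * t
    vol = sigma * math.sqrt(t)
    payoff_sum = 0.0
    for _ in range(n_paths):
        st = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        payoff_sum += max(st - k, 0.0)
    return math.exp(-r * t) * payoff_sum / n_paths

def bs_call_price(s0, k, r, sigma, t):
    # Closed-form Black-Scholes price, used only to sanity-check the MC.
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma * sigma) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return s0 * phi(d1) - k * math.exp(-r * t) * phi(d2)
```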