maximum flops?

Hi all,

I’m new to GPUs, investigating them for some scientific computation, and they look very exciting! I’ve been given a sudden deadline to estimate whether GPU processing will be sufficient for a project proposal that’s being prepared. Is there somewhere I can read about the maximum theoretical FLOPS for the top GPU chip(s), or about how to estimate this? Thanks for any advice; I plan to come back and look at all this more thoroughly when I have the time.

I saw in the FAQ that acceleration of data-parallel projects can range from 10x to 200x. That’s encouraging, but I need to find out how close to 100x we may get, at least as a best estimate (I imagine implementation will affect this a lot).

The algorithm needing acceleration is highly data parallel, with a minimally divergent kernel (I think that’s the term; it has few logic branches, just crunching through a massive step-wise integration).


GTX 200-series devices (including the Tesla C1060/S1070) can dual-issue a multiply-add and a multiply per streaming processor per shader clock cycle. This is how the GFLOPS column on the Wikipedia page is computed:

(Note that all the FLOPS estimates listed are single precision. For double precision, the device can complete one multiply-add per streaming multiprocessor per clock cycle. The GTX 280 has 240 streaming processors grouped into 30 streaming multiprocessors. That factor-of-8 difference is common to all CUDA devices at the moment.)

An important thing to keep in mind is device-to-host and device memory bandwidth limitations. In many applications, these turn out to be the limiting factors, and not floating point performance. Conservatively, you should assume that you can get 5 GB/sec between the CPU and the GPU on a PCI-Express 2.0 motherboard. You can also assume, at best, you’ll get about 80% of the theoretical maximum device memory bandwidth in your kernel. (The theoretical max bandwidths are also in that Wikipedia table.) Reaching that level of device bandwidth requires that you read contiguous blocks of memory across your active threads. If your memory access pattern is random, you will get far less memory bandwidth.

Another thing to keep in mind is that multi-GPU solutions (like the GTX 295, the S1070, or just a computer with multiple cards installed) require you to manage each of the devices individually in your program. You will only see linear performance scaling if your task can be partitioned between devices with no communication between devices required.

yep, it is the new era of supercomputing, just like the internet in the 90s. big booms to come in this industry.

It is all about hardware (provided that your algorithm isn’t the problem); just like car tuning, you have to be good at “mechanics”. For example, did you know that some SATA hard disks come from the factory with a jumper that limits them to 150 MB/sec instead of 300? If you have a motherboard that supports 300 MB/sec (most do), you can remove that jumper and get 2x on all your disk accesses. That’s what the speed-up is all about: taking the most advantage of your hardware. When I changed my app to use XMM registers, I got a speed-up of about 1000x on the CPU. You don’t need GPUs to start speeding up; check this excellent guide for optimizing software:

Thanks, this is all very helpful. If I understand so far, my app should only need an initial memory transfer to store some matrices in GPU memory, and then the work is to crunch through these during a step-wise integration, so memory transfer hopefully won’t be an issue.

And I can probably use separate GPUs by assigning each one a different matrix-vector multiplication needed at each step of the integration. These can each happen separately, with only a single vector passed in and out at each time step from GPU-local memory.

I’m hoping the algorithm can be reworked to use single precision rather than double! Or, better yet, probably fixed point. Is double-long fixed point supported? I assume it’s used by some people, but in GPUs?



Thanks, I’ll check out the site, looks very interesting!