Maximum FLOPS?

Hi all,

I’m new to GPUs, investigating them for some scientific computation, and they look very exciting! I’ve been given a sudden deadline to estimate whether GPU processing will be sufficient for a project proposal that’s being prepared. Is there somewhere I can read about the maximum theoretical FLOPS for the top GPU chip(s), or about how to estimate this? Thanks for any advice; I plan to come back and look at all this more thoroughly when I have the time.

I saw in the FAQ that acceleration of data-parallel projects can range from 10x-200x. That’s encouraging, but I need to find out how close to 100x we may get, at least as a best estimate (I imagine the implementation will affect this a lot).

The algorithm needing acceleration is highly data parallel, with a minimally divergent kernel (I think that’s the term; it has few logic branches, just crunching through a massive step-wise integration).

Thanks,
Michael

GTX 200-series devices (including the Tesla C1060/S1070) can dual-issue a multiply-add and a multiply per streaming processor per shader clock cycle. This is how the GFLOPS column is computed on the Wikipedia page:

https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units

(Note that all the FLOPS estimates listed are single precision. For double precision, the device can complete one multiply-add per streaming multiprocessor per clock cycle. The GTX 280, for example, has 240 streaming processors grouped into 30 streaming multiprocessors. That factor of 8 difference is common to all CUDA devices at the moment.)
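As a rough sanity check, the peaks work out like the sketch below (the ~1.296 GHz shader clock is my assumption for a GTX 280; take the exact numbers for your card from that Wikipedia table). Your real kernel will land well below these figures; they are only the ceiling for back-of-the-envelope feasibility estimates.

```
// Back-of-the-envelope peak FLOPS for a GTX 280 (specs assumed from the
// Wikipedia table; substitute your actual card's numbers).
#include <stdio.h>

int main(void)
{
    const double shader_clock_hz = 1.296e9;  // ~1.296 GHz shader clock (assumed)
    const int    sp_count        = 240;      // streaming processors
    const int    sm_count        = 30;       // streaming multiprocessors

    // Single precision: dual-issued MAD (2 flops) + MUL (1 flop) per SP per clock.
    double sp_peak = sp_count * 3.0 * shader_clock_hz;

    // Double precision: one MAD (2 flops) per SM per clock.
    double dp_peak = sm_count * 2.0 * shader_clock_hz;

    printf("Single precision peak: %.0f GFLOPS\n", sp_peak / 1e9);  // ~933
    printf("Double precision peak: %.0f GFLOPS\n", dp_peak / 1e9);  // ~78
    return 0;
}
```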

An important thing to keep in mind is device-to-host and device memory bandwidth limitations. In many applications, these turn out to be the limiting factors, and not floating point performance. Conservatively, you should assume that you can get 5 GB/sec between the CPU and the GPU on a PCI-Express 2.0 motherboard. You can also assume, at best, you’ll get about 80% of the theoretical maximum device memory bandwidth in your kernel. (The theoretical max bandwidths are also in that Wikipedia table.) Reaching that level of device bandwidth requires that you read contiguous blocks of memory across your active threads. If your memory access pattern is random, you will get far less memory bandwidth.
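To make the "contiguous blocks across your active threads" point concrete, here is a minimal sketch (the kernel names are my own) contrasting a coalesced access pattern with a strided one. The first can approach that ~80% figure; the second will not.

```
// Coalesced: consecutive threads read consecutive addresses, so the hardware
// can combine the reads of a half-warp into a few wide memory transactions.
__global__ void scale_coalesced(const float *in, float *out, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i];
}

// Strided: consecutive threads touch addresses 'stride' elements apart, so
// each read becomes its own transaction and effective bandwidth collapses.
__global__ void scale_strided(const float *in, float *out, int n, float a, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = a * in[i];
}
```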

Another thing to keep in mind is that multi-GPU solutions (like the GTX 295, the S1070, or just a computer with multiple cards installed) require you to manage each of the devices individually in your program. You will only see linear performance scaling if your task can be partitioned between devices with no communication between devices required.
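A rough sketch of that partitioning is below (illustrative only; with the CUDA releases current for this hardware, each GPU context has to be driven by its own host thread, so the body of this loop would really live in one worker thread per device).

```
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    if (device_count == 0) { printf("no CUDA devices found\n"); return 1; }

    const int total_n = 1 << 20;                 // total work items (example size)
    const int per_gpu = total_n / device_count;  // each GPU gets its own slice

    for (int dev = 0; dev < device_count; ++dev) {
        cudaSetDevice(dev);   // bind to GPU 'dev' (one host thread per device)

        float *d_chunk = NULL;
        cudaMalloc(&d_chunk, per_gpu * sizeof(float));

        // ... copy this device's share of the data in, launch kernels, copy results back ...
        // Linear scaling only holds if the slices never need to exchange data;
        // any cross-GPU communication has to bounce through host memory.

        cudaFree(d_chunk);
    }
    return 0;
}
```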

Yep, it is the new era of supercomputing, just like the internet in the ’90s. Big booms to come in this industry.

It is all about hardware (provided that your algorithm isn’t the problem); just like car tuning, you have to be good at the “mechanics”. For example, did you know that some SATA hard disks come from the factory with a jumper that limits them to 150 MB/sec instead of 300? If you have a motherboard that supports 300 MB/sec (most do), you can remove that jumper and get 2x on all your disk accesses. That’s what the speedup is all about: taking the most advantage of your hardware. When I changed my app to use XMM (SSE) registers, I got a speedup of about 1000x on the CPU. You don’t need GPUs to start speeding up; check this excellent guide for optimizing software:

http://www.agner.org/optimize/

Thanks, this is all very helpful. If I understand so far, my app should only need an initial memory transfer to store some matrices in GPU memory, and then the work is to crunch through these during a step-wise integration, so hopefully memory transfer won’t be an issue.

And I can probably use separate GPUs by assigning each one a different matrix-vector multiplication needed at each step of the integration. These can each happen separately, with only a single vector passed in and out at each time step, working from GPU-local memory.
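Something like this is what I have in mind (just a sketch of my plan; the kernel and names are made up, and I gather a tuned library routine such as CUBLAS’s sgemv would do the matrix-vector part better):

```
#include <cuda_runtime.h>

// Naive matrix-vector multiply: y = A * x, with A stored row-major (rows x cols).
// Illustrative only; a tuned library routine would be faster.
__global__ void matvec(const float *A, const float *x, float *y, int rows, int cols)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows) {
        float sum = 0.0f;
        for (int c = 0; c < cols; ++c)
            sum += A[r * cols + c] * x[c];
        y[r] = sum;
    }
}

// Hypothetical driver: the matrix goes to the GPU once, then each time step
// only moves one small vector in and one small vector out.
void integrate(const float *h_A, float *h_x, int n, int steps)
{
    float *d_A, *d_x, *d_y;
    cudaMalloc(&d_A, n * n * sizeof(float));
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    cudaMemcpy(d_A, h_A, n * n * sizeof(float), cudaMemcpyHostToDevice);  // once

    for (int s = 0; s < steps; ++s) {
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);  // small
        matvec<<<(n + 255) / 256, 256>>>(d_A, d_x, d_y, n, n);
        cudaMemcpy(h_x, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);  // small
        // ... host-side update of h_x for the next step, if any ...
    }

    cudaFree(d_A); cudaFree(d_x); cudaFree(d_y);
}
```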

I’m hoping the algorithm can be reworked to use single precision rather than double! Or better yet, fixed point. Is double-long (64-bit) fixed point supported? I assume it’s used by some people, but on GPUs?

Cheers,

Michael

Thanks, I’ll check out the site; it looks very interesting!