The ability of GPUs to do long calculations

ethoma7329 · June 8, 2012, 7:37pm

Hello,

I wrote a kernel to do some simulations of 3D heat transfer. The problem I am having is that the version of my program running in parallel on an 8-core Dell Studio XPS outpaces my GTS-240 GPU. I have tried many things to get it to run faster, but I have come to the conclusion that it is just the calculation itself that is too large. The calculation has about 35 FLOP, and I simply launch a thread for each cell in the simulation. Still, I only get about 33.4 million cells/second on the GPU and 40.4 million cells/second on the CPU. It was my understanding that GPUs excel at tasks like this, where there are 1.7 million cells per time step that each need this calculation done on them.

I also have 28 array accesses per calculation, all in normal GPU memory.

I was hoping that people more experienced with GPU programming than I am could give me some advice. Is the calculation too large to do with a GPU? The single calculation alone (the array accesses and FLOP together) takes 35 ms per time step. Is this more or less typical? I don’t really see how I can make it any faster. Would the job go much faster with a Tesla? Roughly how much faster?

Thank you.

njuffa · June 8, 2012, 8:25pm

Cross reference to StackOverflow thread, which gives a snippet of code not shown above plus some information on the HW configuration:

http://stackoverflow.com/questions/10955337/are-gpus-limited-in-their-ability-to-do-larger-calculations

It is impossible to give good advice based on the scant information provided. Based on your stated ratio of FLOPs to memory accesses, it would appear your code is probably bandwidth limited. If so, make sure you are maximizing effective bandwidth by coalescing accesses and by using texture or constant memory where appropriate.
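To put rough numbers on the bandwidth argument, here is a minimal sketch that turns the figures quoted in the first post (1.7 million cells, about 35 FLOP and 28 array accesses per cell, 35 ms per time step) into an arithmetic intensity and an effective bandwidth. The names `StencilEstimate` and `estimate` are made up for this illustration, and the element size is an assumption (4 bytes for float, 8 for double) since the thread does not say which is used. It also assumes every array access goes to device memory, which overstates traffic if neighbouring threads reuse values.

```cpp
#include <cassert>

// Back-of-the-envelope figures for a one-thread-per-cell stencil.
// All inputs are taken from the thread except bytes_per_value, which is assumed.
struct StencilEstimate {
    double flop_per_byte;   // arithmetic intensity
    double gbytes_per_sec;  // effective bandwidth if no access is served from cache
    double gflops;          // achieved arithmetic rate
};

inline StencilEstimate estimate(double cells, double flop_per_cell,
                                double accesses_per_cell, double bytes_per_value,
                                double seconds_per_step) {
    assert(seconds_per_step > 0.0 && accesses_per_cell > 0.0 && bytes_per_value > 0.0);
    const double bytes = cells * accesses_per_cell * bytes_per_value;  // traffic per step
    const double flop = cells * flop_per_cell;                         // work per step
    return {flop / bytes,
            bytes / seconds_per_step / 1e9,
            flop / seconds_per_step / 1e9};
}
```

Calling `estimate(1.7e6, 35.0, 28.0, 4.0, 0.035)` gives about 0.31 FLOP per byte, roughly 5.4 GB/s and 1.7 GFLOP/s under the float assumption. An intensity that low is consistent with a memory-bound kernel, and comparing the bandwidth figure with the card's peak memory bandwidth from its spec sheet shows how much headroom better access patterns might recover.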

The code snippet posted on StackOverflow shows many floating-point divisions. It is not clear whether these are single-precision or double-precision divisions, or how you are compiling your code (-prec-div, -use_fast_math flags). Assuming you are at least partially limited by computation on account of the many divisions, check whether you can re-arrange the computation to minimize the number of divisions. If this is single-precision computation, try using approximate divisions instead of IEEE-rounded ones (obviously this may impact the accuracy of your results, so keep an eye out for that).
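As an illustration of re-arranging a computation to remove divisions, here is a minimal sketch of a per-cell update for explicit heat conduction. It is a hypothetical stand-in, not the code posted on StackOverflow: the names `Coeffs`, `make_coeffs`, `update_naive` and `update_hoisted`, the uniform diffusivity `alpha`, and the grid spacings are all assumptions. The idea is to fold the divisions into coefficients computed once on the host, so the per-cell path contains only multiplies and adds. The `HD` macro lets the same functions compile as plain C++ or as device functions under nvcc.

```cpp
#include <cassert>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical coefficients alpha*dt/dx^2 etc., computed once per run.
struct Coeffs { float ax, ay, az; };

inline Coeffs make_coeffs(float alpha, float dt, float dx, float dy, float dz) {
    assert(dx > 0.0f && dy > 0.0f && dz > 0.0f);
    return {alpha * dt / (dx * dx), alpha * dt / (dy * dy), alpha * dt / (dz * dz)};
}

// Straightforward form: three divisions per cell per time step.
HD inline float update_naive(float t, float xm, float xp, float ym, float yp,
                             float zm, float zp, float alpha, float dt,
                             float dx, float dy, float dz) {
    return t + alpha * dt * ((xm - 2.0f * t + xp) / (dx * dx) +
                             (ym - 2.0f * t + yp) / (dy * dy) +
                             (zm - 2.0f * t + zp) / (dz * dz));
}

// Re-arranged form: no divisions in the per-cell path.
HD inline float update_hoisted(float t, float xm, float xp, float ym, float yp,
                               float zm, float zp, Coeffs c) {
    return t + c.ax * (xm - 2.0f * t + xp) +
               c.ay * (ym - 2.0f * t + yp) +
               c.az * (zm - 2.0f * t + zp);
}
```

The two forms agree up to floating-point rounding, so results may differ in the last bits. Where a division by a per-cell quantity cannot be hoisted, the CUDA intrinsic `__fdividef(x, y)` or compiling with `-use_fast_math` trades accuracy for speed in single precision, with the accuracy caveat noted above.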

As was already noted by talonmies on StackOverflow, you seem to be comparing a fairly state-of-the-art multi-core CPU system with an older GPU that was mid-range even when it was released.

ymc · June 9, 2012, 12:56am

You could give a GTX 580 a shot before you think about a Tesla. You should get around 5x the GFLOPS with a 580.

ymc · June 9, 2012, 1:09am

Do you do your calculations in double or float? If double, then you need a Tesla or Quadro.
