I wrote a kernel to run 3D heat-transfer simulations. The problem I am having is that the parallel version of my program running on an 8-core Dell Studio XPS outpaces my GTS-240 GPU. I have tried many things to make it run faster, but I have come to the conclusion that the calculation itself is simply too large. The calculation involves about 35 FLOPs per cell, and I launch one thread per cell in the simulation. Still, I only get about 33.4 million cells/second on the GPU versus 40.4 million/second on the CPU. It was my understanding that GPUs excel at tasks like this, where there are 1.7 million cells per time step that each need this calculation done on them.

I also have 28 array accesses per calculation, all to ordinary (global) GPU memory.

I was hoping that people more experienced with GPU programming than I am could give me some advice. Is the calculation too large to do well on a GPU? The calculation alone (the array accesses and FLOPs together) takes 35 ms per time step. Is this more or less typical? I don't really see how I can make it any faster. Would the job go much faster on a Tesla? Roughly how much faster?

It is impossible to give good advice based on the scant information provided. Based on your stated ratio of FLOPs to memory accesses, it would appear your code is probably bandwidth-limited. If so, make sure you are maximizing effective bandwidth by coalescing accesses and by using texture or constant memory where appropriate.
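As a rough sanity check on the bandwidth hypothesis: 33.4 million cells/s × 28 accesses × 4 bytes is only about 3.7 GB/s of effective bandwidth, a small fraction of the card's theoretical peak, which is another hint that the accesses may not be coalescing. Below is a minimal sketch (not your actual kernel; the name, parameters, and the simple explicit update are illustrative) of a layout where a 7-point stencil coalesces: the grid is a flat array with x as the fastest-varying index, so mapping threadIdx.x to x makes the threads of a warp load consecutive floats in one transaction.

```cuda
// Illustrative 7-point 3D heat stencil with coalesced global loads.
// Assumes an x-major flat array: index = (z * ny + y) * nx + x.
__global__ void heat_step(const float *in, float *out,
                          int nx, int ny, int nz, float alpha)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= nx - 1 || y < 1 || y >= ny - 1) return;

    // March along z inside each thread; a 2D grid of 2D blocks suffices
    // and also works on compute-capability 1.x parts like the GTS 240.
    for (int z = 1; z < nz - 1; z++) {
        int i = (z * ny + y) * nx + x;      // x-major linearization
        float c = in[i];
        out[i] = c + alpha * (in[i - 1]       + in[i + 1]        // x neighbors
                            + in[i - nx]      + in[i + nx]       // y neighbors
                            + in[i - nx * ny] + in[i + nx * ny]  // z neighbors
                            - 6.0f * c);
    }
}
```

With this layout the x-neighbor loads are adjacent in memory and the y/z-neighbor loads are contiguous across the warp as well, just at a stride between rows/planes; if your current code instead makes neighboring threads step through memory at a large stride, every warp load turns into many separate transactions.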

The code snippet posted on StackOverflow shows many floating-point divisions. It is not clear whether these are single-precision or double-precision divisions, or how you are compiling your code (-prec-div, -use_fast_math flags). Assuming you are at least partially limited by computation on account of the many divisions, check whether you can rearrange the computation to minimize the number of divisions. If this is single-precision computation, try using approximate divisions instead of IEEE-rounded ones (obviously this may impact the accuracy of your results, so keep an eye out for that).

As was already noted by talonmies on StackOverflow, you seem to be comparing a fairly state-of-the-art multi-core CPU system with an older, mid-range (at the time) GPU.