Performance Optimization

Hi all,

I am porting several numerical solvers from massively parallel and vector platforms to GPGPU. I have 10 years of experience on the Cray C90 and Y-MP, and about 20 years on the Cray T3D, T3E, RS and Linux clusters. So I am a numerical expert, not a CG expert, which is probably the main reason I cannot figure out what I should be doing with GPUs.

I now have Fedora 7 installed on a dual Xeon with 2 GB of main memory and a GeForce 8800 GTX, plus millions of lines of massively parallel Fortran and C source.

My goal is to reach 100 GFlop/s with my applications on one GTX; otherwise I have no reason to use GPUs. My algorithms already reach 4-5 GFlop/s on one dual core, and if I switch from double to float, most of them require 2-3 times more computational work.

Please suggest a good user manual for performance tuning; the Programming Guide V 1.0 does not seem to be enough.

If there is no such book, can I post some questions about tuning here?



There is no book yet on specific optimizations for the G80 architecture. The CUDA programming manual is actually a very good starting point, as it says a lot about the hardware; one rarely finds this in other publications. For really low-level optimizations, take a look at the PTX assembler manual for the GPU VM.

Floating-point performance on GPUs is usually very easy to get at; one rarely needs to think about it. Your optimizations on GPUs will almost always be centered on memory alignment, bandwidth, and latency, especially on the G80 architecture, which has several levels of memory: some have a cache, and for others access can be synchronized. If you browse the discussions on this board, you'll see that getting your head around the memory model is the hardest part for most people. If you post your questions here, I am sure people will try to be helpful.
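To make the alignment/bandwidth point concrete, here is a minimal sketch of the classic coalescing rule on G80: consecutive threads of a half-warp should read consecutive words of global memory. The kernel names, sizes, and stride value below are illustrative, not from this thread; `cudaThreadSynchronize` is the CUDA 1.x name (later toolkits call it `cudaDeviceSynchronize`).

```cuda
#include <cuda_runtime.h>

// Uncoalesced: thread i reads a[i * stride], so neighbouring threads touch
// words that are far apart, and the hardware issues one memory transaction
// per thread.
__global__ void strided_copy(const float *a, float *b, int stride, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = a[i * stride];
}

// Coalesced: thread i reads a[i], so a half-warp's loads fall in one
// contiguous, aligned segment and merge into a single wide transaction.
__global__ void coalesced_copy(const float *a, float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = a[i];
}

int main(void)
{
    const int n = 1 << 20;
    const int stride = 8;               // hypothetical worst-case stride
    float *a, *b;
    cudaMalloc((void **)&a, (size_t)stride * n * sizeof(float));
    cudaMalloc((void **)&b, (size_t)n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    strided_copy<<<grid, block>>>(a, b, stride, n);   // slow on G80
    coalesced_copy<<<grid, block>>>(a, b, n);         // fast: unit stride
    cudaThreadSynchronize();

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Both kernels do the same arithmetic (none); timing them with the CUDA profiler shows the bandwidth gap is purely a memory-layout effect, which is why restructuring data often matters more than tuning the flops.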


Edit: there are some optimization hints in this lecture.

Dear Peter,

thank you very much for your answer and the good reference to the lecture notes. I hope to find a way to tune my algorithms for the GPU. Indeed, the Crays have the same architecture of vector registers and memory banks; the world is going back to the same technology.