CUDA/OpenCL Optimization: finding a compromise between time spent optimizing and performance


I’m currently working on a research project as a trainee in a big financial software company.
We intend to test the CUDA and OpenCL technologies on our algorithms, to benchmark them, to try analysis and debugging tools, and so on…

If the GPU performance turns out to be interesting compared to the time needed to modify and optimize our existing algorithms, we’ll adopt GPU computing for good. We are currently working on partnerships with NVIDIA, HP and Dell so as to offer customers our solutions directly on optimized servers using GPU computing, and we are about to receive a new computer equipped with a Tesla GPU (I’m still working in emuDebug mode for the moment).

Here is my question: what are the best steps to follow to reach the best compromise between time spent on optimization and performance gained?

Having tested CUDA, I noticed this:

  • non-optimized programs are not that interesting compared to the CPU version; optimization can significantly increase performance
  • you can spend a lot of time rewriting programs to optimize them further without any significant increase in performance

Here are the steps that seem the most important:

  • use parallelism (at least ^^)
  • use coalesced global memory accesses
  • use as many of the GPU’s Streaming Multiprocessors as possible
  • use shared memory
  • use tiles and thread collaboration
  • try to make the best use of resources when dividing work into blocks and threads (dynamic partitioning)
  • data prefetching
  • avoid shared memory bank conflicts
  • bandwidth improvements
  • find the best thread granularity
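Several of the points above (coalesced accesses, shared memory, tiling, bank conflicts) show up together in the classic tiled matrix transpose. The following is only a sketch — the kernel name and tile size are illustrative assumptions, not from the post:

```cuda
// Illustrative sketch: shared-memory tiling with coalesced loads and stores.
// TILE is an assumed block dimension; the kernel transposes an n x n matrix.
#define TILE 16

__global__ void transpose_tiled(const float *in, float *out, int n)
{
    // The +1 padding offsets columns across banks and avoids
    // shared-memory bank conflicts when reading the tile column-wise.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced read: consecutive threads (threadIdx.x) touch
    // consecutive global memory addresses.
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    __syncthreads();

    // Swap the block indices so that the write is also coalesced,
    // even though the matrix is being transposed.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```

A naive transpose would write `out[x * n + y]` directly, which makes the store uncoalesced; routing the data through the shared tile is what keeps both the read and the write coalesced.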

What are the most important points, the ones that significantly increase performance without taking too much time?
Did I forget important points?
How/where can I find examples of well-optimized programs (those in the NVIDIA samples are not that optimized compared to all the possibilities)?

As far as OpenCL is concerned, it is supposed to be easier to optimize, but I can’t find a word on this, whereas CUDA documentation is abundant. The same question goes for OpenCL: how to best optimize it, getting good performance without wasting too much time modifying the whole source code?

Thanks for your help/advice!
Have a good day.

  1. Get them to invest in some simple card so that you can do real CUDA and not emulated debugging. The GeForce 240, for example, should be available for ~$100. That’s no money for a big company (you cost them more by wasting time in emulation mode).

  1. It really depends:
  • if you are compute bound, you need to improve your algorithm and make sure you have fine enough granularity to fully utilize the card

  • if you are communication bound, you need to improve your memory access patterns

With the GT200, coalescing isn’t as important as it used to be (sometimes the contortions needed to achieve coalescing cost more than what you gain).

From personal experience, good communication (reducing global memory accesses) is the most critical part (most algorithms are bandwidth limited). Using textures will get you most of the results; combining them with shared memory usage and avoiding bank conflicts is usually 90% of the optimization.

What I usually do is use textures to read into shared memory, work in shared memory, and write efficiently.
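A minimal sketch of that read-through-texture pattern, using the texture-reference API of that era — the kernel, the tile size, and the placeholder computation are all illustrative assumptions:

```cuda
// Sketch: fetch through a texture into shared memory, work there,
// then do one coalesced global write per thread.
texture<float, 1, cudaReadModeElementType> texIn;  // bound to the input buffer on the host
                                                   // with cudaBindTexture()

__global__ void process(float *out, int n)
{
    // One element per thread; assumes blockDim.x == 256.
    __shared__ float buf[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Texture fetch: cached, and tolerant of access patterns
    // that would not coalesce as plain global loads.
    buf[threadIdx.x] = tex1Dfetch(texIn, i);
    __syncthreads();

    // Work in shared memory (placeholder computation), then write
    // the result back with consecutive threads hitting consecutive
    // addresses, so the store is coalesced.
    out[i] = 2.0f * buf[threadIdx.x];
}
```

The shared-memory stage only pays off when threads in a block actually reuse each other’s data; for purely element-wise work like this placeholder, the texture fetch alone does most of the job.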

The rest is good numerical methods.

Yep, we are waiting for a Tesla card that should arrive soon (I hope).
Thanks for the advice; I’ll now avoid wasting time and focus on memory issues.

I don’t like OpenCL.