CUDA/OpenCL optimization: finding a compromise between time spent optimizing and performance

Hi,

I’m currently working on a research project as a trainee in a big financial software company.
We intend to test the CUDA and OpenCL technologies on our algorithms, benchmark them, try out analysis and debugging tools, and so on…

If GPU performance turns out to be interesting compared to the time needed to modify and optimize our existing algorithms, we’ll adopt GPU computing for good. We are currently working on partnerships with NVIDIA, HP and Dell so that we can offer customers our solutions directly on optimized servers using GPU computing, and we are about to receive a new machine equipped with a Tesla GPU (I’m still working in emuDebug mode for the moment).

Here is my question: what are the best steps to follow to get the best compromise between time spent on optimization and performance gained?

Having tested CUDA, I noticed the following:

  • non-optimized programs are not that interesting compared to the CPU version; optimization can significantly increase performance
  • you can spend a lot of time rewriting a program to optimize it further without any significant increase in performance

Here are the steps that seem the most important:

  • use parallelism (at least ^^)
  • use coalesced global memory accesses
  • use as many of the GPU’s streaming multiprocessors as possible
  • use shared memory
  • use tiles and collaboration between threads (see the sketch after this list)
  • try to make the best use of resources when dividing work into blocks and threads (dynamic partitioning)
  • use data prefetching
  • avoid memory (bank) conflicts
  • improve bandwidth usage
  • find the best thread granularity
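
To illustrate what I mean by a few of these points, here is a minimal tiled matrix-multiply sketch I put together, assuming square N × N matrices with N a multiple of the tile width (the kernel name and TILE value are purely illustrative, not from our actual algorithms). It shows coalesced global loads, a shared-memory tile per block, and collaboration between the threads of a block:

```cuda
// Illustrative tile width; 16 x 16 = 256 threads per block.
#define TILE 16

// Computes C = A * B for square N x N matrices, N assumed to be a multiple of TILE.
__global__ void matMulTiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Consecutive threads read consecutive addresses -> coalesced global loads.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // The block collaborates: each element loaded from global memory
        // is reused TILE times from fast shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;  // coalesced write
}
```

With a 16 × 16 tile, each global element is loaded once and reused 16 times from shared memory, which is where most of the speedup over the naive version comes from.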

What are the most important points, the ones that increase performance significantly without taking much time?
Did I forget any important points?
How/where can I find examples of well-optimized programs (the ones in the NVIDIA samples are not that optimized compared to everything that is possible)?

As far as OpenCL is concerned, it is supposed to be easier to optimize, but I can’t find a word about it, whereas the CUDA documentation is abundant. Still, the same question applies to OpenCL: how do you best optimize it and get good performance without wasting too much time modifying the whole source code?

Thanks for your help/advice!
Have a good day.

  1. Get them to invest in a simple card so that you can run real CUDA instead of emuDebug. The GeForce 240 should be available for ~$100. That’s no money for a big company (you cost them more by wasting time in emulation mode).

http://www.amazon.com/EVGA-nVidia-GeForce-…2011&sr=1-1

for example

  2. It really depends:
  • if you are compute bound, you need to improve your algorithm and make sure you have fine enough granularity to fully utilize the card

  • if you are communication bound, you need to improve your memory access patterns

With GT200, coalescing isn’t as important as it used to be (sometimes the games you have to play to achieve coalescing cost more than what you gain).

From personal experience, good communication (reducing global memory accesses) is the most critical part (most algorithms are bandwidth limited). Using textures will get you most of the results; combining them with shared memory usage and avoiding bank conflicts is usually 90% of the optimization.

What I usually do is use textures to read into shared memory, work in shared memory, and write the results back efficiently, roughly like the sketch below.
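
Something like this (a rough sketch only, written against the texture-object API; the kernel name, BLOCK_SIZE and RADIUS are made-up values for a 1D smoothing example, not from any real code of mine):

```cuda
#define BLOCK_SIZE 128   // launch with blockDim.x == BLOCK_SIZE
#define RADIUS 2         // halo width of the 1D smoothing window

__device__ int clampIdx(int i, int n) { return min(max(i, 0), n - 1); }

// Reads through the texture path into shared memory, averages a small
// window there, then writes the result back to global memory coalesced.
__global__ void smooth(cudaTextureObject_t inTex, float *out, int n)
{
    __shared__ float tile[BLOCK_SIZE + 2 * RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage this block's working set (plus a halo) through the texture cache.
    tile[threadIdx.x + RADIUS] = tex1Dfetch<float>(inTex, clampIdx(gid, n));
    if (threadIdx.x < RADIUS) {
        tile[threadIdx.x] =
            tex1Dfetch<float>(inTex, clampIdx(gid - RADIUS, n));
        tile[threadIdx.x + RADIUS + BLOCK_SIZE] =
            tex1Dfetch<float>(inTex, clampIdx(gid + BLOCK_SIZE, n));
    }
    __syncthreads();

    // All the arithmetic happens in shared memory; consecutive threads hit
    // consecutive banks, so there are no bank conflicts here.
    if (gid < n) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            acc += tile[threadIdx.x + RADIUS + k];
        out[gid] = acc / (2 * RADIUS + 1);  // coalesced write
    }
}
```

The input array has to be bound to a cudaTextureObject_t on the host side first; on older toolkits the equivalent is a texture reference and tex1Dfetch on it.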

The rest is good numerical methods.

Yep, we are waiting for a Tesla card that should arrive soon (I hope).
Thanks for the advice, I should avoid wasting time now by focusing on memory issues.

I don’t like OpenCL