Kernel calculation optimization: Best way to perform low level calculations

mvergo · August 25, 2008, 3:06pm

For computationally intensive kernels using CUDA 2.0, what is the best way to handle the low-level calculations? The visual profiler doesn't provide fine enough granularity to optimize specific calculations, so are there any rules of thumb for choosing low-level optimization targets?

For example, when using floats, is it best to use the “*” multiply operator or the __fmul_[rn,rz] intrinsics, given that the goal is to optimize latency, not numeric accuracy?
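
For concreteness, here is a minimal sketch of the two forms being compared. The kernel names and the `a * b + c` expression are made up for illustration. One documented difference: the compiler may fuse a plain `*` with a following add into a single multiply-add instruction, whereas `__fmul_rn` is never merged that way. Which one is faster in a given kernel would still have to be measured.

```cuda
#include <cuda_runtime.h>

// Plain operators: the compiler may contract the multiply and the add
// into one multiply-add instruction.
__global__ void scale_add_plain(const float* a, const float* b,
                                const float* c, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] * b[i] + c[i];
}

// Intrinsic multiply: round-to-nearest, never merged with the add.
__global__ void scale_add_intrinsic(const float* a, const float* b,
                                    const float* c, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __fmul_rn(a[i], b[i]) + c[i];
}
```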

Another case that comes to mind is using complex arithmetic. What is the best way to perform multi-operation transforms such as a complex conjugate multiply?
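
As an illustration of the kind of transform meant here, a straightforward complex conjugate multiply could be sketched as follows. This assumes complex values are stored as `float2` (x = real, y = imaginary); `conj_mul` and `conj_mul_kernel` are hypothetical names, not library functions.

```cuda
#include <cuda_runtime.h>

// Computes a * conj(b) for complex numbers stored as float2
// (x = real part, y = imaginary part). Hypothetical helper.
// Four multiplies and two add/subtracts written with plain operators,
// so the compiler is free to contract them into multiply-adds.
__host__ __device__ inline float2 conj_mul(float2 a, float2 b)
{
    return make_float2(a.x * b.x + a.y * b.y,   // real part
                       a.y * b.x - a.x * b.y);  // imaginary part
}

// One complex product per thread.
__global__ void conj_mul_kernel(const float2* a, const float2* b,
                                float2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = conj_mul(a[i], b[i]);
}
```

For example, with a = 1 + 2i and b = 3 + 4i, the helper returns 11 + 2i.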

jack · September 3, 2008, 11:44pm

I just stumbled upon this website…perhaps it will help you out:

http://www.cs.rug.nl/~wladimir/decuda/

E.D_Riedijk · September 4, 2008, 8:06am

Here are a few guidelines:

Keep your code simple. Experienced people often say that the simplest code turned out to be the fastest in the end. I also often find myself rewriting kernels for more speed, only to find that the straightforward version was the fastest.
Coalesce your memory reads and writes.
Avoid bank conflicts in shared memory.
Avoid excessive branching (divergent branching within a warp).
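
To illustrate the coalescing guideline, here is a rough sketch with two hypothetical copy kernels: one where neighbouring threads touch neighbouring addresses, and one where they do not. The names and the `stride` parameter are assumptions for the example only.

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i reads and writes element i, so the threads of a
// warp touch consecutive addresses.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read elements `stride` apart (wrapping
// at n), which spreads one warp's reads over many memory transactions.
// The writes are still coalesced.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int)(((long long)i * stride) % n);
        out[i] = in[j];
    }
}
```

Timing both kernels on the same buffers gives a quick feel for how much an access pattern costs on a particular card.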

Once you have written a kernel, the first thing to do is count how much memory it reads and writes, then calculate the GB/s you are achieving. Compare that with the bandwidth reported by bandwidthTest (device->device). Only if you are not getting close to that number is it important to optimize the calculation. In a LOT of cases performance is bound by memory bandwidth (all of my kernels are, and I have yet to encounter someone who has a computation-bound kernel).
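
The bandwidth arithmetic can be sketched as a small host-side helper. The function name is hypothetical, 1 GB is taken as 1e9 bytes, and the run time is assumed to come from your own timing (for example `cudaEventElapsedTime`, which reports milliseconds).

```cuda
#include <cstddef>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes), given the number of
// bytes a kernel reads and writes and its measured run time in seconds.
inline double effective_bandwidth_gbs(std::size_t bytes_read,
                                      std::size_t bytes_written,
                                      double seconds)
{
    double total_bytes = static_cast<double>(bytes_read)
                       + static_cast<double>(bytes_written);
    return total_bytes / seconds / 1e9;
}
```

For example, a kernel that reads 16,777,216 floats and writes the same number in 2 ms moves 134,217,728 bytes, which works out to roughly 67 GB/s.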

Often it is faster to redo a calculation than to read its result from memory…
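
A minimal sketch of that trade-off, using a made-up per-element weight of 1/(i+1): the first kernel loads the weight from a precomputed table, the second recomputes it from the thread index and saves one global read per element. Both kernel names are hypothetical, and whether the second version wins depends on how expensive the recomputation is.

```cuda
#include <cuda_runtime.h>

// Loads a precomputed weight for every element:
// two global reads per output.
__global__ void apply_weight_lookup(const float* in, const float* weight,
                                    float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * weight[i];
}

// Recomputes the same weight from the index:
// one global read per output, plus a little arithmetic.
__global__ void apply_weight_recompute(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * (1.0f / (float)(i + 1));
}
```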
