Techniques for Kernel Optimization

Eric3918 · July 29, 2010, 12:18pm

How does one really go about optimizing a kernel in practice, i.e. when it is your job? The CUDA Profiler is OK, but it does not give all the information I need, and on top of that it is very buggy. It also crashes when I try to run anything but the smallest simulation.

My strategy so far has been to use guidance from the programming guide and Kirk and Hwu’s book to guess what is holding me back, and then try a fix for it. Unfortunately this is starting to feel like I’m just taking shots in the dark. The best example is that the shared memory implementation of my FDTD algorithm is actually much slower than a global-memory-only implementation.

This is becoming very frustrating, and it’s not at all clear to me that CUDA is well developed enough for use by anyone other than GPU experts with significant experience. Where are the tools that I really need to get the job done? Does it require a true CUDA/GPU expert to create simulations which are really fast and useful?

Gregory_Diamos · July 29, 2010, 12:36pm

If you look at writing high-performance code for CPUs, you run into similar problems. You still have to worry about things like float-to-int conversion being significantly slower on NetBurst than on the Core architecture, cache contention in multi-threaded applications on newer CPUs with shared caches, loop tiling/blocking/unrolling, prefetching, data structure layout in the cache, etc.

No tool is going to be smart enough to squeeze every last ounce of performance out of a system for you.

This is generally the approach that I take:

1. Determine the most computationally intensive part of your application, either via algorithmic analysis or profiling.
2. Determine the factor most likely limiting performance (compute bound vs. memory bound comes to mind, but communication via atomics/synchronization can also be an issue). Do this by analyzing your algorithm.
3. Start with a skeleton program that hits the machine limit that will eventually bound you, but doesn’t do any useful work.
4. Incrementally add functionality until you have a complete application. After you add each component, benchmark the program to see if you are still hitting your limit. If you are, fine: you haven’t exhausted your headroom yet. If your performance degrades, determine why, and decide whether the component you just added is worth the overhead. Possibly try to tune that component to get closer to your limit.
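
To make the second step concrete, here is a minimal sketch of the kind of back-of-the-envelope analysis I mean, written as plain host-side C++. It is an illustration, not a tool: the struct and function names are made up for this post, and the peak throughput, peak bandwidth, and per-element FLOP and byte counts are all numbers you have to supply yourself from your hardware's spec sheet and from counting operations in your own kernel.

```cpp
#include <algorithm>

// Hypothetical machine description. Fill these in from the spec sheet of
// your own device or from a measured microbenchmark; none are given here.
struct MachineLimits {
    double peak_gflops;         // peak arithmetic throughput, GFLOP/s
    double peak_bandwidth_gbs;  // peak global memory bandwidth, GB/s
};

// Per-element cost of one kernel, counted by hand from the algorithm.
struct KernelCost {
    double flops_per_element;  // floating-point operations per output element
    double bytes_per_element;  // global memory traffic per output element
};

// Arithmetic intensity in FLOP per byte of global memory traffic.
double arithmetic_intensity(const KernelCost& k) {
    return k.flops_per_element / k.bytes_per_element;
}

// The kernel is memory bound if bandwidth caps it below peak compute.
bool is_memory_bound(const MachineLimits& m, const KernelCost& k) {
    return arithmetic_intensity(k) * m.peak_bandwidth_gbs < m.peak_gflops;
}

// Upper bound on achievable GFLOP/s for this kernel on this machine.
double attainable_gflops(const MachineLimits& m, const KernelCost& k) {
    return std::min(m.peak_gflops,
                    arithmetic_intensity(k) * m.peak_bandwidth_gbs);
}

// Lower bound on runtime in seconds for n elements: whichever of the
// compute time and the memory time is larger.
double min_seconds(const MachineLimits& m, const KernelCost& k, double n) {
    double compute_s = n * k.flops_per_element / (m.peak_gflops * 1e9);
    double memory_s = n * k.bytes_per_element / (m.peak_bandwidth_gbs * 1e9);
    return std::max(compute_s, memory_s);
}

// Fraction of the machine limit a measured run achieved (1.0 = at the limit).
double fraction_of_limit(const MachineLimits& m, const KernelCost& k,
                         double n, double measured_seconds) {
    return min_seconds(m, k, n) / measured_seconds;
}
```

After adding each component in the fourth step, time the kernel and pass the measurement to `fraction_of_limit`. If the fraction stays close to 1.0 you still have headroom; if it drops sharply, the component you just added is the one to look at.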

Be cognisant of the optimizations that you have available (coalescing, unrolling, shared memory caching, register pressure, occupancy, etc.), but don’t spend time on them if you can do something simple and hit the machine limit.

If nothing works in terms of low level optimizations, bump up a level and re-architect your algorithm.
