Techniques for Kernel Optimization

How does one really go about optimizing a kernel in practice, given that this is my job? The CUDA Profiler is OK, but it does not give me all the information I need, and on top of that it is very buggy. It also crashes when I try to run anything but the smallest simulations.

My strategy so far has been to use guidance from the programming guide and Kirk and Hwu’s book to guess at what is holding me back, and then try a fix. Unfortunately, this is starting to feel like taking shots in the dark. The best example is that the shared-memory implementation of my FDTD algorithm is actually much slower than a global-memory-only implementation.

This is becoming very frustrating, and it is not at all clear to me that CUDA is well developed enough for use by anyone other than GPU experts with significant experience. Where are the tools I really need to get the job done? Does it take a true CUDA/GPU expert to create simulations that are actually fast and useful?

If you look at writing high-performance code for CPUs, you run into similar problems. You still have to worry about things like float-to-int conversion being significantly slower on NetBurst than on the Core architecture, cache contention in multi-threaded applications on newer CPUs with shared caches, loop tiling/blocking/unrolling, prefetching, data structure layout in the cache, and so on.
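
To make the loop tiling/blocking point concrete, here is a minimal CPU-side sketch: a blocked matrix transpose that works on one tile at a time so the working set stays in cache, rather than striding across the whole matrix. Both the transpose itself and the 64-element tile size are illustrative assumptions; the right block size depends on the cache you are targeting.

```
#include <cstddef>

// Cache blocking: process the matrix in BLOCK x BLOCK tiles so each tile
// stays resident in cache. BLOCK = 64 is an assumed starting point to tune.
const std::size_t BLOCK = 64;

void transpose_blocked(const float* in, float* out, std::size_t n)
{
    for (std::size_t ii = 0; ii < n; ii += BLOCK)
        for (std::size_t jj = 0; jj < n; jj += BLOCK)
            // One tile at a time; the extra bounds checks handle ragged edges.
            for (std::size_t i = ii; i < ii + BLOCK && i < n; ++i)
                for (std::size_t j = jj; j < jj + BLOCK && j < n; ++j)
                    out[j * n + i] = in[i * n + j];
}
```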

No tool is going to be smart enough to squeeze every last ounce of performance out of a system for you.

This is generally the approach that I take:

  1. Determine the most computationally intensive part of your application either via algorithmic analysis or profiling.

  2. Determine the factor most likely to limit performance (compute-bound vs. memory-bound comes to mind, but communication via atomics/synchronization can also be an issue). Do this by analyzing your algorithm.

  3. Start with a skeleton program that hits the machine limit that will eventually bound you, but doesn’t do any useful work (see the sketch after this list).

  4. Incrementally add functionality until you have a complete application. After you add each component, benchmark the program to see whether you are still hitting your limit. If you are, fine: you haven’t exhausted your headroom yet. If performance degrades, determine why and decide whether the component you just added is worth the overhead; possibly try to tune that component to get closer to your limit.
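
To make the first three steps concrete for a stencil code like FDTD, which does only a handful of flops per point against several reads and a write and is therefore usually memory-bound, here is a rough sketch of such a skeleton: a kernel that only streams the arrays through global memory, timed with CUDA events so you get an effective-bandwidth number to compare against the board’s peak. The array size and launch configuration are placeholders to match to your own problem; every later version that adds real update math gets benchmarked against this number.

```
#include <cstdio>
#include <cuda_runtime.h>

// Skeleton for step 3: no useful work, just the global-memory traffic an
// FDTD-style update would generate (one read and one write per point).
__global__ void stream_copy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // later: add the real update terms here, one at a time
}

int main()
{
    const int n = 1 << 24;                     // placeholder problem size (~16M points)
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    stream_copy<<<(n + 255) / 256, 256>>>(in, out, n);   // placeholder launch config
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // One read + one write per element; compare against the board's peak bandwidth.
    double gbytes = 2.0 * n * sizeof(float) / 1e9;
    printf("effective bandwidth: %.1f GB/s\n", gbytes / (ms / 1e3));

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```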

Be cognizant of the optimizations you have available (coalescing, unrolling, shared-memory caching, register pressure, occupancy, etc.), but don’t spend time on them if you can do something simple and still hit the machine limit.
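
As a quick illustration of the coalescing item, here is a minimal sketch of the difference between consecutive threads touching consecutive addresses and a strided access pattern; the strided kernel moves the same amount of useful data but generates far more memory transactions, and that difference shows up directly against the bandwidth number from the skeleton above.

```
// Coalesced: thread k reads element k, so a warp's 32 loads fall into a few
// contiguous memory transactions.
__global__ void coalesced_read(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: neighbouring threads are `stride` elements apart, so the same
// warp scatters its loads across many transactions.
__global__ void strided_read(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}
```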

If nothing works in terms of low-level optimizations, bump up a level and re-architect your algorithm.