Optimizing an application that will run on both Kepler and Maxwell

At this point I have two distinct CUDA/C executables, which will be run on Windows 7 machines with either a GTX 980 or a GTX 780 Ti.

I compiled two different applications not only because of the code generation differences between Maxwell and Kepler, but also because of the way Maxwell handles shared memory and because of the superior memory bandwidth of the GTX 780 Ti.

In other words, the kernel code is different, and is optimized and compiled separately for each GPU. Also, since the GTX 780 Ti has less memory, I had to make some adjustments to memory management.

Overall, the total running time of the two applications is roughly the same, but the individual pieces of the application have varying running times, depending mainly on how much of the work is memory-bandwidth dependent. This application is about 50% compute bound and 50% bandwidth bound. It is a large application which can take up to 5 minutes to run, using full device resources for the majority of that time.

My question, then: is there a guide which covers how to handle different hardware across different generations of GPUs with different amounts of device memory (assume Kepler and up)?

I know that the examples in the CUDA SDK do a good job of this, and I would like to learn from them. In the end I would prefer not to have a bunch of different files for each type of possible GPU.

Are there any simple examples of an all-in-one approach?

I think the best approach is very much a question of your use case (e.g., how extensive are the differences) and personal taste.

My first preference is always to have as little variant code as possible. This can require a lot of thought to find a good compromise that works reasonably close to optimal across multiple architectures and across multiple other parameters (e.g. matrix sizes, matrix aspect ratios). Of course, that is not always possible: there will be cases where truly generic code leaves too much performance on the table.

In those cases, I first try to construct unified kernels with variant local code driven by template parameters or (increasingly less frequently these days) #if __CUDA_ARCH__, possibly with that code extracted into inline functions or macros. This allows maintenance of a single code base. However, if there is a large number of local differences, or there are significant algorithmic differences between variants (e.g. for DP-rich architectures vs DP-poor architectures), such code can become hard to read and maintain, and at that point I switch to multiple kernels with separate code bases.
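A minimal sketch of such a unified kernel, under stated assumptions: the names scale_kernel, load_input, blocks_for and launch_scale are hypothetical, and the ITEMS_PER_THREAD values 4 and 2 are placeholders rather than tuned settings. The template parameter carries the per-architecture tuning knob, and the only architecture-specific line is isolated in an inline device function guarded by __CUDA_ARCH__.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Architecture-specific code isolated in one inline helper (hypothetical name).
__device__ __forceinline__ float load_input(const float* p)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 350) && (__CUDA_ARCH__ < 500)
    return __ldg(p);   // sm_35 Kepler: load through the read-only data cache
#else
    return *p;         // all other targets: plain global load
#endif
}

// ITEMS_PER_THREAD is an illustrative tuning knob selected per GPU family.
template <int ITEMS_PER_THREAD>
__global__ void scale_kernel(const float* in, float* out, int n, float factor)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
#pragma unroll
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int idx = tid + i * stride;   // neighbouring threads touch neighbouring elements
        if (idx < n) out[idx] = load_input(in + idx) * factor;
    }
}

// Blocks needed so that blocks * threads * items covers n elements.
inline int blocks_for(int n, int threads, int items)
{
    int per_block = threads * items;
    return (n + per_block - 1) / per_block;
}

template <int ITEMS>
void launch_scale(const float* d_in, float* d_out, int n, float factor)
{
    const int threads = 256;
    scale_kernel<ITEMS><<<blocks_for(n, threads, ITEMS), threads>>>(d_in, d_out, n, factor);
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        std::printf("no usable CUDA device, or allocation failed\n");
        return 1;
    }
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Placeholder heuristic: pick the instantiation by compute capability.
    if (prop.major >= 5) launch_scale<4>(d_in, d_out, n, 2.0f);
    else                 launch_scale<2>(d_in, d_out, n, 2.0f);

    cudaMemcpy(host.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("out[0] = %f\n", host[0]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Building one executable with several -gencode targets (for example compute capability 3.5 and 5.2, where the toolkit still supports them) lets the driver pick the matching device code at load time, while the host code above picks the template instantiation.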

The selection of kernel variants at run time (whether due to architectural reasons or some other parameter, and whether generated from a unified code base or separate code bases) I usually perform via function pointers that are initialized at application or library initialization time, based on heuristics. If there are just a few variants (say, up to four) I might also use a simple if-then-else construct or switch() statement.
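The function-pointer dispatch could be sketched roughly as follows. This is an illustration under assumptions, not a recommended heuristic: saxpy_kepler, saxpy_maxwell, select_saxpy and init_dispatch are hypothetical names, the two kernels have identical bodies here and stand in for genuinely different variants, and the block sizes and the "compute capability major >= 5" rule are placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two stand-in kernel variants; real variants would differ in their bodies.
__global__ void saxpy_kepler(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

__global__ void saxpy_maxwell(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Host-side launchers share one signature so they fit one pointer type.
typedef void (*saxpy_launcher)(int n, float a, const float* x, float* y);

static void launch_kepler(int n, float a, const float* x, float* y)
{
    saxpy_kepler<<<(n + 127) / 128, 128>>>(n, a, x, y);   // placeholder block size
}

static void launch_maxwell(int n, float a, const float* x, float* y)
{
    saxpy_maxwell<<<(n + 255) / 256, 256>>>(n, a, x, y);  // placeholder block size
}

// The heuristic is kept as a pure function of the compute capability.
static saxpy_launcher select_saxpy(int cc_major)
{
    return (cc_major >= 5) ? launch_maxwell : launch_kepler;
}

static saxpy_launcher g_saxpy = nullptr;   // set once at initialization

// Called once at application or library start-up.
bool init_dispatch(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    g_saxpy = select_saxpy(prop.major);
    return true;
}

int main()
{
    if (!init_dispatch(0)) {
        std::printf("no usable CUDA device\n");
        return 1;
    }
    const int n = 1 << 16;
    float *x = nullptr, *y = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    g_saxpy(n, 2.0f, x, y);   // call sites never mention the architecture
    cudaError_t err = cudaDeviceSynchronize();
    std::printf("saxpy status: %s\n", cudaGetErrorString(err));

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Keeping the heuristic separate from the device query means it can be exercised without a GPU, and extending it to other parameters (problem size, aspect ratio) only changes select_saxpy.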

I will try an approach similar to what you suggested.

I do have the advantage of not needing DP, and of not needing to deal with older GPUs. The main issue, of course, is memory management and bandwidth, but I imagine that code generated for the GTX 980 will perform better, without adjustment, on the upcoming (presumably Maxwell-based) GTX 980 Ti or GTX Titan 2.
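For the memory-management side, one possible sketch is to size the working set from what the device reports at run time instead of hard-coding it per GPU. The helper name chunk_elems, the 80% safety margin and the problem size below are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstddef>
#include <algorithm>
#include <cuda_runtime.h>

// Elements per pass that fit in roughly 80% of the free device memory.
// The 80% margin is an arbitrary illustrative choice.
inline std::size_t chunk_elems(std::size_t free_bytes,
                               std::size_t bytes_per_elem,
                               std::size_t total_elems)
{
    std::size_t budget = free_bytes / 10 * 8;
    std::size_t fit    = budget / bytes_per_elem;
    return std::min(fit, total_elems);
}

int main()
{
    std::size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 1;
    }

    const std::size_t total_elems = static_cast<std::size_t>(1) << 30;  // hypothetical problem size
    std::size_t chunk = chunk_elems(free_b, sizeof(float), total_elems);
    if (chunk == 0) {
        std::printf("not enough free device memory\n");
        return 1;
    }

    // Number of passes needed to stream the whole data set through the device.
    std::size_t passes = (total_elems + chunk - 1) / chunk;
    std::printf("free %zu of %zu bytes, %zu elements per pass, %zu passes\n",
                free_b, total_b, chunk, passes);
    return 0;
}
```

With this kind of sizing, a card with less memory simply runs more passes over smaller chunks, and the same source file serves both GPUs.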

Will use the code samples in the CUDA SDK as references.