IPP-like mechanism for loading the 'best' implementation for the specific GPU architecture?

Is there a way to instruct the NVCC compiler to compile for several compute capabilities, so that at runtime the GPU driver automatically loads the best implementation for the compute capability of the GPU?
The Intel Performance Primitives (IPP) library, for example, implements a similar mechanism (although there it is done by the library developers themselves).

So I have different implementations, e.g. for the Fermi architecture (CC 2.0) and for prior architectures (< CC 2.0), in the source code files via

#if (__CUDA_ARCH__ < 200)
// kernel implementation for compute capability < 2.0
#else
// kernel implementation for compute capability >= 2.0
#endif

and would like to get rid of manually checking the CC of the device on which the kernel is executed and then calling the appropriate kernel.

Using #if (__CUDA_ARCH__ < 200) works if both versions of the kernel take the same arguments, but sometimes you need to change the algorithm slightly between implementations. In that case, implement an _sm20 and an _sm10 version of your kernel, use #if (__CUDA_ARCH__ >= 200) only to guard the body of the _sm20 one so that it still compiles for older targets, and use the device properties of the current GPU to decide which code path to take.
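
A minimal sketch of that dispatch, assuming a hypothetical kernel pair scale_sm10 / scale_sm20 and a hypothetical host wrapper launch_scale (none of these names come from a real library). The kernel bodies are placeholders, and the only point is the runtime check on prop.major:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical pre-Fermi variant: one element per thread.
__global__ void scale_sm10(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical Fermi-and-later variant. The guard leaves the body empty
// when this kernel is compiled for targets below CC 2.0.
__global__ void scale_sm20(float *data, int n, float factor)
{
#if __CUDA_ARCH__ >= 200
    // Grid-stride loop, standing in for a CC >= 2.0 specific algorithm.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        data[i] *= factor;
#endif
}

// Host-side dispatch: query the current device and pick the code path.
void launch_scale(float *d_data, int n, float factor)
{
    int device = 0;
    cudaGetDevice(&device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    if (prop.major >= 2)
        scale_sm20<<<blocks, threads>>>(d_data, n, factor);
    else
        scale_sm10<<<blocks, threads>>>(d_data, n, factor);
}

int main()
{
    const int n = 1024;
    float h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    launch_scale(d_data, n, 2.0f);
    cudaDeviceSynchronize();

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    printf("h_data[0] = %f\n", h_data[0]);
    return 0;
}
```

In real code the two variants could also take different argument lists, which is the case where a single #if inside one kernel is no longer enough.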

Look at the --generate-code arch=xx,code=yy option to nvcc to generate fat binaries for multiple GPU architectures (giving the option multiple times will cause multiple compiler invocations, one for each architecture). The best fit for the actual GPU will then automatically be selected at runtime.
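
As a rough illustration of the fat-binary route, the sketch below uses a hypothetical kernel which_arch that records which compiled variant actually ran. The file name and the build line in the comment are assumptions: compute_10/sm_10 and compute_20/sm_20 are only accepted by older toolkits, so substitute architectures your nvcc version supports.

```cuda
/*
 * Possible build line (one command, older toolkit assumed):
 *   nvcc --generate-code arch=compute_10,code=sm_10
 *        --generate-code arch=compute_20,code=sm_20
 *        -o which_arch which_arch.cu
 */
#include <cstdio>
#include <cuda_runtime.h>

// Each --generate-code pass compiles this kernel with a different
// __CUDA_ARCH__ value, so each embedded variant takes a different branch.
__global__ void which_arch(int *out)
{
#if __CUDA_ARCH__ >= 200
    *out = 200;   // variant built for CC >= 2.0
#else
    *out = 100;   // variant built for CC < 2.0
#endif
}

int main()
{
    int *d_out = 0;
    int h_out = 0;
    cudaMalloc((void **)&d_out, sizeof(int));

    // No manual CC check here: the runtime picks the matching variant.
    which_arch<<<1, 1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    printf("kernel variant selected: %d\n", h_out);
    return 0;
}
```

This only covers kernels whose variants share one signature; when the argument lists differ, the host still has to choose between separately named kernels.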