First of all, this is NOT about portability between CPU and GPU or between ATI and nVidia. This is ONLY about portability between different CUDA versions and different nVidia architectures.
My CUDA-based code runs on GTX200 architecture. I would like to run the same source code (but possibly a different executable) on Fermi.
I would also like the same source code to compile on both CUDA 2.x and CUDA 3.x, at least for the time being.
I have a few questions:
Do I need to compile the code with different flags, so that it utilizes Fermi more efficiently? I don’t need any Fermi features that are not available on GTX200, and I don’t need double precision. Of course, I’d prefer to use the same build flags for both GTX200 and Fermi, unless performance on Fermi suffers under generic flags.
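For reference, my current idea is to build a fat binary that embeds native code for both architectures (this is how I understand the -gencode flags of the CUDA 3.x nvcc, so please correct me if I have it wrong):

```shell
# Build a fat binary with native code for both GT200 (sm_13) and Fermi (sm_20).
# Under CUDA 2.x, which predates sm_20, I would drop the second -gencode clause.
nvcc -O2 \
     -gencode arch=compute_13,code=sm_13 \
     -gencode arch=compute_20,code=sm_20 \
     -o myapp myapp.cu
```

My understanding is that the runtime then picks the matching cubin at load time, so the same executable could serve both cards, but I am not sure this is the recommended way.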
I would like to configure the device to offer more shared memory at the expense of cache. I found function cudaThreadSetCacheConfig in CUDA 3.2 that seems relevant to my task. As far as I understand, that function is not available in earlier versions of CUDA.
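What I have in mind is roughly the following (a sketch based on my reading of the CUDA 3.2 docs; I believe cudaFuncCachePreferShared is the enum value that requests the 48 KB shared / 16 KB L1 split on Fermi):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Ask the runtime to prefer a larger shared-memory partition over L1 cache.
// On Fermi this should select the 48 KB shared / 16 KB L1 split; pre-Fermi
// devices have no configurable split, so presumably this is a no-op there.
void preferSharedMemory()
{
    cudaError_t err = cudaThreadSetCacheConfig(cudaFuncCachePreferShared);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaThreadSetCacheConfig: %s\n",
                cudaGetErrorString(err));
}
```
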
Since cudaThreadSetCacheConfig(…) is not in CUDA 2.x, and I need to compile within that framework, I imagine adding something like
#if CUDA_VERSION > 3.0
to the initialization section of my code, e.g. where I invoke cudaSetDeviceFlags(cudaDeviceBlockingSync) [BTW, does the order of these two invocations matter?]
What macro should I use to determine that CUDA version is above 3.0?
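To make the intent concrete, the guard I imagine looks roughly like this (assuming CUDART_VERSION is the right macro for the runtime API; I believe the headers encode it as major*1000 + minor*10, so that CUDA 3.2 is 3020, but that is exactly what I am asking about):

```cpp
#include <cuda_runtime.h>

void initDevice()
{
    cudaSetDeviceFlags(cudaDeviceBlockingSync);

#if CUDART_VERSION >= 3000
    // Only compiled under CUDA 3.x and later; under CUDA 2.x the function
    // (and the cudaFuncCache enum) do not exist, so this block drops out.
    cudaThreadSetCacheConfig(cudaFuncCachePreferShared);
#endif
}
```
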
Is the call to cudaThreadSetCacheConfig valid on pre-Fermi architectures in CUDA 3.x environments? Should I test for the device properties before this call?
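If a runtime capability check is indeed required, I would guess something along these lines (a sketch; my assumption is that compute capability major >= 2 identifies Fermi and the configurable L1/shared split):

```cpp
#include <cuda_runtime.h>

// Returns true if the given device has a configurable L1/shared-memory
// split, i.e. compute capability 2.0 (Fermi) or later.
bool hasConfigurableCache(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return false;
    return prop.major >= 2;
}
```
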
The doc for cudaThreadSetCacheConfig states that the GPU does not guarantee that the requested shared memory/cache policy is honored, even on Fermi. Is there a stricter function that actually guarantees the device’s shared/cache memory is partitioned the way I want, or else returns an error?