Hi,
I’ve just installed my GTX 480 and am trying to benchmark it. The application I’m using required about 124 registers per thread on the GT200 generation in double precision. Since NVIDIA decided to limit Fermi to only 63 registers per thread, the end result is that Fermi is slower than GT200 at double precision for my code :-(
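For reference, this is how I’m measuring register usage, via ptxas’s verbose output (the file name here is just a placeholder):

[codebox]# Check per-thread register usage with ptxas verbose output
# (-arch=sm_20 targets Fermi; mykernel.cu is a placeholder name)
nvcc -arch=sm_20 --ptxas-options=-v -c mykernel.cu
# ptxas then prints a "Used N registers" line per kernel, and on
# Fermi also reports spill stores/loads once it runs out of registers[/codebox]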
This is with the default cache configuration of 48 KB / 16 KB for shared memory / L1. Switching to 16 KB / 48 KB may help, since spilled registers would then go to the larger L1 instead of out to L2 or device memory. Looking at the programming guide, this is done with the cudaFuncSetCacheConfig function; the relevant example is:
[codebox]// Device code
__global__ void MyKernel() { ... }

// Host code

// Runtime API
// cudaFuncCachePreferShared: shared memory is 48 KB
// cudaFuncCachePreferL1: shared memory is 16 KB
// cudaFuncCachePreferNone: no preference
cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferShared);

// Driver API
// CU_FUNC_CACHE_PREFER_SHARED: shared memory is 48 KB
// CU_FUNC_CACHE_PREFER_L1: shared memory is 16 KB
// CU_FUNC_CACHE_PREFER_NONE: no preference
CUfunction myKernel;
cuFuncSetCacheConfig(myKernel, CU_FUNC_CACHE_PREFER_SHARED);[/codebox]
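Here is the stripped-down .cu file I’m testing with; the empty kernel body and the trivial launch are just placeholders for my real application:

[codebox]// repro.cu -- minimal test case (kernel body and launch are placeholders)
#include <cstdio>
#include <cuda_runtime.h>

__global__ void MyKernel() { }

int main()
{
    // The call from the programming guide, asking for the 16 KB shared /
    // 48 KB L1 split so that register spills land in L1:
    cudaError_t err = cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1);
    if (err != cudaSuccess)
        printf("cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));

    MyKernel<<<1, 1>>>();
    cudaThreadSynchronize();
    return 0;
}[/codebox]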
However, this does not compile, and I instead have to surround MyKernel with quotes, i.e., "MyKernel", to get it to compile (the quoted call is sketched after the error below). When running, though, I get the following error:
[codebox]error: no instance of overloaded function "cudaFuncSetCacheConfig" matches the argument list
argument types are: (, cudaFuncCache)[/codebox]
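For completeness, the quoted variant that does get past the compiler looks like this; I’m assuming it resolves to a const char* overload of cudaFuncSetCacheConfig that looks the kernel up by name (possibly requiring the mangled C++ name, which may itself be part of my problem):

[codebox]// Quoted variant that compiles for me. I'm assuming this hits a
// const char* overload that looks the kernel up by name at runtime,
// which may need the mangled name for a C++ __global__ function.
cudaError_t err = cudaFuncSetCacheConfig("MyKernel", cudaFuncCachePreferL1);
if (err != cudaSuccess)
    printf("cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));[/codebox]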
Can someone point out what I’m doing wrong here?
Thanks.