Reconfiguring the cache / shared memory on Fermi: understanding the cudaFuncSetCacheConfig command

Hi,

I’ve just installed my GTX 480 and am trying to benchmark it. The application I’m using required about 124 registers per thread on the GT200 generation in double precision. Since NVIDIA decided to limit the number of registers on Fermi to only 64 per thread, the end result is that Fermi is slower than GT200 at double precision for my code :-(

This is using the default cache configuration of 48k / 16k for shared / L1. Switching to 16k / 48k may help, since the registers will be spilled to the larger L1 instead of out to L2 or device memory. Looking at the programming guide, to do this one uses the cudaFuncSetCacheConfig command. From the programming guide:

[codebox]// Device code

__global__ void MyKernel() { … }

// Host code

// Runtime API

// cudaFuncCachePreferShared: shared memory is 48 KB

// cudaFuncCachePreferL1: shared memory is 16 KB

// cudaFuncCachePreferNone: no preference

cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferShared);

// Driver API

// CU_FUNC_CACHE_PREFER_SHARED: shared memory is 48 KB

// CU_FUNC_CACHE_PREFER_L1: shared memory is 16 KB

// CU_FUNC_CACHE_PREFER_NONE: no preference

CUfunction myKernel;

cuFuncSetCacheConfig(myKernel, CU_FUNC_CACHE_PREFER_SHARED);[/codebox]

However, this does not compile as written; I have to surround MyKernel with quotes, i.e. “MyKernel”, to get it to compile. Without the quotes I get the following error:

[codebox]

error: no instance of overloaded function “cudaFuncSetCacheConfig” matches the argument list
            argument types are: (, cudaFuncCache)[/codebox]

Can someone point out what I’m doing wrong here?

Thanks.

I take it no one on the forums has used this feature then? I thought the reconfigurable cache was one of the selling points of Fermi, so knowing how to successfully do this is something we should know.

Thanks.

Do you use the driver API?

I’ve played with it a little bit, but haven’t had the time to fully explore the new cache and update my applications yet. The problem with having a dozen things to work on all at once…

I also don’t get the error that you do:

# test.cu
__global__ void MyKernel()
	{
	}

int main()
	{
	cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferShared);
	return 0;
	}

$ nvcc -c test.cu
$

This is with CUDA 3.0 on an iMac, but I’ve done the same with CUDA 3.0 on Linux at home. The setting definitely works; the few tests I have run do show performance differences with different cache preference settings. It even works with templated kernels, you just have to list the full template instantiation in the call to cudaFuncSetCacheConfig, like this:

cudaFuncSetCacheConfig(MyKernel<int, float, 5>, cudaFuncCachePreferShared);
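For reference, here is roughly what a matching templated kernel and the host-side call might look like (the parameter list and body are invented for the example):

[codebox]#include <cuda_runtime.h>

// Invented templated kernel, just to have a complete instantiation to point at.
template <typename T, typename U, int N>
__global__ void MyKernel(T *out)
	{
	out[threadIdx.x] = static_cast<T>(N);
	}

void setPreference()
	{
	// The cache preference is set per instantiation, not for the template itself.
	cudaFuncSetCacheConfig(MyKernel<int, float, 5>, cudaFuncCachePreferShared);
	}[/codebox]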

Strange, this function is declared with a (const char*) parameter.

The kernel for which I am trying to do this uses polymorphism but not templates. I wonder if the polymorphism is causing my problems. This is unlikely though; I would have assumed that when using polymorphism, cache settings would be set uniformly between all kernels with a common name, but that you would still have the ability to change the cache settings.

The documentation for the const char* version states that you need to specify the fully mangled C++ name of the function you wish to change the setting for. That indicates you would not get the behavior you wish for.

I have no idea what magic the templated version that takes the function pointer does… Perhaps this is the cause of your difficulty. If you have two different functions with the same name, perhaps the function pointer cannot be resolved to either one of them.
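Just to illustrate how that could play out, here is a hypothetical sketch (not your actual code): with two overloads sharing one kernel name, the bare name can no longer be deduced to a single function pointer, which produces the same kind of “no instance of overloaded function” error you quoted. Casting to one specific signature resolves it:

[codebox]#include <cuda_runtime.h>

// Two made-up overloads sharing one name.
__global__ void MyKernel(int *p)   { if (p) p[0] = 0;    }
__global__ void MyKernel(float *p) { if (p) p[0] = 0.0f; }

int main()
{
    // Does not compile: the compiler cannot pick one overload of MyKernel,
    // so no version of cudaFuncSetCacheConfig matches the argument list.
    // cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1);

    // Compiles: the cast selects one overload explicitly, and the preference
    // applies to that overload only.
    cudaFuncSetCacheConfig((void (*)(int *))MyKernel, cudaFuncCachePreferL1);
    return 0;
}[/codebox]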

16384/124 ~= 132 – hmm… may not be great
8192/124 ~= 66 – not so great

I haven’t studied all the Fermi documentation, so the answer to the following may already be in there. Do you know if there is a way to reconfigure the cache for all kernels universally, rather than on a kernel-by-kernel basis? I would imagine I could set a default flag upon device initialisation that would apply to all kernels unless cudaFuncSetCacheConfig was called for a specific kernel. This seems like my best hope at the moment, given the trouble I’m having with my polymorphic kernels.

Tim stated in another thread that if you only set a single kernel to prefer L1, that setting will stick permanently. You could create a dummy empty kernel, set it to L1, and call it at the beginning of your program.
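Something like this minimal sketch, I imagine (the kernel name and launch configuration are just illustrative):

[codebox]#include <cuda_runtime.h>

// Empty kernel whose only job is to carry the cache preference.
__global__ void cachePrefDummy() {}

int main()
{
    // Ask for the 48 KB L1 / 16 KB shared split for the dummy kernel...
    cudaFuncSetCacheConfig(cachePrefDummy, cudaFuncCachePreferL1);

    // ...and run it once so the preference actually takes effect
    // (as noted further down, the dummy has to run after the call).
    cachePrefDummy<<<1, 1>>>();
    cudaThreadSynchronize();

    // If the setting really is persistent, later kernels in the program
    // should now run with the 48 KB L1 configuration.
    return 0;
}[/codebox]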

Btw, can I obtain the current status of the configuration?

Thanks for the suggestion. This works. It hasn’t helped my performance woes, but that’s another story…

Are you expecting to get better double precision performance with Fermi? Apparently, NVIDIA has chosen not to give us the extra double precision performance on GeForce products, only on the Tesla products. There is another thread floating around that talks about this. Wanted to make sure you knew before spending too much time trying to get your 480 to run better than your old 280.

The GTX480 has massively better DP throughput than the GTX285… almost double.

GTX285: 88.5 Gflops

GTX480: 168 Gflops

S2050: 515 Gflops

It improved the performance for double precision in my case.

The key is you need to run the dummy kernel after you set the cache preference.

I only get these errors:

error: identifier “cudaFuncCachePreferShared” is undefined

error: identifier “cudaFuncSetCacheConfig” is undefined

?

I have the same problem with cudaFuncCachePreferL1.

I define a kernel

__global__ void dummy(){}

and set

cudaFuncSetCacheConfig("dummy", cudaFuncCachePreferL1);

but at runtime the call returns the error

cudaErrorInvalidDeviceFunction

What is wrong with this code?

I have done some experiments with the cache configuration and found that two syntaxes work:

  • use the complete mangled name of your kernel as a C string

=> you have something like cudaFuncSetCacheConfig(“_Z11_dummy…”, cudaFuncCachePreferL1);

  • use directly the name of the kernel, but not as a C string

=> cudaFuncSetCacheConfig(dummy, cudaFuncCachePreferL1);

I also confirm that the cache configuration is persistent.
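To tie the two forms together, here is a small self-contained sketch with error checking added (the kernel and the mangled string are only illustrative; the exact mangled name depends on your kernel’s name and signature, e.g. “_Z5dummyv” for a parameterless __global__ void dummy()):

[codebox]#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummy() {}

int main()
{
    // Form 1: pass the kernel symbol itself.
    cudaError_t err = cudaFuncSetCacheConfig(dummy, cudaFuncCachePreferL1);
    printf("symbol form: %s\n", cudaGetErrorString(err));

    // Form 2: pass the fully mangled name as a C string
    // ("_Z5dummyv" is assumed here -- adjust it for your own kernel).
    err = cudaFuncSetCacheConfig("_Z5dummyv", cudaFuncCachePreferL1);
    printf("string form: %s\n", cudaGetErrorString(err));

    return 0;
}[/codebox]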

Thank you for the quick response! Your solution works fine…

Instead of reconfiguring shared memory to become L1 cache and catch more of the spilled vars, maybe you can declare some of your local vars as (tid-indexed arrays in) shared memory to explicitly place them in the on-chip memory.
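A rough sketch of that idea, with made-up names and a fixed block size:

[codebox]#include <cuda_runtime.h>

#define BLOCK_SIZE 128   // assumed block size for the example

__global__ void kernelWithManualPlacement(double *out, int n)
{
    // One slot per thread: this plays the role of a local variable but is
    // guaranteed to sit in on-chip shared memory instead of being spilled.
    __shared__ double acc[BLOCK_SIZE];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    acc[tid] = 0.0;
    for (int i = 0; i < 100; ++i)
        acc[tid] += i * 1e-3;   // use acc[tid] wherever the local would be used

    if (gid < n)
        out[gid] = acc[tid];
}[/codebox]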