In the code I’m running, I’m almost never using the 48 kB of shared memory available on my compute capability 2.0 card. I also checked that I benefit from having 48 kB of L1 cache rather than just 16 kB. However, as this is not the default setting, I have to call cudaFuncSetCacheConfig() before launching my kernels. My question is thus rather simple: what is the overhead of a call to this function? And is there a way to set the cache config once and for all for every subsequent kernel launch? Could this be done at compile time?
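For context, this is roughly what I'm doing now (a minimal sketch; `myKernel`, the grid size, and the buffer size are placeholders for my actual code):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for my real ones.
__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    // Request the 48 kB L1 / 16 kB shared split for this kernel
    // before launching it (default on 2.0 is the opposite split).
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    float *d_data;
    cudaMalloc(&d_data, 256 * sizeof(float));

    myKernel<<<1, 256>>>(d_data);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

Right now I repeat the cudaFuncSetCacheConfig() call for each kernel, which is what prompted the question about its cost.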
Thanks in advance for any advice.