What I’m trying to do is benchmark my GTX 480 in the 48/16 and 16/48 (L1/shared) configurations, but I can’t even find out how to make sure that the requested configuration is actually assigned to a kernel.
For example, before launching the kernel I compute the maximum number of threads per block using the information from the cudaDeviceProp structure. No matter which cache configuration I set, after filling in cudaDeviceProp with cudaGetDeviceProperties() I see that the sharedMemPerBlock member always equals 48K.
The SDK examples do not contain cudaFuncSetCacheConfig() calls either. Could anybody explain how to use this functionality the right way, please?
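To make the question concrete, here is a minimal sketch of the pattern I mean. The kernel is a dummy written only for illustration, and I am assuming the runtime API signatures as documented (cudaFuncSetCacheConfig taking the kernel symbol, cudaThreadSynchronize as the CUDA 3.x-era sync call):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dummy kernel for illustration: uses 16K of static shared memory.
__global__ void myKernel(float *out)
{
    __shared__ float buf[4096];            // 4096 * 4 bytes = 16K
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // On my GTX 480 this always reports 48K, whatever config I request:
    printf("sharedMemPerBlock = %u\n", (unsigned)prop.sharedMemPerBlock);

    // Request the 48K L1 / 16K shared split for this kernel.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    float *out = 0;
    cudaMalloc((void **)&out, 256 * sizeof(float));
    myKernel<<<1, 256>>>(out);
    cudaThreadSynchronize();
    cudaFree(out);
    return 0;
}
```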
I’ll show the execution speedups of a test run for different versions of my kernel (it makes lots of non-coalesced reads from global memory) on different hardware.
1. GT200, textures are not used: base point, 100% speed.
2. GT200, global memory reads are cached via textures (tex1Dfetch): +40% over (1).
3. GF100, same code as (2), cudaFuncSetCacheConfig() is not called at all: +60% over (2).
4. GF100, textures are no longer used, cudaFuncSetCacheConfig() is not called at all: +20% over (3).
5. GF100, same code as (4), cudaFuncSetCacheConfig(cudaFuncCachePreferL1): +0% over (4).
Clearly, the more flexible and advanced the caching mechanism, the faster my kernel runs (again, its performance is limited by the device’s ability to make non-coalesced reads from global memory). I simply can’t understand how a cache three times larger can give absolutely nothing, not even a single percent of speedup.
Even when I call cudaFuncSetCacheConfig(cudaFuncCachePreferL1), kernels that require more than 16K of shared memory still launch without errors. How is that possible if 48K should be dedicated to cache? I still have the strong impression that, for some reason, this option does not work in my case. The impression is reinforced by the absence of any code samples for this feature in the SDK: I can’t even check how it should actually be implemented, or on which data and with which kernel code this feature becomes beneficial.
I really hope to get some help on this, as I’m currently deciding whether GF100 is remarkably better than the GTX 295 for my tasks.
The fastest configuration is (4): “GF100, textures are no longer used, cudaFuncSetCacheConfig() is not called at all: +20% over (3)”. So my kernel works better WITHOUT textures than with them.
I know that 48K is the default. What I said is that after I have chosen the L1 preference (and thus should be able to work with only 16K of shared memory), kernels that require significantly more than 16K can still be launched without problems. That’s a bit strange if 48K is actually dedicated to cache and only 16K to shared memory.
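One way to see what is going on might be to query the kernel’s actual shared memory footprint with cudaFuncGetAttributes() and compare it against the configured split. A sketch, again with a placeholder kernel (the real one would be substituted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *out)       // placeholder for the real kernel
{
    __shared__ float buf[8192];            // 32K: deliberately over 16K
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}

int main()
{
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("static shared memory per block: %u bytes\n",
           (unsigned)attr.sharedSizeBytes);

    // If this exceeds 16K and the launch still succeeds, that would suggest
    // the runtime silently fell back to the 48K-shared configuration.
    float *out = 0;
    cudaMalloc((void **)&out, 256 * sizeof(float));
    myKernel<<<1, 256>>>(out);
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(out);
    return 0;
}
```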
I have tried to limit the amount of shared memory used by the kernel to 16K, to rule out the “free to choose a different configuration” behaviour… no effect. Even when none of the kernels exceeds the 16K shared memory limit, I see no effect from the potentially three times larger L1 cache.