What I’m trying to do is benchmark my GTX 480 in the 48/16 and 16/48 (L1/shared) configurations, but I can’t even find out how to make sure that the requested configuration is actually assigned to a kernel.
For example, before launching the kernel I compute the maximum number of threads per block using the information from the cudaDeviceProp structure. No matter which cache configuration I set, after filling in cudaDeviceProp with cudaGetDeviceProperties() I see that the sharedMemPerBlock member always equals 48K.
The SDK examples do not contain cudaFuncSetCacheConfig() calls either. Could anybody explain how to use this functionality the right way, please?
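To make the question concrete, here is a minimal sketch of the pattern I mean. The kernel is a dummy written only for illustration, and I am assuming the runtime API signatures as documented (cudaFuncSetCacheConfig taking the kernel symbol, cudaThreadSynchronize as the CUDA 3.x-era sync call):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dummy kernel for illustration: uses 16K of static shared memory.
__global__ void myKernel(float *out)
{
    __shared__ float buf[4096];            // 4096 * 4 bytes = 16K
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // On my GTX 480 this always reports 48K, whatever config I request:
    printf("sharedMemPerBlock = %u\n", (unsigned)prop.sharedMemPerBlock);

    // Request the 48K L1 / 16K shared split for this kernel.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    float *out = 0;
    cudaMalloc((void **)&out, 256 * sizeof(float));
    myKernel<<<1, 256>>>(out);
    cudaThreadSynchronize();
    cudaFree(out);
    return 0;
}
```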
I’ll show the execution speedups of a test run for different versions of my kernel (it makes lots of non-coalesced reads from global memory) on different hardware.
1. GT200, textures are not used: base point, 100% speed.
2. GT200, global memory reads are cached via textures (tex1Dfetch): +40% over (1).
3. GF100, same code as (2), cudaFuncSetCacheConfig() is not called at all: +60% over (2).
4. GF100, textures are no longer used, cudaFuncSetCacheConfig() is not called at all: +20% over (3).
5. GF100, same code as (4), cudaFuncSetCacheConfig(cudaFuncCachePreferL1): +0% over (4).
Clearly, the more flexible and advanced the caching mechanism, the faster my kernel runs (again, its performance is limited by the device’s ability to make non-coalesced reads from global memory). I simply can’t understand how a cache three times larger can give absolutely nothing, not even a single percent of speedup.
Even when I call cudaFuncSetCacheConfig(cudaFuncCachePreferL1), kernels that require more than 16K of shared memory still launch without errors. How is that possible if 48K should be dedicated to cache? I still have the strong impression that, for some reason, this option does not work in my case. The impression is reinforced by the absence of any code samples for this feature in the SDK: I can’t even check how it should actually be implemented, or on which data and with which kernel code this feature becomes beneficial.
I really hope to get some help on this, as I’m currently deciding whether GF100 is remarkably better than the GTX 295 for my tasks.
The fastest configuration is (4): “GF100, textures are no longer used, cudaFuncSetCacheConfig() is not called at all: +20% over (3)”. So my kernel works better WITHOUT textures than with them.
I know that 48K is the default. What I said is that after I have chosen the L1 preference (and thus should be able to work with only 16K of shared memory), kernels that require significantly more than 16K can still be launched without problems. That’s a bit strange if 48K is actually dedicated to cache and only 16K to shared memory.
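One way to see what is going on might be to query the kernel’s actual shared memory footprint with cudaFuncGetAttributes() and compare it against the configured split. A sketch, again with a placeholder kernel (the real one would be substituted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *out)       // placeholder for the real kernel
{
    __shared__ float buf[8192];            // 32K: deliberately over 16K
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}

int main()
{
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("static shared memory per block: %u bytes\n",
           (unsigned)attr.sharedSizeBytes);

    // If this exceeds 16K and the launch still succeeds, that would suggest
    // the runtime silently fell back to the 48K-shared configuration.
    float *out = 0;
    cudaMalloc((void **)&out, 256 * sizeof(float));
    myKernel<<<1, 256>>>(out);
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(out);
    return 0;
}
```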
I have tried to limit the amount of shared memory used by the kernel to 16K, to rule out the “free to choose a different configuration” behaviour… no effect. Even when none of the kernels exceeds the 16K shared memory limit, I see no effect from the potentially three times larger L1 cache.