"cudaDeviceSetSharedMemConfig" and "cudaDeviceSetCacheConfig" have no effect

I’m trying to optimize shared memory usage in CUDA code on a GTX 1080. To do so, I want to change the shared memory bank width and the cache configuration by calling cudaDeviceSetSharedMemConfig and cudaDeviceSetCacheConfig.

Where should I call these functions? Currently, I call them in a host function that uses cudaLaunchCooperativeKernel to launch a __global__ function:

template< ... > bool launch_dualBlock(...){
    cudaLaunchCooperativeKernel( (void*)nv_wavenet_dualBlock<...>, grid, block ... );
    ...
}

The definition of nv_wavenet_dualBlock is:

template< ... > __global__ void nv_wavenet_dualBlock( ... ){
    nv_wavenet_dualBlock_A< ... >( ... );
    ...
}

Here nv_wavenet_dualBlock_A is a __device__ function.

However, the two calls seem to do nothing: when I print the shared memory and cache configuration after calling them, the printed values are unchanged. I also checked the return values of both calls, and both are cudaSuccess.
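For reference, a set-then-query check of the kind described above might look like the sketch below. It is not the original code: the helper names bankName, cacheName and setAndReport are hypothetical, and the requested values are only examples. The CUDA documentation notes that these setters do nothing on devices with a fixed bank size or a fixed L1/shared memory split, so a cudaSuccess return with unchanged readback values is a plausible outcome on such hardware. Newer toolkits may also flag the shared memory config calls as deprecated.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: readable name for a shared memory bank setting.
static const char* bankName(cudaSharedMemConfig c) {
    switch (c) {
        case cudaSharedMemBankSizeFourByte:  return "4-byte";
        case cudaSharedMemBankSizeEightByte: return "8-byte";
        default:                             return "default";
    }
}

// Hypothetical helper: readable name for a cache preference.
static const char* cacheName(cudaFuncCache c) {
    switch (c) {
        case cudaFuncCachePreferShared: return "prefer shared";
        case cudaFuncCachePreferL1:     return "prefer L1";
        case cudaFuncCachePreferEqual:  return "prefer equal";
        default:                        return "no preference";
    }
}

// Requests the given device-wide settings, reads them back, and prints
// both. Returns the first error encountered, or cudaSuccess.
static cudaError_t setAndReport(cudaSharedMemConfig bank, cudaFuncCache cache) {
    cudaError_t err;
    if ((err = cudaDeviceSetSharedMemConfig(bank)) != cudaSuccess) return err;
    if ((err = cudaDeviceSetCacheConfig(cache)) != cudaSuccess) return err;

    cudaSharedMemConfig gotBank;
    cudaFuncCache gotCache;
    if ((err = cudaDeviceGetSharedMemConfig(&gotBank)) != cudaSuccess) return err;
    if ((err = cudaDeviceGetCacheConfig(&gotCache)) != cudaSuccess) return err;

    printf("bank:  requested %s, device reports %s\n", bankName(bank), bankName(gotBank));
    printf("cache: requested %s, device reports %s\n", cacheName(cache), cacheName(gotCache));
    return cudaSuccess;
}
```

Calling setAndReport(cudaSharedMemBankSizeEightByte, cudaFuncCachePreferShared) from the host before the kernel launch prints the requested and reported values side by side, which makes a silently ignored request easy to spot.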

I would really appreciate your help.

Instead of cudaDeviceSetSharedMemConfig, use a call like this:

cudaFuncSetSharedMemConfig(kernel_func_name, cudaSharedMemBankSizeEightByte);

This selects the 8-byte shared memory bank size for the specific kernels you name. That way only those kernels use 8-byte banks, while other kernels continue to use the default 4-byte banks.
Call it just before you launch the kernel(s). It is a runtime call that only says "for these kernels, the shared memory bank size is 8 bytes instead of 4", nothing else.
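As a rough illustration of the per-kernel approach, the sketch below sets the bank size and cache preference on one kernel right before launching it. The kernel scaleKernel and the wrapper configureAndLaunch are hypothetical stand-ins, not part of nv-wavenet, and the plain <<<...>>> launch is used only to keep the example short. On hardware with a fixed bank width the config calls are expected to succeed without changing anything.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: each thread stages one double through
// shared memory, scales it, and writes it back.
template <int N>
__global__ void scaleKernel(double* data, double factor) {
    __shared__ double tile[N];
    int i = threadIdx.x;
    if (i < N) {
        tile[i] = data[i] * factor;
        data[i] = tile[i];
    }
}

// Applies a per-kernel bank size and cache preference, then launches
// the kernel with one block of N threads and waits for it to finish.
template <int N>
cudaError_t configureAndLaunch(double* d_data, double factor) {
    auto kernel = scaleKernel<N>;

    cudaError_t err = cudaFuncSetSharedMemConfig((const void*)kernel,
                                                 cudaSharedMemBankSizeEightByte);
    if (err != cudaSuccess) return err;

    err = cudaFuncSetCacheConfig((const void*)kernel, cudaFuncCachePreferShared);
    if (err != cudaSuccess) return err;

    kernel<<<1, N>>>(d_data, factor);
    err = cudaGetLastError();          // catches launch-time errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();    // catches execution-time errors
}
```

Because the setting is attached to the kernel's function pointer, each template instantiation (for example scaleKernel<32> and scaleKernel<64>) would need its own call.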

If you are not storing double-precision (8-byte) data in shared memory that is configured for 8-byte banks, you gain nothing from the change and may end up with extra bank conflicts and a slower kernel.

8-byte bank mode is not available on Pascal devices such as the GTX 1080, so the setting has no effect there.
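One way to make that explicit in code is to check the compute capability before requesting 8-byte banks. The sketch below assumes that only compute capability 3.x (Kepler) has a configurable bank width, which matches the CUDA programming guide; the function names bankWidthIsConfigurable and requestEightByteBanks are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: bank width is configurable only on compute capability 3.x.
static bool bankWidthIsConfigurable(int ccMajor) {
    return ccMajor == 3;
}

// Requests 8-byte banks on the current device only if the hardware can
// honor the request; otherwise reports that the call was skipped.
static cudaError_t requestEightByteBanks() {
    int device = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    if (!bankWidthIsConfigurable(prop.major)) {
        printf("%s (compute capability %d.%d): bank width is fixed, skipping\n",
               prop.name, prop.major, prop.minor);
        return cudaSuccess;
    }
    return cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
}
```

A GTX 1080 reports compute capability 6.1, so on that card the function would take the skip branch.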

Good afternoon, Robert.

Thanks for this input, and since I was going to ask this anyway, I will quickly hijack the topic.
The reason I suggested this is that, even though the documentation says it has no effect on Pascal, when I profile a program that uses an 8-byte data type in shared memory, the kernel analysis reports the shared memory (ShMem) bank size as 4 bytes. When I set it to 8 bytes, the profiler reports the kernel as using 8 bytes.
The program runs fine and produces the correct result in both cases, and since I expect to run it on Kepler too, I assumed there is no harm in leaving the 8-byte call in.

So unless there is an adverse effect, is it fine to leave it in?