Shared Mem size and Shared vs L1 Cache memory

Hello,

Allocating anything over 16k of shared memory produces a compile error like this one:

ptxas error : Entry function ‘_Z12MyKernelPlPfS0_S0_iiilli’ uses too much shared data (0x88f8 bytes + 0x10 bytes system, 0x4000 max)

According to this error log, the maximum shared memory allowed is 16k, yet 48k is supposed to be allocated to shared memory by default. On top of that, I should be able to assign either 16k or 48k to shared memory using ‘cudaFuncSetCacheConfig’. The manual states that all compute capability 2.0 devices have the flexibility to choose this split. I’m running on a GTX 480, so the error is an all-out mystery to me.
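
For reference, here is a minimal sketch (not from my project, just an illustration) of how one could query what the card actually reports for shared memory per block via cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0, assuming a single-GPU setup
    printf("sharedMemPerBlock = %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}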

The following dummy kernel will produce said error if 4*(number of elements in ShrStuff) > 16k:

__global__ void MyKernel(float *Stuff, long ncntr) {
    __shared__ float ShrStuff[5000];

    unsigned short dqt, ii;
    for (ii = 0; ii < 10; ii++) {
        dqt = threadIdx.x * 10 + ii;
        if (dqt < 5000)
            ShrStuff[dqt] = threadIdx.x * 1.0f;
    }
    __syncthreads();

    long cntr = blockDim.x * blockIdx.x + threadIdx.x;
    if (cntr > ncntr - 1) return;

    Stuff[cntr] = cntr * .5f * ShrStuff[10];

} // End of kernel

//host code

long ncntr = 10000;

dim3 block(512);
dim3 grid(ceil(ncntr / float(block.x)));

cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferShared);
MyKernel<<< grid, block >>>(Stuff,ncntr);

Any ideas or theories would be helpful.
Thanks in advance, Joe.

It’s a compile-time error, so it has nothing to do with your actual hardware.

I suspect that you’re not compiling with -arch sm_20. Shared memory is limited to 16KB when you compile for compute 1.3 and earlier.
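
For example, something along the lines of (hypothetical file name):

nvcc -arch=sm_20 -o mykernel mykernel.cu

should lift the 16KB ceiling, since ptxas then targets Fermi.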

Ah, thank you sir. I threw in -arch=compute_20, as well as -code=sm_20, for good measure. Now it compiles! But… all my code compiled with those flags now runs 50% slower than before :/ Any ideas why that would happen? I suppose I’ll hit the nvcc guide for some detail, but it seems odd that this ‘improvement’ would cause such a slowdown.

The most likely answer is that the larger shared memory use per block means each SM can’t run as many blocks concurrently as before, so you have lower occupancy and wasted compute capacity.
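
Rough numbers as an illustration: at 5000 floats the kernel needs about 20 KB of shared memory per block, so with an SM configured for 48 KB of shared memory at most two blocks can be resident on that SM at a time, whereas a kernel staying under 16 KB could fit three or more (registers and thread limits permitting).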

Well, actually, what I meant is that the code where the L1 preference was set was running slower as well. The reason turned out to be that once I specified sm_20, I also had to explicitly specify the use of fast math. Without the specified -arch, fast math must have been on by default. So adding -use_fast_math got me back to the original performance. Good to know, I guess.
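
In case it helps anyone else, my understanding is that -use_fast_math just swaps in the reduced-precision intrinsics globally; here’s a minimal sketch (hypothetical kernel, not my actual code) of getting the same effect per call:

// Explicit intrinsics instead of the global -use_fast_math flag
__global__ void FastDemo(float *out, const float *in, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = __fdividef(__sinf(in[i]), in[i] + 1.0f);  // fast, reduced-precision sine and division
}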

Thanks again for your feedback, SP.