Shared Mem size and Shared vs L1 Cache memory

Hello,

Allocating anything over 16k of shared memory produces a compile error like this one:

ptxas error : Entry function ‘_Z12MyKernelPlPfS0_S0_iiilli’ uses too much shared data (0x88f8 bytes + 0x10 bytes system, 0x4000 max)

According to this error log, the maximum shared memory allowed is 16k, yet 48k is supposed to be allocated to shared memory by default. On top of that, I should be able to assign either 16k or 48k to shared memory using ‘cudaFuncSetCacheConfig’. The manual states that all compute capability 2.0 devices have the flexibility to choose this split. I’m running on a GTX 480, so the error is an all-out mystery to me.
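
For reference, here is a minimal sketch (not from my project, just an illustration) of how one could query what the card actually reports for shared memory per block via cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0, assuming a single-GPU setup
    printf("sharedMemPerBlock = %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}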

The following dummy kernel will produce said error if 4*(number of elements in ShrStuff) > 16k:

__global__ void MyKernel(float *Stuff, long ncntr) {
    __shared__ float ShrStuff[5000];

    unsigned short dqt, ii;
    for (ii = 0; ii < 10; ii++) {
        dqt = threadIdx.x * 10 + ii;
        if (dqt < 5000)
            ShrStuff[dqt] = threadIdx.x * 1.0f;
    }
    __syncthreads();

    long cntr = blockDim.x * blockIdx.x + threadIdx.x;
    if (cntr > ncntr - 1) return;

    Stuff[cntr] = cntr * .5f * ShrStuff[10];

} // End of kernel

//host code

long ncntr = 10000;

dim3 block(512);
dim3 grid(ceil(ncntr / float(block.x)));

cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferShared);
MyKernel<<< grid, block >>>(Stuff,ncntr);

Any ideas or theories would be helpful.
Thanks in advance, Joe.

It’s a compile-time error, so it has nothing to do with your actual hardware.

I suspect that you’re not compiling with -arch sm_20. Shared memory is limited to 16KB when you compile for compute 1.3 and earlier.
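
For example, something along the lines of (hypothetical file name):

nvcc -arch=sm_20 -o mykernel mykernel.cu

should lift the 16KB ceiling, since ptxas then targets Fermi.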

Ah, thank you sir. I threw in -arch=compute_20, as well as -code=sm_20, for good measure. Now it compiles! But… all my code compiled with those flags now runs 50% slower than before :/ Any ideas why that would happen? I suppose I’ll hit the nvcc guide for some detail, but it seems odd that this ‘improvement’ would cause such a slowdown.

The most likely answer is that the larger shared memory use per block means each SM can’t run as many blocks concurrently as before, so you have lower occupancy and wasted compute capacity.
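
Rough numbers as an illustration: at 5000 floats the kernel needs about 20 KB of shared memory per block, so with an SM configured for 48 KB of shared memory at most two blocks can be resident on that SM at a time, whereas a kernel staying under 16 KB could fit three or more (registers and thread limits permitting).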

Well, actually, what I meant is that the code where the L1 preference was set was running slower as well. The reason turned out to be that once I specified sm_20, I also had to explicitly specify the use of fast math. Without the specified -arch, fast math must have been on by default. So adding -use_fast_math got me back to the original performance. Good to know, I guess.
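
In case it helps anyone else, my understanding is that -use_fast_math just swaps in the reduced-precision intrinsics globally; here’s a minimal sketch (hypothetical kernel, not my actual code) of getting the same effect per call:

// Explicit intrinsics instead of the global -use_fast_math flag
__global__ void FastDemo(float *out, const float *in, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = __fdividef(__sinf(in[i]), in[i] + 1.0f);  // fast, reduced-precision sine and division
}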

Thanks again for your feedback, SP.