Strange ptxas error in shared memory

In my kernel I’ve allocated 4 shared memory locations of 32x32 elements, each one of 4 bytes (data type is float)

After launching the compiler, I receive back this error:

ptxas error  : Entry function '_Z13LU_updaterestPfii' uses too much shared data (0x1000 bytes + 0x301c bytes system, 0x4000 max)

But:

32324*4 = 16384 Bytes = 16 kB and shared memory is max 16kB..!!!

Why I receive this error?

My kernel is:

__global__ void LU_updaterest(float* a, int step, int W)

{

  int i, j;

  float acc;

int tx = threadIdx.x;

  int bx = blockIdx.x;

  int by = blockIdx.y;

  int idx = blockIdx.x*blockDim.x + blockIdx.y*blockDim.x*W + threadIdx.x;

__shared__ float as[block_size*block_size];

  __shared__ float ls[block_size*block_size];

  __shared__ float us[block_size*block_size];

  __shared__ float temp[block_size*block_size];

for(i=0; i<block_size; i++){	// copy into shared mem

	as[i*block_size + tx] = a[step*block_size*W + step*block_size + idx + i*W];

	ls[i*block_size + tx] = a[step*block_size*W + (step-1)*block_size + by*block_size*W + tx + i*W];

	us[i*block_size + tx] = a[(step-1)*block_size*W + step*block_size + bx*block_size + tx + i*W];

  }

// compute product

  for (i = 0; i < block_size; i++){	// i select row

	acc = 0;

	for (j = 0; j < block_size; j++) // j walk through products

	  acc += ls[i*block_size + j] * us[j*block_size + tx];

	temp[i*block_size + tx] = acc; // ls stores also the new computed as'

  }

// complete computation and write back results

  for(i=0; i<block_size; i++)

	a[step*block_size*W + step*block_size + idx + i*W] = as[i*block_size + tx] - temp[i*block_size + tx];

}

Thanks

All kernel parameters are stored in shared memory, and also variables like blockIdx, blockDim, gridDim. So you can not use the full 16kB.

I suspected it was something like this… :(

I remember kernel parameters are stored in constant memory…blockIdx,blockDim,gridDim should be some special registers available on SM…

Is this changed since 2.0? or is it the result interpreted from decuda output? (or maybe somewhere in the manual?)

Thanks!

I would agree with this.You can take an experiment with a device kernel function using no share memory ,through Visual Profiler tools you would find that “static shared memory allocate one block” is something else but not 0.

But what’s difference between “static shared memory” and “dynamic shared memory”? Confused! :wacko:

I’ve tryed it both in CUDA 1 and CUDA 2 and both in debug mode and release mode but the output is the same

this output is generated by compiler

I’ve searched inside the manual but I haven’t fount anything about this problem

But why? :blink:

I think it is what Riedijk said “All kernel parameters are stored in shared memory, and also variables like blockIdx, blockDim, gridDim. So you can not use the full 16kB.”.