ombra32
February 23, 2009, 1:07pm
1
In my kernel I’ve allocated four shared memory arrays of 32x32 elements, each element 4 bytes (the data type is float).
When I launch the compiler, I receive this error:
ptxas error : Entry function '_Z13LU_updaterestPfii' uses too much shared data (0x1000 bytes + 0x301c bytes system, 0x4000 max)
But:
4 * 32*32 * 4 bytes = 16384 bytes = 16 kB, and shared memory is max 16 kB!
Why do I receive this error?
My kernel is:
__global__ void LU_updaterest(float* a, int step, int W)
{
    int i, j;
    float acc;
    int tx = threadIdx.x;
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int idx = blockIdx.x*blockDim.x + blockIdx.y*blockDim.x*W + threadIdx.x;

    __shared__ float as[block_size*block_size];
    __shared__ float ls[block_size*block_size];
    __shared__ float us[block_size*block_size];
    __shared__ float temp[block_size*block_size];

    // copy into shared memory
    for (i = 0; i < block_size; i++) {
        as[i*block_size + tx] = a[step*block_size*W + step*block_size + idx + i*W];
        ls[i*block_size + tx] = a[step*block_size*W + (step-1)*block_size + by*block_size*W + tx + i*W];
        us[i*block_size + tx] = a[(step-1)*block_size*W + step*block_size + bx*block_size + tx + i*W];
    }
    __syncthreads(); // ls and us are read across columns below

    // compute the product ls * us
    for (i = 0; i < block_size; i++) {   // i selects the row
        acc = 0;
        for (j = 0; j < block_size; j++) // j walks through the products
            acc += ls[i*block_size + j] * us[j*block_size + tx];
        temp[i*block_size + tx] = acc;   // temp holds the computed product
    }

    // complete the computation and write back the results
    for (i = 0; i < block_size; i++)
        a[step*block_size*W + step*block_size + idx + i*W] = as[i*block_size + tx] - temp[i*block_size + tx];
}
Thanks
All kernel parameters are stored in shared memory, and also variables like blockIdx, blockDim, gridDim. So you can not use the full 16kB.
ombra32
February 23, 2009, 2:54pm
3
All kernel parameters are stored in shared memory, and also variables like blockIdx, blockDim, gridDim. So you can not use the full 16kB.
I suspected it was something like this… :(
All kernel parameters are stored in shared memory, and also variables like blockIdx, blockDim, gridDim. So you can not use the full 16kB.
I remember kernel parameters are stored in constant memory… blockIdx, blockDim, and gridDim should be special registers available on the SM…
Has this changed since 2.0? Or is this interpreted from decuda output? (Or is it stated somewhere in the manual?)
Thanks!
Austin
February 24, 2009, 8:52am
5
All kernel parameters are stored in shared memory, and also variables like blockIdx, blockDim, gridDim. So you can not use the full 16kB.
I would agree with this. You can run an experiment with a kernel that uses no shared memory: in the Visual Profiler you will find that the “static shared memory per block” value is still not 0.
But what’s the difference between “static shared memory” and “dynamic shared memory”? Confused! :wacko:
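To illustrate the distinction (a sketch, not ombra32’s actual kernel): static shared memory is declared inside the kernel with a compile-time size, so ptxas counts it in the “uses … shared data” total; dynamic shared memory is declared extern with empty brackets, and its size is supplied by the host at launch time as the third argument of the <<<>>> configuration.

```cuda
// Static: size fixed at compile time, counted by ptxas.
__global__ void static_smem_kernel(float *out)
{
    __shared__ float buf[32 * 32];   // 4096 bytes, known to the compiler
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}

// Dynamic: size unknown to the compiler; ptxas reports 0 bytes for it.
__global__ void dynamic_smem_kernel(float *out)
{
    extern __shared__ float dyn_buf[];
    dyn_buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = dyn_buf[threadIdx.x];
}

// Host side: the third <<<>>> argument is the dynamic size in bytes.
// dynamic_smem_kernel<<<grid, block, 32 * 32 * sizeof(float)>>>(d_out);
```

Both end up in the same physical shared memory; the difference is only whether the size is fixed at compile time or chosen per launch.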
ombra32
February 24, 2009, 9:05am
6
I’ve tried it both in CUDA 1 and CUDA 2, in both debug and release mode, but the output is the same.
This output is generated by the compiler.
I’ve searched the manual but I haven’t found anything about this problem.
Austin
February 24, 2009, 11:56am
8
I think it is what Riedijk said: “All kernel parameters are stored in shared memory, and also variables like blockIdx, blockDim, gridDim. So you can not use the full 16kB.”