Proper size for dynamic shared memory

Hello,

I wanted to ask whether I am sizing the dynamic shared memory properly. Running my code reports an illegal address error, and I have to use a large size, for example 500, for it to work.

typedef struct
{
	float X,Y;
	int Value;

} Points;

In the kernel:

int x = ( blockIdx.x * blockDim.x ) + threadIdx.x; 
int y = ( blockIdx.y * blockDim.y ) + threadIdx.y; 

extern __shared__ Points points[ ];
float distX , distY;

if ( x < blockDim.x * gridDim.x && y < blockDim.y * gridDim.y ) 
{
	for ( int i = 0; i < NbOfNodes; i++ )
	{
		//load data to shared memory
		points[ gridDim.x * threadIdx.y + threadIdx.x ].X 	    = X[ i ];
		points[ gridDim.x * threadIdx.y + threadIdx.x ].Y 	    = Y[ i ];
		points[ gridDim.x * threadIdx.y + threadIdx.x ].Value  = V[ i ];

		__syncthreads();

		//Calculate distances for all the points
		for ( int j = 0; j < NbOfNodes; j++ )
		{
			distX = points[ j ].X - x;
			distY = points[ j ].Y - y;
...

So, my shared memory size should be: ThreadsPerBlockX + ThreadsPerBlockY * BlocksPerGridX, right?

The illegal address is reported at the distX line.

In the calling function:

dim3 BlocksDim ( BlocksPerGridX , BlocksPerGridY );
dim3 ThreadsPerBlock ( ThreadsPerBlockX , ThreadsPerBlockY );

size_t DynamicSharedSize = ThreadsPerBlockX + ThreadsPerBlockY * BlocksPerGridX;

for ( int i = 0; i < NbOfNodes; i++ )
{
	//Call kernel
	mykernel<<< BlocksDim,ThreadsPerBlock, DynamicSharedSize * sizeof(Points) >>>(
...

Any ideas?

Thanks!

Shared memory is allocated per thread block, so this address computation using gridDim looks suspicious:

points[ gridDim.x * threadIdx.y + threadIdx.x ]
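To illustrate, here is a minimal sketch of block-local indexing, not your actual kernel. It assumes every thread needs to visit all nodes, so each block loads the nodes in tiles of blockDim.x * blockDim.y elements and the shared array only ever needs that many entries. The names tiledDistances, launchTiled, sharedBytes, out, width and height are hypothetical, and the sum of squared distances is only a placeholder for the part of the loop body that was left out of the post.

```cuda
#include <cuda_runtime.h>

typedef struct
{
	float X, Y;
	int Value;
} Points;

// Hypothetical kernel: the accumulation is a stand-in for the elided loop body.
__global__ void tiledDistances( const float *X, const float *Y, const int *V,
                                int NbOfNodes, float *out, int width, int height )
{
	extern __shared__ Points points[ ];

	int x = ( blockIdx.x * blockDim.x ) + threadIdx.x;
	int y = ( blockIdx.y * blockDim.y ) + threadIdx.y;

	// Block-local linear index: built from blockDim and threadIdx only.
	int tid  = threadIdx.y * blockDim.x + threadIdx.x;
	int tile = blockDim.x * blockDim.y;

	float acc = 0.0f;
	for ( int base = 0; base < NbOfNodes; base += tile )
	{
		// Each thread loads at most one node of the current tile.
		int i = base + tid;
		if ( i < NbOfNodes )
		{
			points[ tid ].X     = X[ i ];
			points[ tid ].Y     = Y[ i ];
			points[ tid ].Value = V[ i ];
		}
		__syncthreads();

		// Only the entries that were actually loaded are read.
		int count = min( tile, NbOfNodes - base );
		for ( int j = 0; j < count; j++ )
		{
			float distX = points[ j ].X - x;
			float distY = points[ j ].Y - y;
			acc += distX * distX + distY * distY;
		}
		__syncthreads(); // finish reading before the next tile overwrites
	}

	if ( x < width && y < height )
		out[ y * width + x ] = acc;
}

// One Points entry per thread in the block, independent of the grid size.
size_t sharedBytes( dim3 threads )
{
	return (size_t)threads.x * threads.y * sizeof(Points);
}

// Hypothetical host wrapper; all pointers are assumed to be device pointers.
cudaError_t launchTiled( const float *X, const float *Y, const int *V,
                         int NbOfNodes, float *out, int width, int height )
{
	dim3 threads( 16, 16 );
	dim3 blocks( ( width + threads.x - 1 ) / threads.x,
	             ( height + threads.y - 1 ) / threads.y );
	tiledDistances<<< blocks, threads, sharedBytes( threads ) >>>(
		X, Y, V, NbOfNodes, out, width, height );
	return cudaGetLastError();
}
```

With this layout the third launch parameter depends only on the block shape, so changing the number of blocks per grid does not change the shared memory size.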

Hmm… right! I missed that! Thank you!

So, I did that because I want to use 16 threads per block (in both the x and y dimensions) and 16 or 32 blocks per grid, depending on the number of elements.

The problem is that if I swap the threads-per-block and blocks-per-grid dimensions (i.e. use 32 threads and 16 blocks), I get worse performance than with the opposite configuration…

Hmm… This didn’t work, unfortunately…

What is the correct size that I have to use?

Any ideas?
Thanks!

I forgot to mention that the input data values (X and Y) are in the range 0 to 256.

Hmm. I changed it to:

points[ NbOfNodes * threadIdx.y + threadIdx.x ].X 	    = X[ i ];

and it seems to work.
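
One way to sanity-check an index formula like this is to work out the largest index it can produce and compare that with the allocation. The sketch below does the arithmetic for the formula above, assuming the read loop still goes over points[ j ] for j < NbOfNodes as in the first post. The helper names requiredPointsElems and fitsInSharedMemory are hypothetical.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

typedef struct
{
	float X, Y;
	int Value;
} Points;

// Hypothetical helper: number of Points entries the shared array must hold
// when the write index is NbOfNodes * threadIdx.y + threadIdx.x.
size_t requiredPointsElems( int NbOfNodes, dim3 threads )
{
	// Largest write index: threadIdx.y = blockDim.y - 1, threadIdx.x = blockDim.x - 1.
	size_t maxWrite  = (size_t)NbOfNodes * ( threads.y - 1 ) + ( threads.x - 1 );
	size_t forWrites = maxWrite + 1;
	// The read loop touches points[ 0 ] .. points[ NbOfNodes - 1 ].
	size_t forReads  = (size_t)NbOfNodes;
	return forWrites > forReads ? forWrites : forReads;
}

// Hypothetical helper: compares a request with the per-block limit that the
// device reports in cudaDeviceProp::sharedMemPerBlock.
bool fitsInSharedMemory( size_t bytes, int device = 0 )
{
	cudaDeviceProp prop;
	if ( cudaGetDeviceProperties( &prop, device ) != cudaSuccess )
		return false;
	return bytes <= prop.sharedMemPerBlock;
}
```

For example, with 16 x 16 threads per block and NbOfNodes = 100, the largest index is 100 * 15 + 15 = 1515, so the array needs 1516 entries. The requirement grows with NbOfNodes, so it is worth checking requiredPointsElems( ... ) * sizeof(Points) against the device limit before launching.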