Shared Memory extern vs "normal" Not the same behavior between dynamic shared memory and sta

nezix · November 25, 2010, 9:56pm

Hello,

I am running into a problem using shared memory on my GTX 470 (CUDA 3.2).

When I declare shared memory this way:

__shared__ float2 coor_shared[MAX_THREADS];

everything works great. But strangely I get NaN and very large numbers (as if I were accessing unallocated memory) when I use dynamically allocated shared memory:

extern __shared__ float2 coor_shared[];

launched with:

mykernel <<< num_threads/MAX_THREADS, MAX_THREADS,MAX_THREADS*(sizeof(unsigned int)+(sizeof(float)*3)) >>> (...);

Here is a part of my code:

__global__ void mykernel (unsigned int * rng_state,float2 *tabXY, float * sigma) {
	__shared__ unsigned int rng_shared[MAX_THREADS];
	__shared__ float2 coor_shared[MAX_THREADS];
	__shared__ float sigma_shared[MAX_THREADS];

	float it,n1,n2,rd;
	float2 nextXY;
	int idx=blockIdx.x * blockDim.x + threadIdx.x;

	rng_shared[threadIdx.x]=rng_state[idx];
	coor_shared[threadIdx.x]=tabXY[idx];
	sigma_shared[threadIdx.x]=sigma[idx];

	for(it=0;it<transGPU;it+=dt){
		//Euler integration
		nextXY.x=...
		nextXY.y=...
		coor_shared[threadIdx.x]=nextXY;
	}

	__syncthreads();

	tabXY[idx]=coor_shared[threadIdx.x];
}

I just changed __shared__ → extern __shared__ and [MAX_THREADS] → [].

Has anyone run into this kind of problem?

Thank you

tera · November 25, 2010, 10:41pm

You can only have a single extern __shared__ array.
If you need more, allocate one array large enough for all data and manually segment it into multiple non-overlapping regions. See Appendix B.2.3 of the Programming Guide for the technique to do this.
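
A minimal sketch of the segmentation described above, not the original poster's actual kernel. It relies on the fact that every extern __shared__ array in a kernel starts at the same address, so one buffer is declared and split by hand. The names segmented_kernel, smem and run_segmented_demo are hypothetical, the update step is a placeholder for the elided Euler integration, and a recent CUDA toolkit is assumed. The float2 region comes first so that the 8-byte alignment it needs is satisfied without padding.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One dynamic shared buffer, carved into three non-overlapping regions.
__global__ void segmented_kernel(const unsigned int *rng_state,
                                 float2 *tabXY,
                                 const float *sigma)
{
    extern __shared__ float2 smem[];  // the single extern array

    // Region layout: [float2 x blockDim.x][float x blockDim.x][uint x blockDim.x]
    float2       *coor_shared  = smem;
    float        *sigma_shared = reinterpret_cast<float *>(coor_shared + blockDim.x);
    unsigned int *rng_shared   = reinterpret_cast<unsigned int *>(sigma_shared + blockDim.x);

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    coor_shared[threadIdx.x]  = tabXY[idx];
    sigma_shared[threadIdx.x] = sigma[idx];
    rng_shared[threadIdx.x]   = rng_state[idx];
    __syncthreads();

    // Placeholder update (NOT the original integration step).
    float2 p = coor_shared[threadIdx.x];
    p.x += sigma_shared[threadIdx.x];
    p.y += static_cast<float>(rng_shared[threadIdx.x] & 1u);
    tabXY[idx] = p;
}

// Host-side check; n is a multiple of the block size, so no bounds test is needed.
bool run_segmented_demo()
{
    const int threads = 128, blocks = 2, n = threads * blocks;
    std::vector<unsigned int> h_rng(n);
    std::vector<float2> h_xy(n);
    std::vector<float> h_sigma(n, 0.5f);
    for (int i = 0; i < n; ++i) {
        h_rng[i] = static_cast<unsigned int>(i);
        h_xy[i]  = make_float2(static_cast<float>(i), 2.0f * i);
    }

    unsigned int *d_rng = nullptr;
    float2 *d_xy = nullptr;
    float *d_sigma = nullptr;
    cudaMalloc(&d_rng, n * sizeof(unsigned int));
    cudaMalloc(&d_xy, n * sizeof(float2));
    cudaMalloc(&d_sigma, n * sizeof(float));
    cudaMemcpy(d_rng, h_rng.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_xy, h_xy.data(), n * sizeof(float2), cudaMemcpyHostToDevice);
    cudaMemcpy(d_sigma, h_sigma.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // The third launch parameter must cover all three regions.
    size_t smem_bytes = threads * (sizeof(float2) + sizeof(float) + sizeof(unsigned int));
    segmented_kernel<<<blocks, threads, smem_bytes>>>(d_rng, d_xy, d_sigma);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_xy.data(), d_xy, n * sizeof(float2), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        ok = ok && h_xy[i].x == i + 0.5f && h_xy[i].y == 2.0f * i + (i & 1);

    cudaFree(d_rng);
    cudaFree(d_xy);
    cudaFree(d_sigma);
    return ok;
}
```

The byte count works out to 16 bytes per thread, the same total as the launch line in the question, so the launch size was already sufficient there; what changes is how the buffer is divided inside the kernel. Calling run_segmented_demo() from main is enough to try it.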

SPWorley · November 25, 2010, 10:45pm

How are you defining all three array pointers in the dynamic case? You need to offset them all from each other. Your quoted line of code only shows one definition, not all three.

Finally, I know it’s not your question, but why are you using shared memory at all? Your code snippet shows that every thread only accesses its own indices. It’s both simpler code-wise and faster execution-wise to keep them as per-thread locals, held in fast registers.
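
A rough sketch of the per-thread-locals suggestion, under the assumption that no thread ever reads another thread's slot. The names register_kernel, steps and run_register_demo are hypothetical, and the loop body is a placeholder for the elided Euler step, not the original code. With plain locals there is no shared memory to size at launch and no __syncthreads() is needed.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread keeps its own state in local variables (normally registers).
__global__ void register_kernel(const unsigned int *rng_state,
                                float2 *tabXY,
                                const float *sigma,
                                int steps)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    unsigned int rng = rng_state[idx];  // was rng_shared[threadIdx.x]
    float2 xy        = tabXY[idx];      // was coor_shared[threadIdx.x]
    float s          = sigma[idx];      // was sigma_shared[threadIdx.x]

    for (int it = 0; it < steps; ++it) {
        // Placeholder update (NOT the original integration step).
        xy.x += s;
        xy.y += static_cast<float>(rng & 1u);
    }

    tabXY[idx] = xy;
}

// Host-side check; n is a multiple of the block size, so no bounds test is needed.
bool run_register_demo()
{
    const int threads = 128, blocks = 2, n = threads * blocks, steps = 4;
    std::vector<unsigned int> h_rng(n);
    std::vector<float2> h_xy(n);
    std::vector<float> h_sigma(n, 0.25f);
    for (int i = 0; i < n; ++i) {
        h_rng[i] = static_cast<unsigned int>(i);
        h_xy[i]  = make_float2(static_cast<float>(i), 2.0f * i);
    }

    unsigned int *d_rng = nullptr;
    float2 *d_xy = nullptr;
    float *d_sigma = nullptr;
    cudaMalloc(&d_rng, n * sizeof(unsigned int));
    cudaMalloc(&d_xy, n * sizeof(float2));
    cudaMalloc(&d_sigma, n * sizeof(float));
    cudaMemcpy(d_rng, h_rng.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_xy, h_xy.data(), n * sizeof(float2), cudaMemcpyHostToDevice);
    cudaMemcpy(d_sigma, h_sigma.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // No third launch parameter: the kernel uses no dynamic shared memory.
    register_kernel<<<blocks, threads>>>(d_rng, d_xy, d_sigma, steps);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_xy.data(), d_xy, n * sizeof(float2), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        ok = ok && h_xy[i].x == i + 1.0f && h_xy[i].y == 2.0f * i + 4.0f * (i & 1);

    cudaFree(d_rng);
    cudaFree(d_xy);
    cudaFree(d_sigma);
    return ok;
}
```

Shared memory would become worthwhile again only if threads in a block needed to read each other's values, which the snippet in the question does not do.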

nezix · November 27, 2010, 8:56pm

OK, thank you, I missed that big point.
SPWorley, you are right: my shared memory is useless here.

Thank you again for the quick replies.
