Problem with copying from shared memory to global memory and back to the host

Hi,

When I try this:

double *G;                     // global memory
extern __shared__ double S[];  // shared memory

int tid=threadIdx.x;
int idx = blockDim.x * blockIdx.x + threadIdx.x;

for( int j = 0; j < M; j++ ){

	S[ tid + j * M ] = idx*M +j;

	G[ idx + j * M ] = S [ tid + j * M ];

}

and then copy G back to the host, everything is all right: the values form a line.

But when I split it into two loops like this:

for( int j = 0; j < M; j++ ){

	S[ tid + j * M ] = idx*M +j;

}

for( int j = 0; j < M; j++ ){

	G[ idx + j * M ] = S [ tid + j * M ];

}

all the data come back equal to 0, except for M elements every M elements, and then the same again starting from M*blockDim.x.

I would be very grateful for any help.

I tried to reconstruct your situation. Can you show your complete kernel and the corresponding configuration, please?

How big is your shared memory allocation?

The problem may be in how you index shared memory.

int tid=threadIdx.x;

int idx = blockDim.x * blockIdx.x + threadIdx.x;

for( int j = 0; j < M; j++ ){

	S[ tid + j * M ] = idx*M +j;

	G[ idx + j * M ] = S [ tid + j * M ];

}

If M is less than blockDim.x, then the threads will step on each other. If M is greater than blockDim.x, then you will have gaps in your shared memory, which by itself is not a problem, but you will need to allocate M*M, even though you might believe that M*blockDim.x is enough.
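To make the overlap concrete, here is a small host-side sketch (plain C++ that also compiles as part of a .cu file). The helper name count_collisions is made up for illustration and is not part of any CUDA API. It counts how many shared-memory slots would be written by more than one (tid, j) pair for a given stride:

```cuda
#include <vector>

// Hypothetical helper (not a CUDA API): emulates the index pattern
// S[tid + j * stride] for tid in [0, threads) and j in [0, M), and returns
// the number of slots that are written by more than one (tid, j) pair.
int count_collisions(int threads, int M, int stride)
{
    // Highest index touched is (threads - 1) + (M - 1) * stride.
    int size = (threads - 1) + (M - 1) * stride + 1;
    std::vector<int> writers(size, 0);

    for (int tid = 0; tid < threads; tid++)
        for (int j = 0; j < M; j++)
            ++writers[tid + j * stride];

    int collisions = 0;
    for (int w : writers)
        if (w > 1)
            ++collisions;
    return collisions;
}
```

For example, with 8 threads and M = 4, count_collisions(8, 4, 4) reports overlapping slots for the stride-M indexing, while count_collisions(8, 4, 8) (stride equal to the block size) reports none.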

The difference between your two cases is probably due to compiler optimization: in the first case the compiler does not actually fetch the value back from shared memory, but reuses the value it still holds in a register.

Perhaps you meant to write this instead?

int tid=threadIdx.x;

int idx = blockDim.x * blockIdx.x + threadIdx.x;

for( int j = 0; j < M; j++ ){

	S[ tid + j * blockDim.x ] = idx*M +j;

	G[ idx + j * M ] = S [ tid + j * blockDim.x ];

}

Thanks a lot for the help.

Compiling without optimization has no effect; it still does not work.

The core of the code is given below:

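__global__ void test( int M, int str, double *G, int BD )
{
	extern __shared__ double S[];

	int tid = threadIdx.x, it;
	int idx = blockDim.x * blockIdx.x + threadIdx.x;

	// Fill this thread's M slots in shared memory (stride BD between slots).
	for( int j = 0; j < M; j++ ){
		S[tid + j*BD] = BD;
	}

	// Copy them out to global memory (stride str between elements).
	for( int j = 0; j < M; j++ ){
		it = idx + j*str;
		G[it] = S[tid + j*BD];
	}
}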

and the call on the host:

test<<<BC,TC>>>(  M, str, G ,TC);

This invocation does not allocate any shared memory. See the 2.1 programming guide, page 23 for more information.

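Try this. The third launch parameter is the dynamic shared memory size in bytes; your kernel indexes S[tid+j*BD] with BD = TC, so it needs M*TC doubles:

test<<<BC,TC,M*TC*sizeof(double)>>>(  M, str, G ,TC);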

You will also need to compile with -arch sm_13 for doubles to work.
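For reference, here is a minimal sketch of the whole round trip with the shared memory size passed at launch. It uses the stride-blockDim.x indexing and the idx*M + j test values from earlier in the thread. The names roundtrip and run_roundtrip, the choice str = BC*TC, and the size of G are assumptions made for the example, since the thread does not show them. The allocation is M*TC doubles because the highest index touched is (TC - 1) + (M - 1)*TC, and it has to fit within the per-block shared memory limit of the device.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread writes M values into its own column of shared memory
// (stride blockDim.x), then copies them out to global memory (stride str).
__global__ void roundtrip(int M, int str, double *G)
{
    extern __shared__ double S[];  // sized by the third launch parameter

    int tid = threadIdx.x;
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    for (int j = 0; j < M; j++)
        S[tid + j * blockDim.x] = idx * M + j;

    // No __syncthreads() needed here: each thread reads only its own writes.
    for (int j = 0; j < M; j++)
        G[idx + j * str] = S[tid + j * blockDim.x];
}

// Hypothetical host wrapper: returns the number of mismatching elements,
// or -1 if any CUDA call fails.
int run_roundtrip(int BC, int TC, int M)
{
    int str = BC * TC;            // assumption: G holds M rows of BC*TC values
    size_t n = (size_t)M * str;

    double *dG = nullptr;
    if (cudaMalloc(&dG, n * sizeof(double)) != cudaSuccess)
        return -1;

    // Dynamic shared memory: M * TC doubles per block.
    size_t shmem = (size_t)M * TC * sizeof(double);
    roundtrip<<<BC, TC, shmem>>>(M, str, dG);

    // Launch errors (e.g. too much shared memory) only show up when checked.
    if (cudaGetLastError() != cudaSuccess ||
        cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(dG);
        return -1;
    }

    std::vector<double> hG(n);
    cudaError_t err = cudaMemcpy(hG.data(), dG, n * sizeof(double),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dG);
    if (err != cudaSuccess)
        return -1;

    int bad = 0;
    for (int idx = 0; idx < str; idx++)
        for (int j = 0; j < M; j++)
            if (hG[idx + (size_t)j * str] != (double)(idx * M + j))
                bad++;
    return bad;
}
```

Calling run_roundtrip(2, 64, 4) from main should return 0 when the launch succeeds and every element comes back as written; a return of -1 points to a failed CUDA call, which is worth checking before looking at the data.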