int tid=threadIdx.x;
int idx = blockDim.x * blockIdx.x + threadIdx.x;
for( int j = 0; j < M; j++ ){
S[ tid + j * M ] = idx*M +j;
G[ idx + j * M ] = S [ tid + j * M ];
}
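If M is less than blockDim.x, then the threads will step on each other: for example, thread M at j = 0 and thread 0 at j = 1 both write S[M]. If M is greater than blockDim.x, then you will have gaps in your shared memory. That by itself is not a problem, but the highest index touched is (blockDim.x - 1) + (M - 1)*M, so you will need to allocate close to M*M elements, even though you might believe that M*blockDim.x is enough.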
The difference between your two cases is probably due to compiler optimization: in the first case the compiler likely does not fetch the value from shared memory a second time, and instead reuses the value it already holds in a register.
Perhaps you meant to write this instead?
int tid=threadIdx.x;
int idx = blockDim.x * blockIdx.x + threadIdx.x;
for( int j = 0; j < M; j++ ){
S[ tid + j * blockDim.x ] = idx*M +j;
G[ idx + j * M ] = S [ tid + j * blockDim.x ];
}
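If you want to check the indexing on the host before launching anything, here is a minimal sketch that models the shared-memory writes of one block in plain C++. The names sharedIndexStats and IndexStats are made up for this illustration and are not part of CUDA; stride is whatever multiplies j (M in the first version, blockDim.x in the second).

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical helper type for this illustration only.
struct IndexStats {
    std::size_t writes;    // total writes issued by the block (blockDimX * M)
    std::size_t distinct;  // number of distinct S slots actually touched
    std::size_t required;  // highest index touched + 1, i.e. minimum size of S
};

// Enumerates S[tid + j * stride] for every thread of one block and every j,
// mirroring the loop in the kernel, and reports how the slots are used.
IndexStats sharedIndexStats(int blockDimX, int M, int stride) {
    assert(blockDimX > 0 && M > 0 && stride > 0);
    std::set<int> seen;
    int maxIndex = -1;
    for (int tid = 0; tid < blockDimX; ++tid) {
        for (int j = 0; j < M; ++j) {
            int i = tid + j * stride;
            seen.insert(i);
            if (i > maxIndex) maxIndex = i;
        }
    }
    IndexStats stats;
    stats.writes = static_cast<std::size_t>(blockDimX) * static_cast<std::size_t>(M);
    stats.distinct = seen.size();
    stats.required = static_cast<std::size_t>(maxIndex) + 1;
    return stats;
}
```

If distinct is smaller than writes, threads are overwriting each other; if required is larger than M*blockDim.x, an array of that size is too small. With blockDimX = 8, M = 4 and stride = M, the 32 writes land in only 20 slots. With blockDimX = 4, M = 8 and stride = M, there are no collisions, but S needs 60 elements rather than 32. With stride = blockDimX, both cases use exactly 32 slots.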