Copying data from global memory to shared memory by each thread

Hi,
Sorry for this newbie question but I couldn’t locate any place where this has been answered clearly. I need to have some arrays in the shared memory which are accessed by each thread executing in a block. As the host can’t directly load the arrays into the shared memory, should I have something like this at the very beginning of the kernel?

// gArray points to an array in the global memory passed as a pointer to the kernel
// sArray is the shared array
__global__ void testkernel(float *gArray)
{
    __shared__ float sArray[256];

    for (int i = 0; i < 256; i++)
    {
        sArray[i] = gArray[i];
    }

}

If I put this in the kernel, won't every thread in the block execute it, each one copying the same data into shared memory? I can't have each thread copy a different portion of the array, because I need the entire array to proceed with the computation. Since the array is read-only, I could perhaps use texture memory, and I could also use __syncthreads() to wait for all threads to finish copying different portions of the array. But is there any other way to load the data into shared memory once for all the threads?

Sorry if I’m missing something here.

Thanks
Shibdas

You need to load it manually, but you can share that load among the threads: successive threads load successive elements, and you can put this in a for loop. This also helps memory coalescing.
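
A minimal sketch of that loop, assuming a single block and the 256-element array from the question. The kernel name loadShared, the out buffer, the constant N and the host helper runLoadShared are made up for illustration and are not from the original post. Thread t copies elements t, t + blockDim.x, t + 2*blockDim.x and so on, so on every pass neighbouring threads read neighbouring addresses, which is the pattern that coalesces.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256  // array length used in the question

// Hypothetical kernel: the whole block cooperatively fills sArray.
__global__ void loadShared(const float *gArray, float *out)
{
    __shared__ float sArray[N];

    // Thread t copies elements t, t + blockDim.x, t + 2*blockDim.x, ...
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        sArray[i] = gArray[i];

    __syncthreads();  // past this point all of sArray is populated

    // Copy back out so the host can check the result (illustration only).
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        out[i] = sArray[i];
}

// Hypothetical host helper: returns the number of mismatching elements.
int runLoadShared(int threadsPerBlock)
{
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) {
        h_in[i] = (float)i;
        h_out[i] = -1.0f;  // sentinel, so a failed launch shows up as mismatches
    }

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    loadShared<<<1, threadsPerBlock>>>(d_in, d_out);  // one block only
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);

    int bad = 0;
    for (int i = 0; i < N; ++i)
        if (h_out[i] != h_in[i]) ++bad;
    return bad;
}

int main()
{
    printf("mismatches with 64 threads:  %d\n", runLoadShared(64));
    printf("mismatches with 256 threads: %d\n", runLoadShared(256));
    return 0;
}
```

The same loop works for any block size: with 256 or more threads it runs at most once per thread, and with fewer threads each one simply makes more passes.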

You might want to read parts 4 and 5 of the series of CUDA programming articles from Dr. Dobb's Journal:
http://www.ddj.com/architect/208401741

I found it useful when I was getting started.

Make each thread copy a chunk of your array.

__global__ void testkernel(float *gArray)
{
    __shared__ float sArray[256];

    if (threadIdx.x < 256)
    {
        sArray[threadIdx.x] = gArray[threadIdx.x];
    }

    __syncthreads(); // wait for every thread to copy its element

    ...
}

If you have fewer than 256 threads, each of them should copy several elements of your array. For example, with THREADS_NUM threads in the block and i being the thread's index:

...
sArray[i] = gArray[i];
sArray[i + THREADS_NUM] = gArray[i + THREADS_NUM];
...
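
To show that every thread still sees the whole array even though it copied only part of it, here is a small self-contained sketch under assumed values: THREADS_NUM is taken to be 128, i is threadIdx.x, and the kernel name sumAll, the out buffer, the host helper runSumAll and the per-thread sum are invented for the example. After __syncthreads() each thread reads all 256 elements, including the ones other threads copied.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ARRAY_SIZE  256
#define THREADS_NUM 128  // assumed block size; must equal ARRAY_SIZE / 2 here

// Hypothetical kernel: each thread copies two elements, then uses all of them.
__global__ void sumAll(const float *gArray, float *out)
{
    __shared__ float sArray[ARRAY_SIZE];

    int i = threadIdx.x;
    sArray[i] = gArray[i];
    sArray[i + THREADS_NUM] = gArray[i + THREADS_NUM];

    __syncthreads();  // wait until the whole array has been copied

    // Every thread now reads the entire shared array.
    float sum = 0.0f;
    for (int k = 0; k < ARRAY_SIZE; ++k)
        sum += sArray[k];
    out[i] = sum;
}

// Hypothetical host helper: returns how many threads produced a wrong sum.
int runSumAll()
{
    float h_in[ARRAY_SIZE], h_out[THREADS_NUM];
    float expected = 0.0f;
    for (int i = 0; i < ARRAY_SIZE; ++i) {
        h_in[i] = (float)i;
        expected += h_in[i];  // same summation order as the kernel
    }
    for (int i = 0; i < THREADS_NUM; ++i)
        h_out[i] = -1.0f;     // sentinel in case the launch fails

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in, ARRAY_SIZE * sizeof(float));
    cudaMalloc(&d_out, THREADS_NUM * sizeof(float));
    cudaMemcpy(d_in, h_in, ARRAY_SIZE * sizeof(float), cudaMemcpyHostToDevice);

    sumAll<<<1, THREADS_NUM>>>(d_in, d_out);  // one block of 128 threads
    cudaMemcpy(h_out, d_out, THREADS_NUM * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);

    int bad = 0;
    for (int i = 0; i < THREADS_NUM; ++i)
        if (h_out[i] != expected) ++bad;
    return bad;
}

int main()
{
    printf("threads with a wrong sum: %d\n", runSumAll());
    return 0;
}
```

Without the __syncthreads() call, a thread could start summing before the other threads have finished writing their elements, so the barrier is what makes the shared copy safe to read.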

Thanks for all the suggestions. They are really helpful.