Have each thread copy its part of the array from global memory into shared memory:
__global__ void testkernel(float *gArray)
{
  __shared__ float sArray[256];
  if (threadIdx.x < 256)
  {
    sArray[threadIdx.x] = gArray[threadIdx.x];
  }
  __syncthreads(); // wait for each thread to copy its element
  ...
}
If you have fewer than 256 threads, each of them should copy several elements of your array:
...
 sArray[i] = gArray[i];
 sArray[i + THREADS_NUM] = gArray[i + THREADS_NUM];
 ...
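
The unrolled copies above can be generalized with a strided loop, so the same code works for any block size. The following is only a minimal sketch under some assumptions: the names `stridedCopyKernel`, `gOut`, `runStridedCopy` and `ARRAY_SIZE` are made up for illustration, `THREADS_NUM` is arbitrarily set to 64, and the reversed copy-back exists purely so the host can check the result.

```cuda
#include <cuda_runtime.h>

#define ARRAY_SIZE 256   // size of the shared array, as in the kernel above
#define THREADS_NUM 64   // assumed block size, smaller than ARRAY_SIZE

// Thread t copies elements t, t + blockDim.x, t + 2 * blockDim.x, ...
__global__ void stridedCopyKernel(const float *gArray, float *gOut)
{
    __shared__ float sArray[ARRAY_SIZE];

    for (int i = threadIdx.x; i < ARRAY_SIZE; i += blockDim.x)
    {
        sArray[i] = gArray[i];
    }
    __syncthreads(); // wait until the whole shared array has been filled

    // Illustration only: write the array back reversed, so every thread
    // reads elements that were loaded by other threads.
    for (int i = threadIdx.x; i < ARRAY_SIZE; i += blockDim.x)
    {
        gOut[i] = sArray[ARRAY_SIZE - 1 - i];
    }
}

// Hypothetical host helper: returns the number of mismatching elements,
// or -1 if a CUDA call fails.
int runStridedCopy()
{
    float hIn[ARRAY_SIZE], hOut[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; ++i)
    {
        hIn[i] = (float)i;
    }

    float *dIn = nullptr;
    float *dOut = nullptr;
    if (cudaMalloc(&dIn, sizeof(hIn)) != cudaSuccess)
    {
        return -1;
    }
    if (cudaMalloc(&dOut, sizeof(hOut)) != cudaSuccess)
    {
        cudaFree(dIn);
        return -1;
    }

    cudaError_t err = cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
    {
        stridedCopyKernel<<<1, THREADS_NUM>>>(dIn, dOut);
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    }
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess)
    {
        return -1;
    }

    int mismatches = 0;
    for (int i = 0; i < ARRAY_SIZE; ++i)
    {
        if (hOut[i] != hIn[ARRAY_SIZE - 1 - i])
        {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Calling `runStridedCopy()` from `main` should return 0 on a working device. Because the loop strides by `blockDim.x`, neighbouring threads read neighbouring global addresses on each iteration, and the array size no longer has to be a multiple of the thread count.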