Efficiently loading data into shared memory

I have a question about loading data from global memory into shared memory. My launch configuration is defined as follows:
dim3 block(1024, 1);
dim3 grid((256 * 256) / block.x);

As you can see, the total number of threads in my kernel is 256*256 = 65536, split into 64 blocks of 1024 threads. Each of these threads needs to access the same region of memory, of size 640*4 = 2560 bytes (NumberOfElements * sizeof(int)). Since this access happens repeatedly, I want to load the data from global memory into shared memory. How can I efficiently make it available to all the threads, given that the lifetime of shared memory is tied to a block?
I ran the following code at the start of my kernel, before any processing, but the total run time ended up even higher than when reading directly from global memory. So I'm looking for another way.

int TID = threadIdx.y * blockDim.x + threadIdx.x;  // linear thread index within the block
if (TID < 640)
    SharedMemory[TID] = RfData[XXX];
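For reference, here is a minimal, self-contained sketch of the usual per-block load pattern, written out in full so it can be compiled and timed on its own. It is an assumption-laden illustration, not my real kernel: the names process, run_demo and kNumElements are hypothetical, the table is assumed to start at index 0 of RfData (my real index is the XXX above), and summing the table stands in for the actual processing. The point it shows is that each block has to fill its own copy of the table and then call __syncthreads() before any thread reads from it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kNumElements = 640;  // 640 ints = 2560 bytes, as in the question

// Hypothetical kernel: every block stages the same table in shared memory,
// then each thread sums it as a stand-in for the real processing.
__global__ void process(const int* __restrict__ rfData, int* out, int n)
{
    __shared__ int sharedMemory[kNumElements];

    // Strided cooperative load: consecutive threads read consecutive
    // addresses (coalesced), and the loop works for any block size.
    // Assumption: the table starts at rfData[0].
    for (int i = threadIdx.x; i < kNumElements; i += blockDim.x)
        sharedMemory[i] = rfData[i];

    // Barrier: no thread may read the table until all threads have written.
    __syncthreads();

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        int sum = 0;
        for (int i = 0; i < kNumElements; ++i)
            sum += sharedMemory[i];  // all threads of a warp read the same word
        out[gid] = sum;
    }
}

// Host driver: returns the number of threads that produced a wrong result.
int run_demo()
{
    const int n = 256 * 256;
    std::vector<int> hostTable(kNumElements);
    int expected = 0;
    for (int i = 0; i < kNumElements; ++i) { hostTable[i] = i; expected += i; }

    int *dTable = nullptr, *dOut = nullptr;
    cudaMalloc(&dTable, kNumElements * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dTable, hostTable.data(), kNumElements * sizeof(int),
               cudaMemcpyHostToDevice);

    dim3 block(1024, 1);
    dim3 grid(n / block.x);  // 64 blocks, same launch shape as in the question
    process<<<grid, block>>>(dTable, dOut, n);

    std::vector<int> hostOut(n);
    cudaMemcpy(hostOut.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dTable);
    cudaFree(dOut);

    int mismatches = 0;
    for (int v : hostOut) mismatches += (v != expected);
    return mismatches;
}

int main()
{
    int mismatches = run_demo();
    printf("mismatches: %d\n", mismatches);
    return mismatches != 0;
}
```

With this launch shape the table is copied 64 times in total, once per block, so the copy only pays off if each thread reads the table many times afterwards. If every thread touches each element only once or twice, the extra copy and barrier can plausibly cost more than reading global memory through the cache, which would match the slowdown I measured.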

Thanks in advance.