Transfer a one-dimensional array stored in row-major order from global memory to shared memory

Hi, I’m new to CUDA and I have a question (quite simple, I think): I need to transfer a one-dimensional float array, which contains an image stored row by row, from global memory to shared memory. So far I have written the code below, but I don’t think it is efficient because the copy is done by only one thread per block. How can I have all the threads in the block take part in the copy? The blocks are two-dimensional (32, 32) and the grid is made up of N blocks (with N depending on the size of the image).

__shared__ float s_template[4800];
if (threadIdx.x == 0) {
    for (int j = 0; j < Th; j++) {
        for (int t = 0; t < Tw; t++) {
            s_template[t + j * Tw] = T[t + j * Tw];
        }
    }
}

You would need to modify that example for a 2D threadblock, something like this:

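#define SSIZE 2592

__shared__ float TMshared[SSIZE];

  // linear index of this thread within the 2D block
  int lidx = threadIdx.x + blockDim.x*threadIdx.y;
  // block-stride loop: all threads of the block share the copy
  while (lidx < SSIZE){
    TMshared[lidx] = TM[lidx];
    lidx += blockDim.x*blockDim.y;
  }
  // wait until the whole copy is done before any thread reads TMshared
  __syncthreads();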


Each block would get a copy of the same data in its shared memory (both for my code and yours).
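
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the cooperative copy with a 32x32 block. The kernel name copyToShared, the host helper runCopyCheck, the blockSums output array, and the choice of 4 blocks filled with 1.0f are all assumptions made for illustration; they are not from the posts above. Each block stages the same SSIZE floats and then writes the sum of its shared copy so the host can verify it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define SSIZE 2592  // number of floats staged in shared memory (size used in the reply)

// Hypothetical kernel: every block copies the same SSIZE floats into its own
// shared memory, then thread 0 of the block writes the sum of that copy.
__global__ void copyToShared(const float *TM, float *blockSums)
{
    __shared__ float TMshared[SSIZE];

    // Flatten the 2D thread index into a linear index within the block.
    int lidx = threadIdx.x + blockDim.x * threadIdx.y;
    int nthreads = blockDim.x * blockDim.y;

    // Block-stride loop: consecutive threads read consecutive addresses.
    for (int i = lidx; i < SSIZE; i += nthreads)
        TMshared[i] = TM[i];

    // Barrier: no thread may read TMshared until the whole copy is done.
    __syncthreads();

    if (lidx == 0) {
        float sum = 0.0f;
        for (int i = 0; i < SSIZE; ++i)
            sum += TMshared[i];
        blockSums[blockIdx.x] = sum;
    }
}

// Hypothetical host-side check: returns true if every block saw the full array.
bool runCopyCheck()
{
    const int numBlocks = 4;               // assumed grid size for the test
    std::vector<float> hTM(SSIZE, 1.0f);   // assumed test data: all ones
    std::vector<float> hSums(numBlocks, 0.0f);

    float *dTM = nullptr, *dSums = nullptr;
    if (cudaMalloc(&dTM, SSIZE * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&dSums, numBlocks * sizeof(float)) != cudaSuccess) {
        cudaFree(dTM);
        return false;
    }
    cudaMemcpy(dTM, hTM.data(), SSIZE * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(32, 32);                    // 2D block, as in the question
    copyToShared<<<numBlocks, block>>>(dTM, dSums);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(hSums.data(), dSums, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dTM);
    cudaFree(dSums);

    // The sum of SSIZE ones is exactly representable in float.
    for (int b = 0; b < numBlocks; ++b)
        ok = ok && (hSums[b] == (float)SSIZE);
    return ok;
}

int main()
{
    bool ok = runCopyCheck();
    printf("%s\n", ok ? "copy verified" : "copy FAILED");
    return ok ? 0 : 1;
}
```

The linear index makes neighbouring threads read neighbouring global addresses, so the loads should coalesce. The __syncthreads() after the loop matters as soon as a thread reads an element that a different thread copied.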