Broadcast with SHFL

Hi,
I have a 128x128 matrix and 128 threads (4 warps), and each thread works on one row of the matrix, held in registers as rA[i]. I want to get rid of the shared-memory staging (the write through y[] at line #17 below) and instead, in iteration #i, broadcast row #i to the other threads (gtx > i).

  1. Can I move the whole row with one shuffle, or do I have to broadcast it element by element?

  2. If I remove y[N], which lives in shared memory, should I define a new register array in its place? (I put a sketch of what I mean after my code below.)

  3. Should srcLane be 0, or the number of the thread that owns the row? (See the small test right after these questions.)

  4. Or should I use SHFL within just one warp and reduce the matrix size to 32?
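To check my own understanding of questions 1 and 3, I wrote this tiny single-warp test (assuming the CUDA 9+ __shfl_sync; the kernel name broadcast_one, the full mask 0xffffffff and the test values are just made up by me). As far as I can tell, each call moves one double, and srcLane is the lane I read from, so for "broadcast row i" it would be i rather than 0:

#include <cstdio>

// Toy test, one warp: every lane should end up with lane 'src'´s value.
__global__ void broadcast_one(double *out, int src)
{
    const int lane = threadIdx.x;                  // 0..31
    double x = 10.0 * lane;                        // each lane owns a different value
    double y = __shfl_sync(0xffffffff, x, src);    // read x from lane 'src'
    out[lane] = y;                                 // every entry should be 10*src
}

int main()
{
    double h_out[32];
    double *d_out;
    cudaMalloc(&d_out, sizeof(h_out));
    broadcast_one<<<1, 32>>>(d_out, 5);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("lane 0: %f, lane 31: %f\n", h_out[0], h_out[31]);  // expect 50 and 50
    cudaFree(d_out);
    return 0;
}

Is that right, or is there a way to move all N values of a row in a single call?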

 1    const int tx  = threadIdx.x;
 2    const int gtx = blockIdx.x * blockDim.x + tx;
 3
 4    double *A = dA + step + step * lda;
 5    double rA[N];                        // this thread's row, kept in registers
 6    __shared__ double y[N];              // the shared-memory staging I want to drop
 7
 8    #pragma unroll
 9    for (int i = 0; i < N; i++)
10        rA[i] = A[gtx + i * lda];        // load row gtx into registers
11
12    #pragma unroll
13    for (int i = 0; i < N; i++) {
14        if (gtx == i) {
15            #pragma unroll
16            for (int j = 0; j < N; j++)
17                y[j] = rA[j];            // thread i publishes row i through shared memory
18        }
19        __syncthreads();
20    }
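And this is roughly how I imagine the loop at lines 12-20 would change if I drop y[] from shared memory and keep a per-thread copy in registers instead (question 2). The sketch below only works for a single warp (N = 32), which is why I'm asking question 4: as far as I understand, SHFL cannot reach lanes in a different warp, so with 4 warps (128 threads) I don't see how to broadcast across all of them without shared memory. Not tested, just my guess at the structure (kernel name and signature are mine; dA, lda, step as above):

// One-warp version of my kernel (N = 32), with y[] in registers instead of
// shared memory. Untested sketch.
#define N 32

__global__ void sweep_one_warp(double *dA, int lda, int step)
{
    const int tx = threadIdx.x;            // single warp, so tx is also the lane id

    double *A = dA + step + step * lda;
    double rA[N];
    double y[N];                           // was __shared__ double y[N]

    #pragma unroll
    for (int i = 0; i < N; i++)
        rA[i] = A[tx + i * lda];           // thread tx holds row tx

    #pragma unroll
    for (int i = 0; i < N; i++) {
        #pragma unroll
        for (int j = 0; j < N; j++)
            // every lane reads rA[j] from lane i, so all lanes now hold row i
            y[j] = __shfl_sync(0xffffffff, rA[j], i);

        // ... threads with tx > i would consume y[0..N-1] here ...
        // no __syncthreads(): __shfl_sync already synchronizes the lanes in the mask
    }
}

Does that look like the right direction, or am I misunderstanding how SHFL is meant to be used here?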