Hi,
I have a 128x128 matrix and 128 threads (4 warps), each working on one row (rA[i]).
I want to avoid using shared memory (line #17) and instead, in iteration #i, broadcast row #i to the other threads (gtx > i).
- Can I move the whole row at once, or do I have to do this element by element for each row?
- If I remove y[N], which is in shared memory, should I define a new register variable to replace it?
- Should the srcLane be 0, or equal to the index of the thread?
- Should I use SHFL within just one warp and reduce the size of the matrix to 32?
 1  const int tx = threadIdx.x;
 2  const int gtx = blockIdx.x * blockDim.x + tx;
 3
 4  double *A = dA + step + step * lda;
 5  double rA[N];
 6  __shared__ double y[N];
 7
 8  #pragma unroll
 9  for(int i = 0; i < N; i++)
10      rA[i] = A[gtx + i * lda];
11
12  #pragma unroll
13  for(int i = 0; i < N; i++){
14      if(gtx == i){
15          #pragma unroll
16          for(int j = 0; j < N; j++)
17              y[j] = rA[j];
18      }
19      __syncthreads();
20  }
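
To make the last question concrete, here is a minimal sketch of the single-warp variant, assuming N = 32 and a launch with exactly one warp. It relies on two properties of __shfl_sync: each call moves one value between lanes of the same warp, so a row has to be sent element by element, and srcLane is the lane that owns the data, which here is lane i in iteration i. The kernel name broadcast_rows, the host helper count_mismatches, the test matrix, and the running sum are hypothetical and only exist to make the result checkable; they are not part of the original code.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int N = 32;                        // assumption: one warp, so a 32x32 tile
constexpr unsigned FULL_MASK = 0xffffffffu;  // all 32 lanes take part in the shuffle

// Hypothetical demo kernel: lane r keeps row r of a column-major matrix in
// registers. In iteration i every lane reads row i from lane i via __shfl_sync,
// one element per call, and lanes with lane > i add it to a running sum.
__global__ void broadcast_rows(const double *dA, int lda, double *dOut)
{
    const int lane = threadIdx.x;            // launched as <<<1, 32>>>

    double rA[N];
    #pragma unroll
    for (int j = 0; j < N; j++)
        rA[j] = dA[lane + j * lda];

    double acc = 0.0;
    #pragma unroll
    for (int i = 0; i < N; i++) {
        #pragma unroll
        for (int j = 0; j < N; j++) {
            // Every lane executes the shuffle; only the use of y is conditional.
            const double y = __shfl_sync(FULL_MASK, rA[j], i);
            if (lane > i)
                acc += y;
        }
    }
    dOut[lane] = acc;
}

// Hypothetical host helper: runs the kernel and returns the number of lanes
// whose result differs from a CPU reference, or -1 on a CUDA error.
int count_mismatches()
{
    std::vector<double> hA(N * N), hOut(N, 0.0);
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            hA[r + c * N] = r * N + c;       // small integers, exact in double

    double *dA = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dA, N * N * sizeof(double)) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, N * sizeof(double)) != cudaSuccess) { cudaFree(dA); return -1; }
    cudaMemcpy(dA, hA.data(), N * N * sizeof(double), cudaMemcpyHostToDevice);

    broadcast_rows<<<1, N>>>(dA, N, dOut);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, N * sizeof(double),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int lane = 0; lane < N; lane++) {
        double ref = 0.0;                    // sum of all rows above this lane's row
        for (int i = 0; i < lane; i++)
            for (int c = 0; c < N; c++)
                ref += hA[i + c * N];
        if (hOut[lane] != ref) bad++;
    }
    return bad;
}
```

Calling count_mismatches() from main should return 0 if the broadcast works as intended. In this sketch the broadcast value is consumed immediately, so no per-thread y[N] copy is needed. With 128 threads the same shuffle would only reach the 32 lanes of the owning warp, so crossing warps would still need shared memory or some other mechanism.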