- I have the following kernel, which takes a two-dimensional array and adds the first n elements of each row, where n depends on which block the row belongs to.
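#define ROWS 64
#define THREADS 32
#define ENTRIES 64

// A: ROWS * ENTRIES values, B: one value per thread block (2 blocks), C: ROWS values
// (int element types are assumed here; the original signature did not state them)
__global__ void kernel(const int *A, const int *B, int *C)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    int iterations = B[blockIdx.x];                   // n for every row in this block
    int sum = 0;
    for (int i = 0; i < iterations; i++) {
        sum += A[tid + i * ROWS];                     // entry i of row tid
    }
    C[tid] = sum;
}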
The code works. I just want to know how the variable iterations is initialised: will each thread access global memory and load the value into a register? How would I do this if I wanted to use shared memory? Is there any penalty if more than one thread accesses the same global memory location (I mean in addition to the already poor latency)?
- Is there any way to address the registers present in a multiprocessor?
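For the shared-memory part of the first question, a minimal sketch of what such a variant might look like: one thread per block reads B[blockIdx.x] from global memory into a `__shared__` variable, and `__syncthreads()` makes it visible to the rest of the block before the loop. The names `kernelShared` and `countMismatches`, the `int` element types, and the test values placed in B are assumptions made for illustration, not part of the original code. Whether this is actually faster than letting every thread read B[blockIdx.x] directly depends on the hardware generation and would need to be measured.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ROWS 64
#define THREADS 32
#define ENTRIES 64
#define BLOCKS (ROWS / THREADS)   // 2 thread blocks, as in the question

// Variant of the kernel above: a single thread per block loads the
// iteration count, and the whole block then reads it from shared memory.
__global__ void kernelShared(const int *A, const int *B, int *C)
{
    __shared__ int iterations;            // one copy per thread block

    if (threadIdx.x == 0) {
        iterations = B[blockIdx.x];       // one global read per block
    }
    __syncthreads();                      // wait until the value is written

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    for (int i = 0; i < iterations; i++) {
        sum += A[tid + i * ROWS];         // entry i of row tid
    }
    C[tid] = sum;
}

// Hypothetical host-side check: runs the kernel and compares against a CPU
// reference. Returns the number of differing rows, or -1 on a CUDA error.
int countMismatches()
{
    static int hA[ROWS * ENTRIES];
    int hB[BLOCKS] = {16, 48};            // made-up per-block counts (<= ENTRIES)
    int hC[ROWS];

    for (int k = 0; k < ROWS * ENTRIES; k++) {
        hA[k] = k % 7;                    // arbitrary test data
    }

    int *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc((void **)&dA, sizeof(hA)) != cudaSuccess ||
        cudaMalloc((void **)&dB, sizeof(hB)) != cudaSuccess ||
        cudaMalloc((void **)&dC, sizeof(hC)) != cudaSuccess) {
        return -1;
    }
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    kernelShared<<<BLOCKS, THREADS>>>(dA, dB, dC);

    // This blocking copy also reports errors from the kernel launch.
    cudaError_t err = cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (err != cudaSuccess) {
        return -1;
    }

    int mismatches = 0;
    for (int row = 0; row < ROWS; row++) {
        int expected = 0;
        for (int i = 0; i < hB[row / THREADS]; i++) {
            expected += hA[row + i * ROWS];
        }
        if (hC[row] != expected) {
            mismatches++;
        }
    }
    return mismatches;
}

int main()
{
    printf("mismatching rows: %d\n", countMismatches());
    return 0;
}
```

The `__syncthreads()` call is the important part of the sketch: without it, threads other than thread 0 could enter the loop before `iterations` has been written.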
PS: Sorry for the double post. Mods, please delete the other thread.