Need some help with this code

  1. I have the following kernel, which takes a two-dimensional array and adds the first n elements of each row, where n depends on which block the row belongs to.
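#define ROWS 64
#define THREADS 32
#define ENTRIES 64

// A: ROWS * ENTRIES elements, B: one entry per thread block (2 blocks), C: ROWS elements
// (element types are assumed to be int)
__global__ void kernel(const int *A, const int *B, int *C)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row

    int iterations = B[blockIdx.x];  // number of entries to add for rows in this block
    int sum = 0;

    for (int i = 0; i < iterations; i++) {
        sum += A[tid + i * ROWS];  // entry i of row tid
    }

    C[tid] = sum;
}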

The code works. I just want to know how the variable iterations is initialised: will each thread access global memory and load the value into a register? How do I do this if I want to use shared memory? Is there any penalty if more than one thread accesses the same global memory location (I mean, in addition to the already poor latency)?

  2. Is there any way we can address the registers present in a multiprocessor?

PS: Sorry for the double post. Mods, please delete the other thread.

First of all, your code is missing bounds checks to make sure you're not reading or writing out of bounds.

Something along the lines of the following, where maxid is the number of valid rows (presumably ROWS here):
if (tid < maxid)
    C[tid] = sum;
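
To make the bounds check concrete, here is a minimal sketch of the kernel from the question with a guard added. It assumes the arrays hold int, and maxid is passed in as an extra kernel parameter; the names sumRowsGuarded and runGuardedExample are made up for illustration. The host function launches one block more than needed on purpose, so the guard is what keeps the extra threads from touching memory past the end of C.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define ROWS 64
#define THREADS 32
#define ENTRIES 64

// Kernel from the question plus a guard; maxid is a hypothetical extra parameter.
__global__ void sumRowsGuarded(const int *A, const int *B, int *C, int maxid)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= maxid) return;  // safe here: this kernel has no __syncthreads()

    int iterations = B[blockIdx.x];
    int sum = 0;
    for (int i = 0; i < iterations; i++)
        sum += A[tid + i * ROWS];
    C[tid] = sum;
}

// Hypothetical host-side check: returns the number of wrong rows, or -1 on a CUDA error.
int runGuardedExample()
{
    const int blocks = ROWS / THREADS + 1;   // 3 blocks, i.e. 96 threads for 64 rows
    std::vector<int> hA(ROWS * ENTRIES, 1);  // all ones, so each sum equals its iteration count
    std::vector<int> hB(blocks), hC(ROWS, -1);
    for (int b = 0; b < blocks; b++) hB[b] = 8 * (b + 1);

    int *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(int));
    cudaMalloc(&dB, hB.size() * sizeof(int));
    cudaMalloc(&dC, hC.size() * sizeof(int));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(int), cudaMemcpyHostToDevice);

    sumRowsGuarded<<<blocks, THREADS>>>(dA, dB, dC, ROWS);
    // The blocking copy also reports errors from the kernel launch above.
    cudaError_t err = cudaMemcpy(hC.data(), dC, hC.size() * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int row = 0; row < ROWS; row++)
        if (hC[row] != hB[row / THREADS]) wrong++;
    return wrong;
}
```

Calling runGuardedExample() from main should return 0 if every row got the expected sum. With exactly ROWS / THREADS blocks the guard never triggers, but it costs next to nothing and protects you as soon as the row count stops being a multiple of the block size.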

As written, every thread of every block will read from global memory to initialize the iterations variable.
You could/should use shared memory instead:
__shared__ int iterations;
if (threadIdx.x == 0)
    iterations = …;
__syncthreads();

Shared memory reads can be broadcast, so when all threads read ‘iterations’ at the same time, the value is broadcast to them without bank conflicts.
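
Putting the pieces together, a sketch of the whole kernel could look like the one below. Two assumptions on my side: the elided value is taken to be B[blockIdx.x], as in the question's code, and the bound is passed in as a maxid parameter in the spirit of the check suggested earlier. The names sumRowsShared and runSharedExample are made up. Note that the bounds check wraps the work instead of returning early, so every thread of the block still reaches __syncthreads().

```cuda
#include <vector>
#include <cuda_runtime.h>

#define ROWS 64
#define THREADS 32
#define ENTRIES 64

__global__ void sumRowsShared(const int *A, const int *B, int *C, int maxid)
{
    __shared__ int iterations;       // one copy per block
    if (threadIdx.x == 0)
        iterations = B[blockIdx.x];  // assumption: same value the original kernel reads
    __syncthreads();                 // all threads of the block must reach this barrier

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < maxid) {               // guard after the barrier, not before it
        int sum = 0;
        for (int i = 0; i < iterations; i++)
            sum += A[tid + i * ROWS];
        C[tid] = sum;
    }
}

// Hypothetical host-side check: returns the number of wrong rows, or -1 on a CUDA error.
int runSharedExample()
{
    const int blocks = ROWS / THREADS + 1;  // one spare block to exercise the guard
    std::vector<int> hA(ROWS * ENTRIES), hC(ROWS, -1);
    std::vector<int> hB = {16, 48, 64};     // iteration count per block
    for (int col = 0; col < ENTRIES; col++)
        for (int row = 0; row < ROWS; row++)
            hA[row + col * ROWS] = col;     // entry value equals its column index

    int *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(int));
    cudaMalloc(&dB, hB.size() * sizeof(int));
    cudaMalloc(&dC, hC.size() * sizeof(int));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(int), cudaMemcpyHostToDevice);

    sumRowsShared<<<blocks, THREADS>>>(dA, dB, dC, ROWS);
    cudaError_t err = cudaMemcpy(hC.data(), dC, hC.size() * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int row = 0; row < ROWS; row++) {
        int n = hB[row / THREADS];
        if (hC[row] != n * (n - 1) / 2) wrong++;  // 0 + 1 + ... + (n - 1)
    }
    return wrong;
}
```

Whether this is measurably faster than the plain global read depends on the hardware, so treat it as something to profile rather than a guaranteed win.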

For 2. dunno what you want exactly.