Need some help with this code

  1. I have the following kernel, which takes a two-dimensional array and adds the first n elements of each row, with n depending on which block the row belongs to.
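#define ROWS 64
#define THREADS 32
#define ENTRIES 64

// A holds ROWS * ENTRIES values, B holds one count per thread block (2 blocks), C holds ROWS sums.
// The element type int is assumed here to match `sum`; the original post omits the types.
__global__ void kernel(const int *A, const int *B, int *C) {
    // One thread per row
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Number of entries to add, the same for every row handled by this block
    int iterations = B[blockIdx.x];

    int sum = 0;
    for (int i = 0; i < iterations; i++) {
        sum += A[tid + i * ROWS];
    }

    C[tid] = sum;
}
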

The code works. I just want to know how the variable iterations is initialised: will each thread access global memory and load the value into a register? How would I do this if I want to use shared memory instead? Is there any penalty if more than one thread accesses the same global memory location (I mean in addition to the already poor latency)?

  2. Is there any way we can address the registers present in a multiprocessor?

PS: Sorry for the double post. Mods, please delete the other thread.

First of all, your code is missing some bounds checks to make sure you're not reading or writing out of bounds.

Something along the lines of
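A minimal sketch of such a check, not necessarily what was originally posted here. It assumes int data, the column-major indexing from the question, and a hypothetical kernel name `row_sum_checked` with two extra parameters `rows` and `entries`; the host code and its test values are illustrative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ROWS 64
#define THREADS 32
#define ENTRIES 64

// Hypothetical bounds-checked variant of the kernel in the question.
// The element type int is an assumption; the original omits the types.
__global__ void row_sum_checked(const int *A, const int *B, int *C,
                                int rows, int entries)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= rows) return;                 // no reads of A or writes to C past the last row

    int iterations = B[blockIdx.x];
    if (iterations > entries) iterations = entries;  // never read past the last entry

    int sum = 0;
    for (int i = 0; i < iterations; i++)
        sum += A[tid + i * rows];
    C[tid] = sum;
}

int main()
{
    const int blocks = ROWS / THREADS;       // 2 blocks of 32 threads
    static int hA[ROWS * ENTRIES];
    int hB[2] = {10, 100};                   // 100 is deliberately larger than ENTRIES
    int hC[ROWS];
    for (int i = 0; i < ROWS * ENTRIES; i++) hA[i] = 1;

    int *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dC, sizeof(hC));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    row_sum_checked<<<blocks, THREADS>>>(dA, dB, dC, ROWS, ENTRIES);

    // The blocking copy also surfaces errors from the kernel launch.
    cudaError_t err = cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // With all-ones input: rows in block 0 sum to 10, rows in block 1 are clamped to 64.
    printf("%d %d\n", hC[0], hC[ROWS - 1]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

With ROWS an exact multiple of THREADS the first guard never fires, but it keeps the kernel safe if the grid is later rounded up for a row count that is not.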

As written, every thread from every block will read from global memory to initialize the iterations variable.
You could/should use shared memory:
__shared__ int iterations;

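One thread per block has to load the value into shared memory, and the block has to synchronize with __syncthreads() before the other threads read it. Shared memory supports broadcast, so when all threads read 'iterations' at the same time, the single value is broadcast to them rather than being fetched once per thread.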
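The shared-memory approach described above can be sketched roughly as follows. This is an illustration under assumptions, not a tuned implementation: the element type int, the kernel name `row_sum_shared`, and the host-side test values are all hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ROWS 64
#define THREADS 32
#define ENTRIES 64

// Hypothetical shared-memory variant of the kernel in the question.
// The element type int is an assumption; the original omits the types.
__global__ void row_sum_shared(const int *A, const int *B, int *C)
{
    __shared__ int iterations;               // one copy per block

    if (threadIdx.x == 0)
        iterations = B[blockIdx.x];          // a single global read per block
    __syncthreads();                         // make the value visible to the whole block

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    for (int i = 0; i < iterations; i++)
        sum += A[tid + i * ROWS];
    C[tid] = sum;
}

int main()
{
    const int blocks = ROWS / THREADS;       // 2 blocks of 32 threads
    static int hA[ROWS * ENTRIES];
    int hB[2] = {10, 20};                    // entries to add per row, one count per block
    int hC[ROWS];
    for (int i = 0; i < ROWS * ENTRIES; i++) hA[i] = 1;

    int *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dC, sizeof(hC));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    row_sum_shared<<<blocks, THREADS>>>(dA, dB, dC);

    // The blocking copy also surfaces errors from the kernel launch.
    cudaError_t err = cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // With all-ones input: rows in block 0 sum to 10, rows in block 1 sum to 20.
    printf("%d %d\n", hC[0], hC[ROWS - 1]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Whether this is measurably faster than letting every thread read B[blockIdx.x] directly depends on the hardware, so it is worth timing both versions.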

For 2, I don't know what exactly you want.