Reduction code for arrays at random index locations

Hi friends,

We have the following CUDA kernel:

__global__ void runUP(double *r, int *n1, int *n2, int *n3, int *n4)
{
    double n_1, n_2, n_3, n_4;

    int tid = threadIdx.x + blockDim.x * gridDim.x;

    while (tid < 1000)
    {
        n_1 = n1[tid];
        n_2 = n2[tid];
        n_3 = n3[tid];
        n_4 = n4[tid];

        r[n_1] += (some code like a1+a2*a3… etc) //line 1
        r[n_2] += (some code like a2+a1*a3… etc) // line 2
        r[n_3] += (some code like a3+a4*a2… etc) //line 3
        r[n_4] += (some code like a4+a5*a1… etc) // line 4

        tid += 1;
    }

    __syncthreads();
}

The arrays n1 to n4 contain random values of type int.

The kernel is called from the main program as: runUP<<<1,1>>>(d_r,d_n1,d_n2,d_n3,d_n4);

The above kernel runs fine with 1 block and 1 thread, but when we increase the number of blocks and threads the code does not run…


Please help me; any suggestions are welcome…

Regards
Jashwantpreet Singh

You don't need the while loop or the tid += 1 increment if you launch multiple threads and blocks. Shouldn't each thread in each block do the calculation for only one tid?

Let's assume you have an array of length N and you want M threads in each block. Then you can launch the kernel with (N+M-1)/M blocks of M threads each (you write runUP<<<(N+M-1)/M,M>>>(…);). Each thread calculates its own tid and finds the answer for that ONE tid only; the other threads calculate the answers for the other tids. Because the last block may contain more threads than there are elements left, each thread should also check that its tid is less than N.
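To see the indexing and the block-count arithmetic in isolation, here is a minimal sketch. The kernel markIndex and the host helper countCovered are made-up names for illustration and are not part of the original code; each thread only writes its own index, so there is nothing to race on.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread handles exactly one index and marks it.
__global__ void markIndex(int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (tid >= n) return;        // surplus threads in the last block do nothing
    out[tid] = tid;
}

// Launches ceil(n / threadsPerBlock) blocks and returns how many of the n
// elements ended up holding their own index (n if the grid covers the array).
int countCovered(int n, int threadsPerBlock)
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemset(d_out, 0xFF, n * sizeof(int));  // every int becomes -1 ("not written")

    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    markIndex<<<blocks, threadsPerBlock>>>(d_out, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    int covered = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] == i) ++covered;
    return covered;
}
```

With n = 1000 and 512 threads per block this launches 2 blocks (1024 threads), and the 24 surplus threads return early at the bounds check.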

The kernel code will look like this:

#define N 1000000 // some number here.
#define M 512     // here, too.

__global__ void runUP(double *r, int *n1, int *n2, int *n3, int *n4)
{
    int n_1, n_2, n_3, n_4;  // used as array indices, so int rather than double

    // global index of this thread across all blocks
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid >= N) return;

    n_1 = n1[tid];
    n_2 = n2[tid];
    n_3 = n3[tid];
    n_4 = n4[tid];

    r[n_1] += (some code like a1+a2*a3... etc) //line 1
    r[n_2] += (some code like a2+a1*a3... etc) // line 2
    r[n_3] += (some code like a3+a4*a2... etc) //line 3
    r[n_4] += (some code like a4+a5*a1... etc) // line 4
}

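However, you might have a problem if the same number appears more than once in n1, n2, n3 or n4, so that the same element of r is written several times. The += is a read-modify-write that is not atomic, and you don't know in which order the blocks and threads will run, so two threads can read the same old value and one of the updates gets lost. Thus, you won't get a reliable answer. With one block and one thread, the sequence of operations was predefined.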
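One common way around this is to make each update atomic. Below is a minimal sketch, not a drop-in replacement: runUPAtomic, atomicAddDouble, sharedNodeSum and the contrib array are names made up for this example, and contrib[tid] merely stands in for the elided expressions on lines 1 to 4. atomicAdd on double is built in only for compute capability 6.0 and higher, so the sketch falls back to the atomicCAS loop from the CUDA programming guide on older targets.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical wrapper: native double atomicAdd where available, CAS loop otherwise.
__device__ double atomicAddDouble(double *address, double val)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 600
    return atomicAdd(address, val);
#else
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {  // retry until no other thread changed the value in between
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
#endif
}

// One thread per element; shared nodes are safe because every update is atomic.
__global__ void runUPAtomic(double *r, const int *n1, const int *n2,
                            const int *n3, const int *n4,
                            const double *contrib, int numElems)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numElems) return;
    double c = contrib[tid];  // assumption: placeholder for the real per-node expressions
    atomicAddDouble(&r[n1[tid]], c);
    atomicAddDouble(&r[n2[tid]], c);
    atomicAddDouble(&r[n3[tid]], c);
    atomicAddDouble(&r[n4[tid]], c);
}

// Worst-case test: every element lists node 0 four times and contributes 1.0.
// Returns r[0] after the kernel has run.
double sharedNodeSum(int numElems)
{
    std::vector<int> h_n(numElems, 0);
    std::vector<double> h_c(numElems, 1.0);
    int *d_n = nullptr;
    double *d_c = nullptr, *d_r = nullptr;
    cudaMalloc(&d_n, numElems * sizeof(int));
    cudaMalloc(&d_c, numElems * sizeof(double));
    cudaMalloc(&d_r, sizeof(double));
    cudaMemcpy(d_n, h_n.data(), numElems * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, h_c.data(), numElems * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemset(d_r, 0, sizeof(double));  // all-zero bytes represent 0.0

    int threads = 256;
    int blocks = (numElems + threads - 1) / threads;
    runUPAtomic<<<blocks, threads>>>(d_r, d_n, d_n, d_n, d_n, d_c, numElems);

    double result = 0.0;
    cudaMemcpy(&result, d_r, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_n); cudaFree(d_c); cudaFree(d_r);
    return result;
}
```

Atomics make sure no update is lost, but they do not fix the order in which the contributions are added. With general floating-point values the result can therefore still differ in the last digits from run to run, and heavily shared indices will serialize and slow the kernel down.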

Tsotne

I really do have the condition that:

However, you might have a problem if the same number in n1[], n2[], n3[] or n4[] is written several times.

What can be done about this?

We are running a finite element code in which the nodes of each element are taken, so when elements share nodes, the same number appears more than once in n1 to n4.

Please advise me on how to create a mapping from the n values to blocks

or

else