You don't need a while loop that adds 1 to tid if you launch multiple threads and blocks. Shouldn't each thread in each block do the calculation for only one tid?
Let's assume you have an array of length N and you want M threads in each block. Then you can launch the kernel with (N+M-1)/M blocks of M threads each (you write runUP<<<(N+M-1)/M,M>>>(…); ). Each thread calculates its tid and finds the answer for only ONE tid; the other threads calculate the answers for the other tids. You should also check that tid is less than N, because the last block may contain threads past the end of the array.
The kernel code will look like this:
#define N 1000000 // number of elements; some number here.
#define M 512     // threads per block; here, too.
__global__ void runUP(double *r, int *n1, int *n2, int *n3, int *n4)
{
    int n_1, n_2, n_3, n_4; // node indices into r, so int rather than double
    // Global index: position inside the block plus the block's offset in the grid.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= N) return; // spare threads in the last block do nothing
    n_1 = n1[tid];
    n_2 = n2[tid];
    n_3 = n3[tid];
    n_4 = n4[tid];
    r[n_1] += (some code like a1+a2*a3... etc); // line 1
    r[n_2] += (some code like a2+a1*a3... etc); // line 2
    r[n_3] += (some code like a3+a4*a2... etc); // line 3
    r[n_4] += (some code like a4+a5*a1... etc); // line 4
}
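To make the launch arithmetic concrete, here is a minimal host-side sketch. The helper blocksFor and the kernel fillIndex are hypothetical names used only for illustration, and the sizes are arbitrary; error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks of m threads needed to cover n items
// (integer ceiling of n / m).
int blocksFor(int n, int m) { return (n + m - 1) / m; }

// Hypothetical kernel: each thread records its own global index.
__global__ void fillIndex(int *out, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= n) return;   // the last block may have spare threads
    out[tid] = tid;
}

int main()
{
    const int n = 1000, m = 512;   // illustrative sizes only
    int *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));

    fillIndex<<<blocksFor(n, m), m>>>(d_out, n);

    std::vector<int> h(n);
    cudaMemcpy(h.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    printf("blocks=%d last=%d\n", blocksFor(n, m), h[n - 1]);
    return 0;
}
```

With n = 1000 and m = 512 the formula gives 2 blocks, i.e. 1024 threads, so the last 24 threads fail the bounds check and return without touching memory.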
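However, you might have a problem if the same number appears several times in n1, n2, n3 or n4, because several threads will then update the same element of r. You don't know in which order the blocks and threads will run, and += is a separate read, add and write, so concurrent updates can overwrite each other and you won't know the exact answer. If you had one block and one thread, the sequence of operations was predefined.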
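One common way to make such colliding updates safe is to accumulate with atomicAdd, so that each read-add-write on an element of r happens as a single step. The following is only a sketch under assumptions, not necessarily the best fit for the original code: the names atomicAddDouble, scatterAdd and runScatterAdd are hypothetical, the mesh is a made-up 1D chain with two nodes per element, and every contribution is simply 1.0. Native atomicAdd on double needs compute capability 6.0 or newer, so the wrapper falls back to a compare-and-swap loop on older devices.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical wrapper: native double atomicAdd on compute capability 6.0+,
// compare-and-swap loop on older devices.
__device__ double atomicAddDouble(double *addr, double val)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 600
    return atomicAdd(addr, val);
#else
    unsigned long long *p = (unsigned long long *)addr;
    unsigned long long old = *p, assumed;
    do {
        assumed = old;
        old = atomicCAS(p, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // retry if another thread changed the value
    return __longlong_as_double(old);
#endif
}

// One thread per element; each adds a placeholder contribution of 1.0
// to both of its nodes. Shared nodes are updated by more than one thread.
__global__ void scatterAdd(double *r, const int *n1, const int *n2, int numElems)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= numElems) return;
    atomicAddDouble(&r[n1[tid]], 1.0);
    atomicAddDouble(&r[n2[tid]], 1.0);
}

// Host helper (hypothetical): copies the index arrays to the device,
// runs the kernel and returns the accumulated r.
std::vector<double> runScatterAdd(const std::vector<int> &n1,
                                  const std::vector<int> &n2, int numNodes)
{
    const int numElems = (int)n1.size();
    const int threads = 256;
    const int blocks = (numElems + threads - 1) / threads;
    int *d_n1 = nullptr, *d_n2 = nullptr;
    double *d_r = nullptr;
    cudaMalloc(&d_n1, numElems * sizeof(int));
    cudaMalloc(&d_n2, numElems * sizeof(int));
    cudaMalloc(&d_r, numNodes * sizeof(double));
    cudaMemcpy(d_n1, n1.data(), numElems * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_n2, n2.data(), numElems * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_r, 0, numNodes * sizeof(double));   // all-zero bytes is 0.0

    scatterAdd<<<blocks, threads>>>(d_r, d_n1, d_n2, numElems);

    std::vector<double> r(numNodes);
    cudaMemcpy(r.data(), d_r, numNodes * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_n1); cudaFree(d_n2); cudaFree(d_r);
    return r;
}

int main()
{
    // 1D chain: element e joins nodes e and e+1, so interior nodes are shared.
    const int numElems = 1000, numNodes = numElems + 1;
    std::vector<int> n1(numElems), n2(numElems);
    for (int e = 0; e < numElems; ++e) { n1[e] = e; n2[e] = e + 1; }

    std::vector<double> r = runScatterAdd(n1, n2, numNodes);
    printf("r[0]=%.1f r[1]=%.1f r[%d]=%.1f\n", r[0], r[1], numNodes - 1, r[numNodes - 1]);
    return 0;
}
```

In this chain each interior node belongs to two elements and each end node to one, so no update should be lost if the interior entries come out as 2.0 and the two ends as 1.0. Note that atomics only prevent lost updates; the order of the additions still varies from run to run, so with real floating-point contributions the low-order bits of the result can differ between runs.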
However, you might have a problem if the same number in n1[], n2[], n3[] or n4[] is written several times.
What should we do about this?
We are running a finite element code in which the nodes of each element are read, and when elements share nodes, the same number appears more than once in n1 to n4.
Please advise me on how to create a mapping from the n values to blocks.