Good evening everyone

I’m terribly sorry for the naive question, but I’ve just started learning CUDA.

What I’m trying to do is a GPU-distributed scalar product between an array and a copy of itself shifted by some number of positions. Let me explain better: if I have an array x[n] = {x1, x2, x3, x4}, what I want is to calculate a second array a[n-1] this way:

a1 = x1*x1 + x2*x2 + x3*x3 + x4*x4
a2 = x1*x2 + x2*x3 + x3*x4
a3 = x1*x3 + x2*x4

and so on…

Now, the C code is the following:

for (i = 0; i < n_max; i++) {
    for (j = 0; j < (n_max - i); j++) {
        a[i] += x[j] * x[j + i];
    }
}

What I’m asking (given that I only want to distribute this section of the code) is: should I create a one-dimensional kernel that distributes the inner “for” loop, or is it better to parallelise the entire double for loop?
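For what it’s worth, this is the kind of one-dimensional kernel I had in mind for the first option: one thread per output element a[i], with each thread doing its inner sum serially. The names are mine and I haven’t tested or benchmarked this, so it’s only a sketch:

```cuda
// One thread per output element a[i]; each thread runs the inner loop serially.
// Launch with enough threads to cover n_max outputs, e.g.
//   shifted_dot<<<(n_max + 255) / 256, 256>>>(d_x, d_a, n_max);
// where d_x and d_a are device pointers allocated with cudaMalloc.
__global__ void shifted_dot(const double *x, double *a, int n_max)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_max) {
        double sum = 0.0;
        for (int j = 0; j < n_max - i; j++)
            sum += x[j] * x[j + i];
        a[i] = sum;
    }
}
```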

Again, I apologise if something is wrong with the question or if some of the assumptions I’m making are wrong…

Thanks everyone

Erik