Iteration control

Good evening everyone

I’m terribly sorry for the naive question, but I’ve just started learning CUDA.

What I’m trying to do is a GPU-distributed scalar product between an array and a copy of itself shifted by some number of positions. Let me explain better: if I have my array x[n] = {x1, x2, x3, x4}, what I want to do is calculate a second array a[n-1] this way:
a1 = x1*x1 + x2*x2 + x3*x3 + x4*x4
a2 = x1*x2 + x2*x3 + x3*x4
a3 = x1*x3 + x2*x4

and so on…

Now, the C code is the following:

for( i=0; i<n_max; i++){
  for( j=0; j<(n_max-i); j++){
    a[i] += x[j]*x[i+j];
  }
}

What I’m asking (given that I only want to distribute this section of the code) is: should I create a one-dimensional kernel to distribute the inner “for” loop, or is it better to parallelise the entire double for loop?

Again, I apologise if something about the question or the assumptions I’m making is wrong…
Thanks everyone
Erik

You might find this question/answer interesting:

http://stackoverflow.com/questions/27239835/parallelizing-a-for-loop-1d-naive-convolution-in-cuda

Even though your problem statement may not be technically a “convolution”, you might find that convolution-type methods or techniques are applicable.

In general, whether to parallelize one loop or both is a question that might require further analysis to answer. An important item for the analysis would be the expected length of your vector x, i.e. what is n?

If n is “small”, say less than 10000, you will almost certainly want to find a method that parallelizes both loops. If n is large, then you might be able to find an efficient technique that only parallelizes one loop.
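For the second case, a minimal sketch of the “parallelize only the outer loop” idea might look like this (untested; the kernel name and launch configuration are placeholders I chose):

// one thread computes one output element a[i]; the inner sum stays serial
__global__ void autocorr_outer(const float* x, float* a, unsigned int n_max) {
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n_max) {
    float sum = 0.0f;
    for (unsigned int j = 0; j < n_max - i; j++)
      sum += x[j] * x[i + j];
    a[i] = sum;   // no atomics needed: each thread owns its own a[i]
  }
}

// launch with one thread per output element
autocorr_outer<<<(n_max + 255)/256, 256>>>(x, a, n_max);

This only exposes about n_max-way parallelism, and the work per thread is unbalanced (the inner loop gets shorter as i grows), which is why it mainly makes sense when n is large.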

Note that convolutional problems, if they can be mapped to an equivalent FFT realization, may often witness a large speedup using a frequency domain approach to solution.
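As a very rough sketch of that frequency-domain idea (my own illustration using cuFFT, not tested, error checking omitted): zero-pad x to length 2*n, take a forward real-to-complex FFT, replace each bin with its squared magnitude, then take the inverse transform; the first n (scaled) outputs correspond to a[0..n-1].

#include <cufft.h>
#include <cuda_runtime.h>

// replace each spectrum bin X[k] with X[k]*conj(X[k]) = |X[k]|^2
__global__ void mag_squared(cufftComplex* X, unsigned int m) {
  unsigned int k = blockDim.x * blockIdx.x + threadIdx.x;
  if (k < m) {
    float re = X[k].x, im = X[k].y;
    X[k].x = re*re + im*im;
    X[k].y = 0.0f;
  }
}

// d_x_padded: device array of length 2*n, first n entries are x, rest zero
// d_result:   device array of length 2*n; d_result[i]/(2*n) gives a[i] for i < n
void autocorr_fft(float* d_x_padded, float* d_result, unsigned int n) {
  unsigned int len  = 2*n;
  unsigned int bins = len/2 + 1;            // R2C output size
  cufftComplex* d_X;
  cudaMalloc(&d_X, bins * sizeof(cufftComplex));
  cufftHandle fwd, inv;
  cufftPlan1d(&fwd, len, CUFFT_R2C, 1);
  cufftPlan1d(&inv, len, CUFFT_C2R, 1);
  cufftExecR2C(fwd, d_x_padded, d_X);
  mag_squared<<<(bins + 255)/256, 256>>>(d_X, bins);
  cufftExecC2R(inv, d_X, d_result);         // cuFFT is unnormalized: divide by len
  cufftDestroy(fwd);
  cufftDestroy(inv);
  cudaFree(d_X);
}

The zero-padding to 2*n is what turns the circular correlation computed by the FFT into the linear one you defined above.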

// naive, not tuned/optimized
__global__ void erik_convolution(const float* x, float* a, unsigned int n_max) {
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
  unsigned int j = blockDim.y * blockIdx.y + threadIdx.y;
  // not efficient: roughly half of the threads in the square grid do no work...
  if ( i < n_max && j < n_max - i ) {
    atomicAdd(a+i, x[j]*x[i+j]);
  }
}

float* a; // device memory (needs to be initialized to 0, e.g. with cudaMemset)
float* x; // device memory
...
dim3 block(16,16);
dim3 grid((n_max+15)/16, (n_max+15)/16);
erik_convolution<<<grid,block>>>(x, a, n_max);

n_max needs to be big enough (> 10K or so);
otherwise it is probably better to do this with multiple threads on the CPU.
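For completeness, a simple CPU sketch of that fallback (using OpenMP as one possible threading option; the function name is mine, compile with -fopenmp):

#include <omp.h>

// multi-threaded CPU version: one outer iteration per task;
// schedule(dynamic) helps because the inner loop gets shorter as i grows
void autocorr_cpu(const float* x, float* a, int n_max) {
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < n_max; i++) {
    float sum = 0.0f;
    for (int j = 0; j < n_max - i; j++)
      sum += x[j] * x[i + j];
    a[i] = sum;
  }
}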