Is there a dependency in the loop?

Hello,

I am wondering about this loop here:

for (int i = 0; i < nsteps; ++i)

{
x = (i + 0.5) * step;
sum = sum + 1.0 / (1.0 + x * x);
}

is there a dependency between x and sum because when I run the code it is generating tesla code and that means it runs in parallel, so what is exactly happiting.

Thank you in advance

Hi hisham.M,

is there a dependency between x and sum

Yes, but only in that x needs to be assigned before it can be use in the accumulation of sum. Though this wouldn’t cause a loop dependency that would prevent parallelization of the “i” loop.

The loop dependency is with sum, in that the previous value of sum needs to be computed before it’s used in the next loop iteration. However this is a special case called a ‘reduction’ where the order in which each iterations sum is accumulated does not matter. This allows the compiler to have each parallel thread perform a partial summation on just the sum values in the iterations it is computing. After the parallel loop is complete, the compiler will then insert another loop which takes the partial sums and reduce them to the final sum.

You don’t indicate what model you’re using to parallelize, but both OpenMP and OpenACC have “reduction” clauses which you can use to expose reduction operations.

-Mat

Example on using “reduction” with OpenACC:

% cat red.c
#include <stdio.h>
#include <stdlib.h>

int main() {

    double sum=0;
    double x, step;
    int nsteps;
    nsteps=1024;
    step = 0.2;

#pragma acc parallel loop reduction(+:sum)
    for (int i = 0; i < nsteps; ++i)
    {
        x = (i + 0.5) * step;
        sum = sum + 1.0 / (1.0 + x * x);
    }
    printf("SUM: %f\n",sum);
    exit(0);
}
% nvc -fast red.c; a.out
SUM: 7.829568
% nvc -fast red.c -acc -Minfo=accel; a.out
main:
     10, Generating Tesla code
         13, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating reduction(+:sum)
     10, Generating implicit copy(sum) [if not already present]
SUM: 7.829568

thank you for the explanation,
so the same thing is happening here right :

for (int i=1;i<n;i++)
{
x = x *i;
}

the compiler result was:

18, Generating Tesla code
20, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
18, Generating implicit copy(x) [if not already present]

which means it runs in parallel but now I am so confused, what is the difference here with data dependency?
** I am using OpenACC

what is the difference here with data dependency?

No difference other than you now have a product reduction as opposed as to a sum reduction. Same partial reduction can be used for both, just with a different operator.

The full list of available reductions can be found in section 2.5.14 of the OpenACC standard (https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.1-final.pdf)

The others are:
“min” Minimum
“max” Maximum
“&”, “|”, “^” Bitwise and, or, xor
“&&”, “||” Logical and, or

% cat prod.c
#include <stdio.h>
#include <stdlib.h>

int main() {

    double x;
    int nsteps;
    nsteps=16;
    x=1;
#pragma acc parallel loop reduction(*:x)
    for (int i = 1; i < nsteps; ++i)
    {
        x = x*i;
    }
    printf("PROD: %f\n",x);
    exit(0);
}
% nvc prod.c -acc -Minfo=accel; a.out
main:
      9, Generating Tesla code
         11, #pragma acc loop gang /* blockIdx.x */
             Generating reduction(*:x)
      9, Generating implicit copy(x) [if not already present]
PROD: 1307674368000.000000

so in product reduction, we actually reducing the multiplication and that makes it run in parallel right?
So, we could say the loop is parallelizable and that because the reduction happing
if yes, can I use an ACC order to explain to me the reduction happing here because I can’t understand it well either in summation or production?

the loop from inside will be like this:

1 * 2 * 6 * 24

and so on
so each result will be depending on the previous result, how does the reduction happing here?

As I mentioned before, each parallel thread will perform a partial reduction on the loop iterations that it executes. Then after the loop, the compiler inserts another parallel region to perform the final reduction of each thread’s partial reduction. Reductions will work on operations that are associative since the order in which the operations occur does not matter.

For example, let’s use your example 4 iterations of the loop using two parallel threads. Thread 1 will compute a partial reduction for iterations 1 and 2, Thread 2 will compute iterations 3 and 4.

Thread 1:
I1: 1 * 1 => 1
I2: 1 * 2 => 2

Thread 2:
I3: 1 * 3 => 3
I4: 3 * 4 => 12

T1’s partial reduction = 2
T2’s partial reduction = 12

Next the final reduction will be: 2 * 12 => 24

Let reorder this so Thread 1 now computes iterations 1 and 3, and Thread 2 computes 2 and 4

Thread 1:
I1: 1 * 1 => 1
I2: 1 * 3 => 3

Thread 2:
I3: 1 * 2 => 2
I4: 2 * 4 => 8

T1’s partial reduction = 3
T2’s partial reduction = 8

Next the final reduction will be: 3 * 8 => 24

Does this help you understand the basics of how parallel reductions work?

I did find this Wikipedia page which gives more of the theoretical background if you want more details: Reduction operator - Wikipedia