Parallelization of c++ code with OpenACC in PGI 20.7

  1. I compiled my code using this command:
    pgc++ -fast -acc -ta=tesla -Minfo=accel -o output code.cpp
    I used this simple code:
    #pragma acc parallel loop
    for (int i = 1; i < 100; i++)
    {
        a[i] = 10;
        y[i] = a[i-1];
    }
    I got this result:

main:
11, Generating Tesla code
15, #pragma acc loop gang, vector(99) /* blockIdx.x threadIdx.x */
11, Generating implicit copyin(a[:99]) [if not already present]
Generating implicit copyout(y[1:99]) [if not already present]
Generating implicit allocate(a[:100]) [if not already present]
Generating implicit copyout(a[1:99]) [if not already present]

  • I want to know: was it parallelized successfully, and how can I tell from the result in the terminal?
  • I am using (accel); is there another flag that shows more detail?
  • Is there any flag I can use to show the execution time? I want to compare and see the difference between running the code serially and in parallel.
  • In this statement: y[i] = a[i-1]; is there a loop-carried dependency or not?
  2. I compiled another code the same way:

#pragma acc kernels
for (int i = 1; i < 100; i++)
{
    x[i] = x[i-1];
}
and the result was as follows:
main:
11, Generating implicit allocate(x[:100]) [if not already present]
Generating implicit copyin(x[:99]) [if not already present]
Generating implicit copyout(x[1:99]) [if not already present]
15, Loop carried dependence of x prevents parallelization
Loop carried backward dependence of x prevents vectorization
Accelerator serial kernel generated
Generating Tesla code
15, #pragma acc loop seq

  • Is the code parallelized successfully or not?
  • I read some documents mentioning that when #pragma acc loop seq is used, the code can't run in parallel, but I didn't use it. Is the #pragma acc loop seq shown in the result a suggestion from OpenACC, or has kernels already applied this directive to the code?

Thanks in advance.

I want to know: was it parallelized successfully, and how can I tell from the result in the terminal?

It is getting parallelized, as seen by the output:

11, Generating Tesla code
15, #pragma acc loop gang, vector(99) /* blockIdx.x threadIdx.x */

However, the code has a backward dependency on “a” which will result in incorrect answers, so it shouldn’t be run in parallel. With the “parallel” directive, it’s the programmer’s responsibility to ensure the code is parallelizable. With “kernels”, the compiler’s loop dependency analysis would have caught this issue.
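For comparison, here’s a sketch of the same loop under “kernels” (assuming “a” and “y” are declared as in your example); the compiler’s analysis should then flag the dependence on “a” rather than parallelizing the loop:

#pragma acc kernels
for (int i = 1; i < 100; i++)
{
    a[i] = 10;
    y[i] = a[i-1];   // reads the a[i-1] written by the previous iteration
}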

I am using (accel); is there another flag that shows more detail?

What additional information are you looking for? The -Minfo=accel output is already very verbose, so we need to be prudent about what information is shown.

Is there any flag I can use to show the execution time? I want to compare and see the difference between running the code serially and in parallel.

You can set the environment variable “NV_ACC_TIME=1” to see a quick GPU profile. Though, it’s best to use a profiler such as Nsight-Systems.
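If you just want a rough wall-clock comparison from inside the program itself, plain std::chrono timing also works. A minimal sketch (the array and loop body are placeholders; build once with -acc and once without to compare, keeping in mind that for a loop this small the GPU time is dominated by launch and transfer overhead):

#include <chrono>
#include <cstdio>

int main() {
    static float a[100];
    auto t0 = std::chrono::steady_clock::now();
    #pragma acc parallel loop
    for (int i = 0; i < 100; i++)
        a[i] = 10;
    auto t1 = std::chrono::steady_clock::now();
    // Printing a[0] also keeps the loop from being optimized away.
    std::printf("loop time: %.3f ms (a[0]=%g)\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count(),
                a[0]);
    return 0;
}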

In this statement: y[i] = a[i-1]; is there a loop-carried dependency or not?

Yes, it’s a loop-carried dependency. Referencing a value of “a” that’s computed in the previous iteration of the loop requires the earlier iteration to run before the current one, thus creating a dependency.
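Unrolling the first few iterations makes the ordering requirement visible (assuming float arrays, since the declarations aren’t shown):

float a[100], y[100];
// Sequential semantics of the loop, unrolled (i = 1, 2, 3, ...):
a[1] = 10;  y[1] = a[0];   // y[1] reads whatever a[0] held before the loop
a[2] = 10;  y[2] = a[1];   // y[2] needs the a[1] written one iteration earlier
a[3] = 10;  y[3] = a[2];   // iteration i depends on iteration i-1 having run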

For 2)

Is the code parallelized successfully or not?

No, as seen in the output:

Accelerator serial kernel generated
Generating Tesla code
15, #pragma acc loop seq

Since you’re using “kernels”, the compiler was able to detect the loop dependency; hence it still offloaded the code to the GPU, but runs it serially.

Is the #pragma acc loop seq shown in the result a suggestion from OpenACC, or has kernels already applied this directive to the code?

Since you didn’t include any loop directives, it’s up to the compiler to schedule the loop. With “kernels”, the compiler must prove that a loop is parallelizable before applying parallel operations. Since it detected the loop dependency, it applies seq to the implicit loop directive.
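In other words, the compiler effectively treated your code as if the implicit directive had been written out explicitly (a sketch):

#pragma acc kernels
{
    #pragma acc loop seq   // compiler-chosen schedule: run the loop serially on the GPU
    for (int i = 1; i < 100; i++)
        x[i] = x[i-1];
}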

Hope this helps,
Mat

Thank you for your response Mat,
okay:

  1. Can Nsight-Systems work on Linux?

  2. for (int i = 1; i < 100; i++)
    {
        a[i] = 10;
        y[i] = a[i-1];
    }
    I want to understand OpenACC correctly:
    the a[i] = 10; statement could run in parallel without any issues if it were written alone, but does OpenACC run each statement separately (one by one) and parallelize the ones that can run in parallel, or must it run all the statements in each iteration together?

-Hisham

Can Nsight-Systems work on Linux?

Yes. Since I typically run it on remote systems, I’ll run it from the command line to gather the profile, then bring it back to my Windows laptop to view it. Though, the GUI works on Linux as well.

I want to understand OpenACC correctly,

To clarify, this isn’t an OpenACC-specific issue; rather, it’s an algorithm issue. You’d have the same problem no matter which parallel model is being used.

the a[i] = 10; statement could run in parallel without any issues if it were written alone, but does OpenACC run each statement separately (one by one) and parallelize the ones that can run in parallel, or must it run all the statements in each iteration together?

No; the loop body runs as a whole in each iteration, so the compiler won’t split the statements apart for you. Splitting the computation into two separate loops yourself would fix the issue, but it’s not something the compiler can fix for you. You’ll need to modify the code so that it can be safely parallelized.
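A minimal sketch of that two-loop split (assuming “a” and “y” are float arrays of length 100 allocated beforehand, with a[0] initialized):

// Loop 1: all writes to "a" complete before any y[i] is computed.
#pragma acc parallel loop
for (int i = 1; i < 100; i++)
    a[i] = 10;

// Loop 2: every a[i-1] is now final, so the iterations are independent.
#pragma acc parallel loop
for (int i = 1; i < 100; i++)
    y[i] = a[i-1];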

Continuing the discussion from Parallelization of c++ code with OpenACC in PGI 20.7:

Use loop fission to initialize, and then offload the computation. I assume this is just a minimal example; otherwise, the entire “a” usage can be eliminated.

Note that y[0] and y[1] are undefined, though: y[0] is never written (the loop starts at i = 1), and y[1] reads a[0], which the loop never initializes.
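And a sketch of the version with “a” eliminated entirely (a0 is a hypothetical name standing in for whatever a[0] held before the loop):

float y[100];
float a0 = 0;              // hypothetical: the pre-loop value of a[0]
y[1] = a0;                 // the only y[i] that sees a pre-loop value of "a"
#pragma acc parallel loop
for (int i = 2; i < 100; i++)
    y[i] = 10;             // a[i-1] would have been 10 for every i >= 2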