Shared memory is not correctly used in kernels block

When fixed size array is used inside the parallel loop with kernels directive,
it is not compiled as shared memory. As a results, the swap calculation fails and strange results is obtained in the simple program as follows;

#define DIM 3
#define N 10
const int W=3;

int main(int argc, char *argv){

double position[N][DIM];
for(int iP=0;iP<N;++iP){
position[iP][0]=iP/W;
position[iP][1]=iP%W;
position[iP][2]=0.0;
}

for(int iP=0;iP<N;++iP){
printf(“%d: %e %e %e\n”,iP,position[iP][0],position[iP][1],position[iP][2]);
}
printf(“\n”);

#pragma acc data copy(position[0:N][0:DIM])
{
#pragma acc kernels loop independent // compiler message: local memory used for previous (strange results)
// #pragma acc parallel loop independent // compiler message: sheared memory used for previous (sound results)
for(int iP=0;iP<N;++iP){
double previous[DIM];
#pragma acc loop seq
for(int iD=0;iD<DIM;++iD){
previous[iD]=position[iP][iD];
}
position[iP][0]=previous[1];
position[iP][1]=previous[2];
position[iP][2]=previous[0];
}
}

for(int iP=0;iP<N;++iP){
printf(“%d: %e %e %e\n”,iP,position[iP][0],position[iP][1],position[iP][2]);
}
printf(“\n”);

return 0;
}

The program was tested with the compiler versions of 20.9, 22.1 and 22.5.
Version 20.9 was sound, but 22.1 and 22.5 yielded the strange results.
Sorry for missing compiler information.

Hi MasahiroKondo64929,

The compiler will attempt to include gang private arrays in shared memory. Though in the kernels case, the outer loop is scheduled using gangs and vectors, hence the “previous” array is private to a vector and stored in local memory. For “parallel”, only a gang parallelism is used and why it’s placed in shared memory.

If you make the inner loop use “vector” instead of “seq”, or add “gang” to the outer loop schedule, then shared memory will be used.

Example:

% nvc -acc -Minfo=accel test.c
main:
     23, Generating copy(position[:][:]) [if not already present]
     26, Loop is parallelizable
         Generating NVIDIA GPU code
         26, #pragma acc loop gang /* blockIdx.x */
         29, #pragma acc loop vector(32) /* threadIdx.x */
     26, CUDA shared memory used for previous
     29, Loop is parallelizable

Now it may not make sense to parallelize the inner loop given the loop trip count is only 3, but hopefully this illustrates what’s going on.

-Mat

Dear MatColgrove

Thank you for the reply. I could understand what’s going on. But, as you pointed out, parallelizing the loop only for 3 element fixed array does not seem efficient. It seems that the only outer loop can be parallelized when the inner loop is manually expanded as;

#pragma acc kernels present(position[0:N][0:DIM])
#pragma acc loop independent
for(int iP=0;iP<N;++iP){
double previous0=position[iP][0];
double previous1=position[iP][1];
double previous2=position[iP][2];
position[iP][0]=previous1;
position[iP][1]=previous2;
position[iP][2]=previous0;
}

But, this is not convincing to me because such description had been acceptable for the older compiler version such as 20.9.
Is the manual expansion only the way for only parallelizing the outer loop in the current compilers, e.g. 22.1 and 22.5 ?

regards,

Masahiro Kondo

Sorry but I’m not clear on the question.

You can parallelize the inner loop in the first example, it’s just that given a warp has 32 threads, using only three of them with the remaining 29 idle, isn’t ideal.

The outer loop can be parallelized as well, again since it’s have a loop trip count of 10, it’s not big enough to fill a GPU. There’s no need to manually unroll the inner loop.

Dear Mat

Thank you, and sorry for confusing you. The code sample firstly shown on this topic is just prepared to figure out the situation where the calculation gets strange. So, this is a kind of test code. We are applying openACC for accelerating particle simulations. So, the “N”, which is number of particles, is not just 10, but is is very large and dynamically given. Sometimes, it’s close to 1 million. In such case, parallelizing inner loop wasting 29 (= 32-3) threads does not seem efficient. So, the way for only parallelizing outer loop is desired. Is this make sense?

regards,

Masahiro Kondo

Typically when the outer loop trip count is large and the inner loop trip count is very small, then yes, it’s best to only parallelize the outer loop. There are of course exceptions and it depends on the specifics of actual code, but that’s the typical recommendation.

My confusion was with your question “Is the manual expansion only the way for only parallelizing the outer loop in the current compilers,”. Manual expansion, what I would call unrolling, normally doesn’t effect whether the outer loop can be parallelized or not.

So, yes, parallelize the outer loop but no, manual expansion isn’t the only way to parallelize it. Though, I don’t have a complete picture of your code so unclear why you thought this might be case.

Dear Mat

Thank you for the reply.
With older compiler 20.9, the outer loop was parallelized using gang and vector(128). However, It seems that the current compilers 22.1 and 22.5 use only gang for outer loop parallelization when we do not explicitly apply unrolling. I’m wondering whether this is as efficient as the old version because it does not use vector parallelization.

When unrolling is done, the vector parallelization for outer loop is applied even with the current compilers 22.1 and 22.5 as well.

If you need to see the complete picture of my code, please visit
https://github.com/Masahiro-Kondo-AIST/MphExplicit.git

regards,

Masahiro kondo

I’m not seeing this in the original sample code you posted nor in the git repo code. Which loop in the git code do you see this behavior?

There was a change in behavior in how pointers are handled to be more in line with the OpenACC spec. They are now treated as scalars so you need to use triplet notation, i,e, “array[:size]”, for arrays. This is causing the code to get compilation errors in 22.5 since this isn’t used in your “declare create” statements, and several arrays need to be added to data regions. I didn’t fix this, but rather used managed memory (i.e. add -gpu=managed) so I could view the schedule.

Here’s the output from compiling the original sample code with 22.5 and 20.9. The schedules used are exactly the same.

% nvc -V22.5 -acc -Minfo=acc -w test1.c
main:
     20, Generating copy(position[:][:]) [if not already present]
     23, Loop is parallelizable
         Generating NVIDIA GPU code
         23, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
         26, #pragma acc loop seq
     23, Local memory used for previous
% nvc -V20.9 -acc -Minfo=acc -w test1.c
main:
     20, Generating copy(position[:][:]) [if not already present]
     23, Loop is parallelizable
         Generating Tesla code
         23, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
         26, #pragma acc loop seq
     23, Local memory used for previous

Dear Mat

The output from the compiler 22.5 was the same as what you’ve shown.
What I got is as follows:

pgc++ -acc -O3 -Minfo=accel -gpu=managed -c main.cpp
main:
28, Generating copy(position[:][:]) [if not already present]
31, Loop is parallelizable
Generating NVIDIA GPU code
31, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
34, #pragma acc loop seq
31, Local memory used for previous
pgc++ -acc -O3 -Minfo=accel -gpu=managed -o test main.o -lm

However, the program output was strange. Specifically, the swapping never occurred:

0: 0.000000e+00 0.000000e+00 0.000000e+00
1: 0.000000e+00 1.000000e+00 0.000000e+00
2: 0.000000e+00 2.000000e+00 0.000000e+00
3: 1.000000e+00 0.000000e+00 0.000000e+00
4: 1.000000e+00 1.000000e+00 0.000000e+00
5: 1.000000e+00 2.000000e+00 0.000000e+00
6: 2.000000e+00 0.000000e+00 0.000000e+00
7: 2.000000e+00 1.000000e+00 0.000000e+00
8: 2.000000e+00 2.000000e+00 0.000000e+00
9: 3.000000e+00 0.000000e+00 0.000000e+00

0: 0.000000e+00 0.000000e+00 0.000000e+00
1: 0.000000e+00 1.000000e+00 0.000000e+00
2: 0.000000e+00 2.000000e+00 0.000000e+00
3: 1.000000e+00 0.000000e+00 0.000000e+00
4: 1.000000e+00 1.000000e+00 0.000000e+00
5: 1.000000e+00 2.000000e+00 0.000000e+00
6: 2.000000e+00 0.000000e+00 0.000000e+00
7: 2.000000e+00 1.000000e+00 0.000000e+00
8: 2.000000e+00 2.000000e+00 0.000000e+00
9: 3.000000e+00 0.000000e+00 0.000000e+00

This is why I posted this topic.
I cannot say in which part in the git code shows the strange behavior because it’s too long for checking. So, I thought that we should confirm the basic behavior in the small program. Actually, I broke done the git code and come to the test code as posted in this topic. Firstly, I’d like to know how the outer loop can only be parallelized soundly using both gang and vectors via the current compilers.

Sorry, I could not clearly understand the change in pointer handling in OpenACC which you explained. I tried the option “-gpu=managed”, but the output from the compiler and the compiled program were the same as those without the option.

regards,

Masahiro Kondo

Thanks! Finally determined the difference. You’re using C++ (fails) while I was using C (works). Looks like a compiler code generation issue with how it’s accessing “previous”. I’ve filed a problem report, TPR #31948, and sent to engineering for investigation.

As a work-around, you can hoist the declaration of “previous” and then manually privatize it. Note “managed” only works with dynamically allocated data, so would be ineffective in this case.

Example:

% cat main.cpp
#include <stdio.h>
#include <stdlib.h>

#define DIM 3
#define N 10
const int W=3;

int main(int argc, char *argv){

double position[N][DIM];
for(int iP=0;iP<N;++iP){
position[iP][0]=iP/W;
position[iP][1]=iP%W;
position[iP][2]=0.0;
}

for(int iP=0;iP<N;++iP){
printf("%d: %e %e %e\n",iP,position[iP][0],position[iP][1],position[iP][2]);
}
printf("\n");
#ifdef WORKS
double previous[DIM];
#endif

#pragma acc data copy(position[0:N][0:DIM])
{
#ifdef WORKS
#pragma acc kernels loop independent private(previous[:DIM])
#else
#pragma acc kernels loop independent
#endif
for(int iP=0;iP<N;++iP){
#ifndef WORKS
double previous[DIM];
#endif
#pragma acc loop seq
for(int iD=0;iD<DIM;++iD){
previous[iD]=position[iP][iD];
}
position[iP][0]=previous[1];
position[iP][1]=previous[2];
position[iP][2]=previous[0];
}
}

for(int iP=0;iP<N;++iP){
printf("%d: %e %e %e\n",iP,position[iP][0],position[iP][1],position[iP][2]);
}
printf("\n");

return 0;
}
% nvc++ -acc -V22.5 main.cpp -Minfo=accel ; a.out
main:
     26, Generating copy(position[:][:]) [if not already present]
     32, Loop is parallelizable
         Generating NVIDIA GPU code
         32, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
         37, #pragma acc loop seq
     32, Local memory used for previous
0: 0.000000e+00 0.000000e+00 0.000000e+00
1: 0.000000e+00 1.000000e+00 0.000000e+00
2: 0.000000e+00 2.000000e+00 0.000000e+00
3: 1.000000e+00 0.000000e+00 0.000000e+00
4: 1.000000e+00 1.000000e+00 0.000000e+00
5: 1.000000e+00 2.000000e+00 0.000000e+00
6: 2.000000e+00 0.000000e+00 0.000000e+00
7: 2.000000e+00 1.000000e+00 0.000000e+00
8: 2.000000e+00 2.000000e+00 0.000000e+00
9: 3.000000e+00 0.000000e+00 0.000000e+00

0: 0.000000e+00 0.000000e+00 0.000000e+00
1: 0.000000e+00 1.000000e+00 0.000000e+00
2: 0.000000e+00 2.000000e+00 0.000000e+00
3: 1.000000e+00 0.000000e+00 0.000000e+00
4: 1.000000e+00 1.000000e+00 0.000000e+00
5: 1.000000e+00 2.000000e+00 0.000000e+00
6: 2.000000e+00 0.000000e+00 0.000000e+00
7: 2.000000e+00 1.000000e+00 0.000000e+00
8: 2.000000e+00 2.000000e+00 0.000000e+00
9: 3.000000e+00 0.000000e+00 0.000000e+00

% nvc++ -acc -V22.5 main.cpp -Minfo=accel -DWORKS ; a.out
main:
     26, Generating copy(position[:][:]) [if not already present]
     32, Loop is parallelizable
         Generating NVIDIA GPU code
         32, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
         37, #pragma acc loop seq
     32, Local memory used for previous
0: 0.000000e+00 0.000000e+00 0.000000e+00
1: 0.000000e+00 1.000000e+00 0.000000e+00
2: 0.000000e+00 2.000000e+00 0.000000e+00
3: 1.000000e+00 0.000000e+00 0.000000e+00
4: 1.000000e+00 1.000000e+00 0.000000e+00
5: 1.000000e+00 2.000000e+00 0.000000e+00
6: 2.000000e+00 0.000000e+00 0.000000e+00
7: 2.000000e+00 1.000000e+00 0.000000e+00
8: 2.000000e+00 2.000000e+00 0.000000e+00
9: 3.000000e+00 0.000000e+00 0.000000e+00

0: 0.000000e+00 0.000000e+00 0.000000e+00
1: 1.000000e+00 0.000000e+00 0.000000e+00
2: 2.000000e+00 0.000000e+00 0.000000e+00
3: 0.000000e+00 0.000000e+00 1.000000e+00
4: 1.000000e+00 0.000000e+00 1.000000e+00
5: 2.000000e+00 0.000000e+00 1.000000e+00
6: 0.000000e+00 0.000000e+00 2.000000e+00
7: 1.000000e+00 0.000000e+00 2.000000e+00
8: 2.000000e+00 0.000000e+00 2.000000e+00
9: 0.000000e+00 0.000000e+00 3.000000e+00

-Mat

Dear Mat Colgrove

Thank you for the reply and for posting this issue to the problem report TPR#31948.
I confirmed that the output of the test program was correct with

  • moving the definition “double previous[DIM]” out of the parallel loop
  • adding the directive “private(previous[0:DIM]” for the parallel loop

Could I see the progress with respect to #TRP31948 via this forums?

regards,

Masahiro Kondo

No, sorry. If you file an issue directly through NVBugs, then engineers can post comments that are visible to you. However my team still uses the PGI TPR system, which we’ve been using for about 30 years now. Lots of history that we didn’t want to lose after NVIDIA bought us, so we kept it. The downside being we don’t have a way to make it externally visible.

I do post notification after an issue has been resolved in a release.

I understood. I’ll wait for the further notice from you.
Thank you very much!