Shared memory is not correctly used in kernels block

MasahiroKondo64929 · June 7, 2022, 12:25pm

When fixed size array is used inside the parallel loop with kernels directive,
it is not compiled as shared memory. As a results, the swap calculation fails and strange results is obtained in the simple program as follows;

define DIM 3
define N 10
const int W=3;

int main(int argc, char *argv){

double position[N][DIM];
for(int iP=0;iP<N;++iP){
position[iP][0]=iP/W;
position[iP][1]=iP%W;
position[iP][2]=0.0;
}

for(int iP=0;iP<N;++iP){
printf(“%d: %e %e %e\n”,iP,position[iP][0],position[iP][1],position[iP][2]);
}
printf(“\n”);

#pragma acc data copy(position[0:N][0:DIM])
{
#pragma acc kernels loop independent // compiler message: local memory used for previous (strange results)
// #pragma acc parallel loop independent // compiler message: sheared memory used for previous (sound results)
for(int iP=0;iP<N;++iP){
double previous[DIM];
#pragma acc loop seq
for(int iD=0;iD<DIM;++iD){
previous[iD]=position[iP][iD];
}
position[iP][0]=previous[1];
position[iP][1]=previous[2];
position[iP][2]=previous[0];
}
}

for(int iP=0;iP<N;++iP){
printf(“%d: %e %e %e\n”,iP,position[iP][0],position[iP][1],position[iP][2]);
}
printf(“\n”);

return 0;
}

MasahiroKondo64929 · June 7, 2022, 12:49pm

The program was tested with the compiler versions of 20.9, 22.1 and 22.5.
Version 20.9 was sound, but 22.1 and 22.5 yielded the strange results.
Sorry for missing compiler information.

MatColgrove · June 7, 2022, 3:59pm

Hi MasahiroKondo64929,

The compiler will attempt to include gang private arrays in shared memory. Though in the kernels case, the outer loop is scheduled using gangs and vectors, hence the “previous” array is private to a vector and stored in local memory. For “parallel”, only a gang parallelism is used and why it’s placed in shared memory.

If you make the inner loop use “vector” instead of “seq”, or add “gang” to the outer loop schedule, then shared memory will be used.

Example:

% nvc -acc -Minfo=accel test.c
main:
     23, Generating copy(position[:][:]) [if not already present]
     26, Loop is parallelizable
         Generating NVIDIA GPU code
         26, #pragma acc loop gang /* blockIdx.x */
         29, #pragma acc loop vector(32) /* threadIdx.x */
     26, CUDA shared memory used for previous
     29, Loop is parallelizable

Now it may not make sense to parallelize the inner loop given the loop trip count is only 3, but hopefully this illustrates what’s going on.

-Mat

MasahiroKondo64929 · June 8, 2022, 12:35am

Dear MatColgrove

Thank you for the reply. I could understand what’s going on. But, as you pointed out, parallelizing the loop only for 3 element fixed array does not seem efficient. It seems that the only outer loop can be parallelized when the inner loop is manually expanded as;

#pragma acc kernels present(position[0:N][0:DIM])
#pragma acc loop independent
for(int iP=0;iP<N;++iP){
double previous0=position[iP][0];
double previous1=position[iP][1];
double previous2=position[iP][2];
position[iP][0]=previous1;
position[iP][1]=previous2;
position[iP][2]=previous0;
}

But, this is not convincing to me because such description had been acceptable for the older compiler version such as 20.9.
Is the manual expansion only the way for only parallelizing the outer loop in the current compilers, e.g. 22.1 and 22.5 ?

regards,

Masahiro Kondo

MatColgrove · June 8, 2022, 2:06pm

Sorry but I’m not clear on the question.

You can parallelize the inner loop in the first example, it’s just that given a warp has 32 threads, using only three of them with the remaining 29 idle, isn’t ideal.

The outer loop can be parallelized as well, again since it’s have a loop trip count of 10, it’s not big enough to fill a GPU. There’s no need to manually unroll the inner loop.

MasahiroKondo64929 · June 9, 2022, 1:59am

Dear Mat

Thank you, and sorry for confusing you. The code sample firstly shown on this topic is just prepared to figure out the situation where the calculation gets strange. So, this is a kind of test code. We are applying openACC for accelerating particle simulations. So, the “N”, which is number of particles, is not just 10, but is is very large and dynamically given. Sometimes, it’s close to 1 million. In such case, parallelizing inner loop wasting 29 (= 32-3) threads does not seem efficient. So, the way for only parallelizing outer loop is desired. Is this make sense?

regards,

Masahiro Kondo

MatColgrove · June 9, 2022, 4:14pm

Typically when the outer loop trip count is large and the inner loop trip count is very small, then yes, it’s best to only parallelize the outer loop. There are of course exceptions and it depends on the specifics of actual code, but that’s the typical recommendation.

My confusion was with your question “Is the manual expansion only the way for only parallelizing the outer loop in the current compilers,”. Manual expansion, what I would call unrolling, normally doesn’t effect whether the outer loop can be parallelized or not.

So, yes, parallelize the outer loop but no, manual expansion isn’t the only way to parallelize it. Though, I don’t have a complete picture of your code so unclear why you thought this might be case.

MasahiroKondo64929 · June 10, 2022, 7:23am

Dear Mat

Thank you for the reply.
With older compiler 20.9, the outer loop was parallelized using gang and vector(128). However, It seems that the current compilers 22.1 and 22.5 use only gang for outer loop parallelization when we do not explicitly apply unrolling. I’m wondering whether this is as efficient as the old version because it does not use vector parallelization.

When unrolling is done, the vector parallelization for outer loop is applied even with the current compilers 22.1 and 22.5 as well.

If you need to see the complete picture of my code, please visit
https://github.com/Masahiro-Kondo-AIST/MphExplicit.git

regards,

Masahiro kondo

MatColgrove · June 10, 2022, 5:54pm

I’m not seeing this in the original sample code you posted nor in the git repo code. Which loop in the git code do you see this behavior?

There was a change in behavior in how pointers are handled to be more in line with the OpenACC spec. They are now treated as scalars so you need to use triplet notation, i,e, “array[:size]”, for arrays. This is causing the code to get compilation errors in 22.5 since this isn’t used in your “declare create” statements, and several arrays need to be added to data regions. I didn’t fix this, but rather used managed memory (i.e. add -gpu=managed) so I could view the schedule.

Here’s the output from compiling the original sample code with 22.5 and 20.9. The schedules used are exactly the same.

% nvc -V22.5 -acc -Minfo=acc -w test1.c
main:
     20, Generating copy(position[:][:]) [if not already present]
     23, Loop is parallelizable
         Generating NVIDIA GPU code
         23, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
         26, #pragma acc loop seq
     23, Local memory used for previous
% nvc -V20.9 -acc -Minfo=acc -w test1.c
main:
     20, Generating copy(position[:][:]) [if not already present]
     23, Loop is parallelizable
         Generating Tesla code
         23, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
         26, #pragma acc loop seq
     23, Local memory used for previous

MasahiroKondo64929 · June 12, 2022, 12:17am

Dear Mat

The output from the compiler 22.5 was the same as what you’ve shown.
What I got is as follows:

pgc++ -acc -O3 -Minfo=accel -gpu=managed -c main.cpp
main:
28, Generating copy(position[:][:]) [if not already present]
31, Loop is parallelizable
Generating NVIDIA GPU code
31, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
34, #pragma acc loop seq
31, Local memory used for previous
pgc++ -acc -O3 -Minfo=accel -gpu=managed -o test main.o -lm

However, the program output was strange. Specifically, the swapping never occurred:

0: 0.000000e+00 0.000000e+00 0.000000e+00
1: 0.000000e+00 1.000000e+00 0.000000e+00
2: 0.000000e+00 2.000000e+00 0.000000e+00
3: 1.000000e+00 0.000000e+00 0.000000e+00
4: 1.000000e+00 1.000000e+00 0.000000e+00
5: 1.000000e+00 2.000000e+00 0.000000e+00
6: 2.000000e+00 0.000000e+00 0.000000e+00
7: 2.000000e+00 1.000000e+00 0.000000e+00
8: 2.000000e+00 2.000000e+00 0.000000e+00
9: 3.000000e+00 0.000000e+00 0.000000e+00

0: 0.000000e+00 0.000000e+00 0.000000e+00
1: 0.000000e+00 1.000000e+00 0.000000e+00
2: 0.000000e+00 2.000000e+00 0.000000e+00
3: 1.000000e+00 0.000000e+00 0.000000e+00
4: 1.000000e+00 1.000000e+00 0.000000e+00
5: 1.000000e+00 2.000000e+00 0.000000e+00
6: 2.000000e+00 0.000000e+00 0.000000e+00
7: 2.000000e+00 1.000000e+00 0.000000e+00
8: 2.000000e+00 2.000000e+00 0.000000e+00
9: 3.000000e+00 0.000000e+00 0.000000e+00

This is why I posted this topic.
I cannot say in which part in the git code shows the strange behavior because it’s too long for checking. So, I thought that we should confirm the basic behavior in the small program. Actually, I broke done the git code and come to the test code as posted in this topic. Firstly, I’d like to know how the outer loop can only be parallelized soundly using both gang and vectors via the current compilers.

Sorry, I could not clearly understand the change in pointer handling in OpenACC which you explained. I tried the option “-gpu=managed”, but the output from the compiler and the compiled program were the same as those without the option.

regards,

Masahiro Kondo

MatColgrove · June 13, 2022, 5:09pm

Thanks! Finally determined the difference. You’re using C++ (fails) while I was using C (works). Looks like a compiler code generation issue with how it’s accessing “previous”. I’ve filed a problem report, TPR #31948, and sent to engineering for investigation.

As a work-around, you can hoist the declaration of “previous” and then manually privatize it. Note “managed” only works with dynamically allocated data, so would be ineffective in this case.

Example:

% cat main.cpp
#include <stdio.h>
#include <stdlib.h>

#define DIM 3
#define N 10
const int W=3;

int main(int argc, char *argv){

double position[N][DIM];
for(int iP=0;iP<N;++iP){
position[iP][0]=iP/W;
position[iP][1]=iP%W;
position[iP][2]=0.0;
}

for(int iP=0;iP<N;++iP){
printf("%d: %e %e %e\n",iP,position[iP][0],position[iP][1],position[iP][2]);
}
printf("\n");
#ifdef WORKS
double previous[DIM];
#endif

#pragma acc data copy(position[0:N][0:DIM])
{
#ifdef WORKS
#pragma acc kernels loop independent private(previous[:DIM])
#else
#pragma acc kernels loop independent
#endif
for(int iP=0;iP<N;++iP){
#ifndef WORKS
double previous[DIM];
#endif
#pragma acc loop seq
for(int iD=0;iD<DIM;++iD){
previous[iD]=position[iP][iD];
}
position[iP][0]=previous[1];
position[iP][1]=previous[2];
position[iP][2]=previous[0];
}
}

for(int iP=0;iP<N;++iP){
printf("%d: %e %e %e\n",iP,position[iP][0],position[iP][1],position[iP][2]);
}
printf("\n");

return 0;
}
% nvc++ -acc -V22.5 main.cpp -Minfo=accel ; a.out
main:
     26, Generating copy(position[:][:]) [if not already present]
     32, Loop is parallelizable
         Generating NVIDIA GPU code
         32, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
         37, #pragma acc loop seq
     32, Local memory used for previous
0: 0.000000e+00 0.000000e+00 0.000000e+00
1: 0.000000e+00 1.000000e+00 0.000000e+00
2: 0.000000e+00 2.000000e+00 0.000000e+00
3: 1.000000e+00 0.000000e+00 0.000000e+00
4: 1.000000e+00 1.000000e+00 0.000000e+00
5: 1.000000e+00 2.000000e+00 0.000000e+00
6: 2.000000e+00 0.000000e+00 0.000000e+00
7: 2.000000e+00 1.000000e+00 0.000000e+00
8: 2.000000e+00 2.000000e+00 0.000000e+00
9: 3.000000e+00 0.000000e+00 0.000000e+00

0: 0.000000e+00 0.000000e+00 0.000000e+00
1: 0.000000e+00 1.000000e+00 0.000000e+00
2: 0.000000e+00 2.000000e+00 0.000000e+00
3: 1.000000e+00 0.000000e+00 0.000000e+00
4: 1.000000e+00 1.000000e+00 0.000000e+00
5: 1.000000e+00 2.000000e+00 0.000000e+00
6: 2.000000e+00 0.000000e+00 0.000000e+00
7: 2.000000e+00 1.000000e+00 0.000000e+00
8: 2.000000e+00 2.000000e+00 0.000000e+00
9: 3.000000e+00 0.000000e+00 0.000000e+00

% nvc++ -acc -V22.5 main.cpp -Minfo=accel -DWORKS ; a.out
main:
     26, Generating copy(position[:][:]) [if not already present]
     32, Loop is parallelizable
         Generating NVIDIA GPU code
         32, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
         37, #pragma acc loop seq
     32, Local memory used for previous
0: 0.000000e+00 0.000000e+00 0.000000e+00
1: 0.000000e+00 1.000000e+00 0.000000e+00
2: 0.000000e+00 2.000000e+00 0.000000e+00
3: 1.000000e+00 0.000000e+00 0.000000e+00
4: 1.000000e+00 1.000000e+00 0.000000e+00
5: 1.000000e+00 2.000000e+00 0.000000e+00
6: 2.000000e+00 0.000000e+00 0.000000e+00
7: 2.000000e+00 1.000000e+00 0.000000e+00
8: 2.000000e+00 2.000000e+00 0.000000e+00
9: 3.000000e+00 0.000000e+00 0.000000e+00

0: 0.000000e+00 0.000000e+00 0.000000e+00
1: 1.000000e+00 0.000000e+00 0.000000e+00
2: 2.000000e+00 0.000000e+00 0.000000e+00
3: 0.000000e+00 0.000000e+00 1.000000e+00
4: 1.000000e+00 0.000000e+00 1.000000e+00
5: 2.000000e+00 0.000000e+00 1.000000e+00
6: 0.000000e+00 0.000000e+00 2.000000e+00
7: 1.000000e+00 0.000000e+00 2.000000e+00
8: 2.000000e+00 0.000000e+00 2.000000e+00
9: 0.000000e+00 0.000000e+00 3.000000e+00

-Mat

MasahiroKondo64929 · June 14, 2022, 2:45am

Dear Mat Colgrove

Thank you for the reply and for posting this issue to the problem report TPR#31948.
I confirmed that the output of the test program was correct with

moving the definition “double previous[DIM]” out of the parallel loop
adding the directive “private(previous[0:DIM]” for the parallel loop

Could I see the progress with respect to #TRP31948 via this forums?

regards,

Masahiro Kondo

MatColgrove · June 14, 2022, 3:32pm

No, sorry. If you file an issue directly through NVBugs, then engineers can post comments that are visible to you. However my team still uses the PGI TPR system, which we’ve been using for about 30 years now. Lots of history that we didn’t want to lose after NVIDIA bought us, so we kept it. The downside being we don’t have a way to make it externally visible.

I do post notification after an issue has been resolved in a release.

MasahiroKondo64929 · June 15, 2022, 1:41am

I understood. I’ll wait for the further notice from you.
Thank you very much!

Topic		Replies	Views
Privatization of array Legacy PGI Compilers	9	17686	July 14, 2010
Code execution depends strangely on irrelevant parameters Legacy PGI Compilers	8	8162	October 22, 2013
Questions on incorrect results with openacc in GPU nvc, nvc++ and nvfortran	33	2723	December 4, 2023
How to parallelize this loop... Legacy PGI Compilers	14	7943	December 18, 2012
Parallelizing a loop Legacy PGI Compilers	9	5569	March 1, 2016
Pgfortran 20.4 and OpenACC giving "cudaLaunchKernel returned status 2: out of memory" Legacy PGI Compilers	17	1077	June 3, 2021
How to not parallelize inner loops in OpenACC ? Legacy PGI Compilers	7	3807	May 1, 2020
Acc loop with a FORTRAN subroutine call (acc routine worker) is not parallelized! Legacy PGI Compilers	18	2032	April 6, 2022
Need advices for optimizing heart of CFD code Legacy PGI Compilers	11	7144	July 13, 2016
OPENACC changes value of array Legacy PGI Compilers	12	9810	May 17, 2016

Shared memory is not correctly used in kernels block

Related topics