Unexpected flow graph

Hello,

I work on a Fortran MPI application that I want to accelerate using OpenACC directives.

I compile the code like this (pgf90 version 16.4 running on Linux):
pgf90 -O3 -Mvect -Munroll -Mextend -Mmpi=sgimpi -mcmodel=medium -traceback -Mpreprocess -acc -Minfo -Minfo=accel -DGM_GPU -c ./SRC/inter.f

I get the following output from the compiler:

PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected flow graph (./SRC/inter.f: 89)
inter:
     55, Generating present(q(:,:,:,:),qp(:,:,:,:),dist(:,:,:,:),mask(:,:,:),xa(:),yb(:),zc(:),ineigh(:,:,:,:),jneigh(:,:,:,:),kneigh(:,:,:,:),xip(:,:,:),yip(:,:,:),zip(:,:,:),matinv(:,:,:,:))
     89, CUDA shared memory used for qip
         Accelerator kernel generated
         Generating Tesla code
         92, !$acc loop gang collapse(3) ! blockidx%x
         93,   ! blockidx%x collapsed
         94,   ! blockidx%x collapsed
        112, !$acc loop vector(128) ! threadidx%x
        133, !$acc loop vector(128) ! threadidx%x
        141, !$acc loop vector(128) ! threadidx%x
        144, !$acc loop vector(128) ! threadidx%x
        154, !$acc loop vector(128) ! threadidx%x
        163, !$acc loop vector(128) ! threadidx%x
        169, !$acc loop vector(128) ! threadidx%x
        181, !$acc loop vector(128) ! threadidx%x
        184, !$acc loop vector(128) ! threadidx%x
     92, Loop not vectorized/parallelized: too deeply nested
     93, Loop not vectorized/parallelized: too deeply nested
    107, Scalar last value needed after loop for m,n,kc at line 113
         Scalar last value needed after loop for gradflag at line 138
         Loop not vectorized/parallelized: potential early exits
    112, Loop is parallelizable
         Loop unrolled 5 times (completely unrolled)
    121, Loop unrolled 5 times (completely unrolled)
    125, Loop unrolled 5 times (completely unrolled)
    132, Loop is parallelizable
         Loop not vectorized: may not be beneficial
         Unrolled inner loop 4 times
         Generated 3 prefetches in scalar loop
    133, Loop is parallelizable
         Loop unrolled 5 times (completely unrolled)
    139, Loop is parallelizable
    140, Loop carried dependence of wp prevents parallelization
         Loop carried backward dependence of wp prevents vectorization
         Generated vector sse code for the loop
    141, Loop is parallelizable
         Loop unrolled 3 times (completely unrolled)
    144, Loop is parallelizable
         Loop unrolled 2 times (completely unrolled)
    152, Loop is parallelizable
    153, Loop carried dependence of wp prevents parallelization
         Loop carried backward dependence of wp prevents vectorization
         Generated vector sse code for the loop
    154, Loop is parallelizable
         Loop unrolled 5 times (completely unrolled)
    163, Loop is parallelizable
         Loop unrolled 5 times (completely unrolled)
    169, Loop is parallelizable
         Loop unrolled 5 times (completely unrolled)
    181, Loop is parallelizable
         Loop unrolled 3 times (completely unrolled)
    184, Loop is parallelizable
         Loop unrolled 2 times (completely unrolled)

The corresponding source code is:

#ifdef GM_GPU
 89 !$ACC PARALLEL LOOP COLLAPSE(3)
 90 !$ACC& PRIVATE(gradflag,my_iMask,kk,j,i,k,m,n,kc,l,small2,D1,D2,d3,delta,QIP,PHI,WP)
 91 #endif
 92       DO kk = 1, Nz
 93          DO j = 1, Ny
 94             DO i = 1, Nx
 95 !
 96                gradflag = 0
 97 !
 98                IF (mask(i,j,kk) == -1) THEN
 99                   my_iMask = my_iMask + 1
100 !
101                   D1 = Da * Xa(i)
102                   D2 = Db * Yb(j)
103                   D3 = Dc * Zc(kk)
104                   delta = MIN (D1, D2, D3)
105                   small2 = factor * delta
106 
107                   DO k = 1, ibcDim     ! Loop over all neighbours of IP
108                      m  = INeigh(k,i,j,kk)
109                      n  = JNeigh(k,i,j,kk)
110                      kc = KNeigh(k,i,j,kk)
111                      IF ( (dist(k,i,j,kk) <= small2) .AND. (mask(m,n,kc) == 0) ) THEN
112                         DO l = 1, 5
113                            QIP(l) = Q(m,l,n,kc)
114                         END DO
115                         GOTO 30
116                      END IF
117 
118 !*************  Fill PHI ******************
119                      IF (mask(m,n,kc) /= 0) THEN     ! If the neighbour of IP is a solid point
120                         gradflag = 1
121                         DO l = 1, 5
122                            PHI(l,k) = zero
123                         END DO
124                      ELSE
125                         DO l = 1, 5
126                            PHI(l,k) = Q(m,l,n,kc)
127                         END DO
128                      END IF
129                   END DO   ! DO k = 1, ibcDim
130 
131 !********** Finding out Weighting Parameters' values ****************
132                   DO k = 1, ibcDim
133                      DO l = 1, 5
134                         WP(l,k) = zero
135                      END DO
136                   END DO
137 
138                   IF (gradflag == 1) THEN
139                      DO aa = 1, ibcDim
140                         DO k = 1, ibcDim
141                            DO l = 1, 3
142                               WP(l,aa) = WP(l,aa) + MATINV(aa,k,my_iMask,1) * PHI(l,k)
143                            END DO
144                            DO l = 4, 5
145                               WP(l,aa) = WP(l,aa) + MATINV(aa,k,my_iMask,2) * PHI(l,k)
146                            END DO
147                         END DO
148                      END DO
149 !
150                   ELSE
151 !
152                      DO aa = 1, ibcDim
153                         DO k = 1, ibcDim
154                            DO l = 1, 5
155                               WP(l,aa) = WP(l,aa) + MATINV(aa,k,my_iMask,1) * PHI(l,k)
156                            END DO
157                         END DO
158                      END DO
159                   END IF
160 
161 !*********** Calculating Q at IP ********************
162                   IF (ibcDim == 4) THEN
163                      DO l = 1, 5
 164                         QIP(l) = WP(l,1) * XIP(i,j,kk) * YIP(i,j,kk) + WP(l,2) * XIP(i,j,kk) + WP(l,3) * YIP(i,j,kk) + WP(l,4)
165                      END DO
166 !
167                   ELSE
168 !
169                      DO l = 1, 5
170                         QIP(l) = WP(l,1) * XIP(i,j,kk) * YIP(i,j,kk) * ZIP(i,j,kk) + WP(l,2) * XIP(i,j,kk) * YIP(i,j,kk)
171      &                         + WP(l,3) * XIP(i,j,kk) * ZIP(i,j,kk)               + WP(l,4) * YIP(i,j,kk) * ZIP(i,j,kk)
172      &                         + WP(l,5) * XIP(i,j,kk)                             + WP(l,6) * YIP(i,j,kk)
173      &                         + WP(l,7) * ZIP(i,j,kk)                             + WP(l,8)
174                      END DO
175 !
176                   END IF
177 
178 !********* Putting appropriate values of Q at the ghost point ************
179    30             CONTINUE
180 !
181                   DO l = 1, 3
182                      Q(i,l,j,kk) = - QIP(l)
183                   END DO
184                   DO l = 4, 5
185                      Q(i,l,j,kk) =   QIP(l)
186                   END DO
187                END IF    ! IF (mask(i,j,kk) == -1) THEN
188 !
189             END DO
190          END DO
191       END DO
192 #ifdef GM_GPU
193 !$ACC END PARALLEL LOOP
194 #endif

First, I do not understand the “Unexpected flow graph” message in my case; I could find only very little information about it.
Next, I do not understand why the “Scalar last value needed after loop” messages appear, since the scalar variables m, n, kc and gradflag should be private.
Moreover, regarding the message about QIP (“CUDA shared memory used for qip”): this array is small and it should be private too.

 17      REAL(rp),  DIMENSION(5)    ::  QIP

Thanks for any/all the advice you can give.

Hi mguy44,

First, I do not understand the “Unexpected flow graph” message in my case; I could find only very little information about it.

It’s an internal compiler error. Can you please send a reproducing example to PGI Customer Service (trs@pgroup.com) so we can investigate?

Next, I do not understand why the “Scalar last value needed after loop” messages appear, since the scalar variables m, n, kc and gradflag should be private.

The compiler is trying to parallelize your inner loops. For the most part they are getting parallelized, but in a few cases it can’t due to these scalar dependencies.
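
For illustration (a made-up fragment, not taken from your routine), this is the kind of dependency the message describes: when a scalar assigned inside a loop is still needed after the loop, the compiler has to preserve the value from the last iteration executed, so that loop must stay sequential:

      DO k = 1, n
         m = idx(k)       ! m is overwritten on every iteration
      END DO
      q = a(m)            ! the last value of m is needed here, so the
                          ! k loop cannot be parallelized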

Though, I’m thinking that it might be better to only parallelize the outer loops. Try adding “gang vector” to your “parallel loop” directive:

!$ACC PARALLEL LOOP COLLAPSE(3) GANG VECTOR



Moreover, regarding the message about QIP (“CUDA shared memory used for qip”): this array is small and it should be private too.

It is private, but private to the gang and shared among the vector lanes. For gang-private arrays, the compiler will try to place them in CUDA shared memory, since it is faster to access than global memory.

Adding “gang vector” will eliminate this since the “private” would now apply to the “vector” instead of just the “gang”.

Note that scalars are private by default, so there is no need to add them to the “private” clause. While there are a few cases where you do need to add them to “private”, if you don’t, the compiler can declare them local to the kernel, which increases the likelihood they’ll be kept in a register.
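
Putting the two suggestions together, the directive could be reduced to something like this (a sketch only, adjust the clause lists to your code), with just the arrays listed in PRIVATE:

!$ACC PARALLEL LOOP COLLAPSE(3) GANG VECTOR
!$ACC& PRIVATE(QIP,PHI,WP)

The loop indices and temporary scalars are then left for the compiler to privatize, which gives it a better chance of keeping them in registers.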

  • Mat

Hello Mat,

Thank you for your message.

I have simplified the dependencies of this routine in order to produce a reproducing example for PGI Customer Service. I will send it.

I tried the GANG VECTOR but the error remained.

Regarding the scalar variables, “GANG VECTOR” has no effect; I still get the messages “Scalar last value needed after loop for n at line …”.

My goal is to parallelize only the outer loops, which is why I put the “ACC PARALLEL LOOP” directive just before the three nested outer loops; I do not understand why the compiler tries to do something else.

Guy.

Hello Mat,

Following your advice, I removed the GOTO 30 statement and replaced it with a logical test:

 56                   l_fly = .TRUE.
 57                   k1_loop: DO k = 1, ibcDim     ! Loop over all neighbours of IP
 58                      m  = INeigh(k,i,j,kk)
 59                      n  = JNeigh(k,i,j,kk)
 60                      kc = KNeigh(k,i,j,kk)
 61                      IF ( (dist(k,i,j,kk) <= small2) .AND. (mask(m,n,kc) == 0) ) THEN
 62                         DO l = 1, 5
 63                            QIP(l) = Q(m,l,n,kc)
 64                         END DO
 65                         l_fly = .FALSE.
 66                         EXIT k1_loop
 67 !!!                     GOTO 30
 68                      END IF
 ...
 81                   END DO k1_loop  ! DO k = 1, ibcDim
 82 
 83                   IF (l_fly) THEN
...
                  END IF    ! if l_fly
132 !********* Putting appropriate values of Q at the ghost point ************
133 !!!30             CONTINUE
134 !

The output of the compiler is the following:

pgf90  -O3 -Mvect -Munroll -Mextend  -mcmodel=medium -traceback -Mpreprocess   -acc -Minfo -Minfo=accel -DGM_GPU -DGM_NO_IO -c my_module.f inter.f 
my_module.f:
inter.f:
PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected flow graph (inter.f: 37)
inter:
     28, Generating present(q(:,:,:,:),qp(:,:,:,:),dist(:,:,:,:),mask(:,:,:),xa(:),yb(:),zc(:),ineigh(:,:,:,:),jneigh(:,:,:,:),kneigh(:,:,:,:),xip(:,:,:),yip(:,:,:),zip(:,:,:),matinv(:,:,:,:))
     37, Accelerator kernel generated
         Generating Tesla code
         41, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
         42,   ! blockidx%x threadidx%x collapsed
         43,   ! blockidx%x threadidx%x collapsed
         57, !$acc loop seq
         62, !$acc loop seq
         73, !$acc loop seq
         77, !$acc loop seq
         85, !$acc loop seq
         86, !$acc loop seq
         92, !$acc loop seq
         93, !$acc loop seq
         94, !$acc loop seq
         97, !$acc loop seq
        105, !$acc loop seq
        106, !$acc loop seq
        107, !$acc loop seq
        116, !$acc loop seq
        122, !$acc loop seq
        135, !$acc loop seq
        138, !$acc loop seq
     41, Loop not vectorized/parallelized: too deeply nested
     42, Loop not vectorized/parallelized: too deeply nested
     57, Scalar last value needed after loop for m at line 63
         Scalar last value needed after loop for n at line 63
         Scalar last value needed after loop for kc at line 63
         Scalar last value needed after loop for gradflag at line 91
         Loop not vectorized/parallelized: potential early exits
     62, Loop is parallelizable
         Loop unrolled 5 times (completely unrolled)
     73, Loop unrolled 5 times (completely unrolled)
     77, Loop unrolled 5 times (completely unrolled)
     85, Loop is parallelizable
         Loop not vectorized: may not be beneficial
         Unrolled inner loop 4 times
         Generated 3 prefetches in scalar loop
     86, Loop is parallelizable
         Loop unrolled 5 times (completely unrolled)
     92, Loop is parallelizable
     93, Loop carried dependence of wp prevents parallelization
         Loop carried backward dependence of wp prevents vectorization
         Generated vector sse code for the loop
     94, Loop is parallelizable
         Loop unrolled 3 times (completely unrolled)
     97, Loop is parallelizable
         Loop unrolled 2 times (completely unrolled)
    105, Loop is parallelizable
    106, Loop carried dependence of wp prevents parallelization
         Loop carried backward dependence of wp prevents vectorization
         Generated vector sse code for the loop
    107, Loop is parallelizable
         Loop unrolled 5 times (completely unrolled)
    116, Loop is parallelizable
         Loop unrolled 5 times (completely unrolled)
    122, Loop is parallelizable
         Loop unrolled 5 times (completely unrolled)
    135, Loop is parallelizable
         Loop unrolled 3 times (completely unrolled)
    138, Loop is parallelizable
         Loop unrolled 2 times (completely unrolled)

The logical variable l_fly is a local variable, and I added it to the ACC PRIVATE clause.

Guy.

Hi Guy,

Unfortunately it’s the early exit from the k1_loop that’s causing the problem and not the use of GOTO. Hence, using EXIT results in the same error.

For the workaround, you’ll need to do something like:

               k1_loop:  DO k = 1, ibcDim     ! Loop over all neighbours of IP
                   if (l_fly) then
                     m  = INeigh(k,i,j,kk)
                     n  = JNeigh(k,i,j,kk)
                     kc = KNeigh(k,i,j,kk)
                     IF ( (dist(k,i,j,kk) <= small2) .AND. (mask(m,n,kc) == 0) ) THEN
                        DO l = 1, 5
                           QIP(l) = Q(m,l,n,kc)
                        END DO
!                        GOTO 30
                        l_fly = .false.
                     END IF
...
                     END IF
                   end if ! end l_fly
                  END DO k1_loop  ! DO k = 1, ibcDim

                   if (l_fly) then
...
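
If it helps, here is a minimal, self-contained sketch of the same pattern with made-up data, sizes and names (dist, qip and the 0.5 threshold are invented for the example): the flag turns the body into a no-op once a match is found, and the fallback path runs after the loop only when nothing matched.

      PROGRAM flag_guard_demo
!     Minimal sketch: each outer iteration scans its column of dist for
!     the first value below a threshold.  A logical flag skips the rest
!     of the scan instead of EXIT, so the inner loop stays single-exit
!     and the outer loop can be mapped to the GPU.
      IMPLICIT NONE
      INTEGER, PARAMETER :: ni = 4, nk = 5
      REAL    :: dist(nk,ni), qip(ni)
      LOGICAL :: l_fly
      INTEGER :: i, k

      CALL RANDOM_NUMBER(dist)

!$ACC PARALLEL LOOP GANG VECTOR COPYIN(dist) COPYOUT(qip)
      DO i = 1, ni
         l_fly = .TRUE.          ! scalars are private by default
         DO k = 1, nk
            IF (l_fly) THEN
               IF (dist(k,i) <= 0.5) THEN
                  qip(i) = dist(k,i)   ! take the first match
                  l_fly  = .FALSE.     ! skip the body from now on
               END IF
            END IF
         END DO
         IF (l_fly) qip(i) = 0.0       ! fallback: nothing matched
      END DO
!$ACC END PARALLEL LOOP

      PRINT *, qip
      END PROGRAM flag_guard_demo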

Best Regards,
Mat

Hello Mat,

Thank you for your idea; it works. The code compiles, and my problem is solved.

Regards,
Guy.

The original problem has been fixed in our 2019 compilers, starting with PGI 19.1.