There’s not enough information here to give you a definitive answer; wrong results can come from a number of causes. It’s possible that data isn’t being copied correctly between the host and device, or it could be a race condition in your code. If you could post a full reproducing example, that would be helpful.
I do notice that your atomic capture is in an incorrect form. Try capturing the incremented value of the array element into a local variable, then use that local variable as the index into ibox.
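Here’s a minimal, stand-alone sketch of the capture form I mean. The array names nc and ibox follow your snippet, but the dimensions, the cell computation, and the driver program are made up purely for illustration, so adapt it to your actual code:

      program atomic_capture_demo
      implicit none
      integer, parameter :: ncell = 4, nslot = 8, np = 16
      integer :: nc(ncell), ibox(ncell,nslot)
      integer :: k, ii, idx

      nc   = 0
      ibox = 0

!$acc parallel loop copy(nc,ibox) private(ii,idx)
      do k = 1, np
         ! Stand-in for the real cell computation.
         ii = mod(k-1,ncell) + 1

         ! Atomically increment the per-cell counter and capture the
         ! new value, so this iteration gets its own unique slot idx.
!$acc atomic capture
         nc(ii) = nc(ii) + 1
         idx = nc(ii)
!$acc end atomic

         ! idx is unique to this iteration, so no atomic is needed here.
         ibox(ii,idx) = k
      enddo
!$acc end parallel loop

      print *, 'counts per cell:', nc
      end program atomic_capture_demo

Note that the order in which indices land in each row of ibox is nondeterministic when the loop runs in parallel; only the set of entries per cell is reproducible.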
Thanks for your answer. It’s still not right.
I think the problem is with the OpenACC directives. Once I add the OpenACC directives the result is wrong; without them the result is right. My code is part of a large project that contains more than 30 .F files of subroutines. Should I upload the entire project, or only the subroutine code?
As I don’t know how to upload files, I will post some code below.
      subroutine step
      include 'common.2D'

      call ini_divide(2)
      call divide(nbp1,npt,2)

      end

      subroutine ini_divide(kind_p)
      include 'common.2D'
!$acc declare present(nc(:,:),ibox(:,:,:),nct)

!$acc update device(nct)
!$acc kernels copyin(nplink_max)
      do i=1,nct
         nc(i,kind_p) = 0
         ibox(i,kind_p,1:nplink_max) = 0
      enddo
!$acc end kernels
!$acc update host(nc(:,:),ibox(:,:,:))

      return
      end

      subroutine divide(n_start,n_end,kind_p)
      include 'common.2D'

!$acc parallel loop vector
      do k=n_start,n_end
         if (iflag(k).ne.0) then
            dx = xp(k) - xmin
            dz = zp(k) - zmin
            icell = int( dx * one_over_3h ) + 1
            kcell = int( dz * one_over_3h ) + 1
            ii = icell + (kcell - 1)*ncx
!$acc atomic capture
            nc(ii,kind_p) = nc(ii,kind_p)+1
            idx = nc(ii,kind_p)
!$acc end atomic
!$acc atomic write
            ibox(ii,kind_p,idx)=k
         endif
      enddo
!$acc end parallel
      return
      end
Thank you very much for your advice. I’ve found the problem: it was a data transfer issue, and I have fixed it.
But the result is still slightly wrong. The result is correct when using the ‘seq’ clause in the parallel region, but when using the ‘vector’ clause the result has a small error.
Since I have incomplete information, it’s difficult for me to offer specific advice here. From the information given, it seems likely that this loop is not safely parallelizable and you should run it serially. One possibility is that the atomic capture changes the order in which particles are placed into each cell of ibox, so any downstream code that depends on that ordering (a floating-point summation, for example) can produce slightly different results. Perhaps there’s an alternative algorithm you can use?
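If you do run it serially, you can still keep the loop on the device so the arrays stay resident. Below is an untested sketch using the serial construct; it assumes the same variables from your common.2D include and the same data management as your current version, and it drops the atomics since only one device thread executes the loop:

      subroutine divide(n_start,n_end,kind_p)
      include 'common.2D'

      ! One gang, one worker, one vector lane: the fill order of ibox
      ! matches the original serial code.
!$acc serial loop
      do k = n_start, n_end
         if (iflag(k).ne.0) then
            dx = xp(k) - xmin
            dz = zp(k) - zmin
            icell = int( dx * one_over_3h ) + 1
            kcell = int( dz * one_over_3h ) + 1
            ii = icell + (kcell - 1)*ncx
            ! No atomics needed: a single thread updates nc and ibox.
            nc(ii,kind_p) = nc(ii,kind_p) + 1
            ibox(ii,kind_p,nc(ii,kind_p)) = k
         endif
      enddo
!$acc end serial loop

      return
      end

Whether this is fast enough depends on how large n_end - n_start is; if this step dominates the runtime, a different, parallel-friendly binning algorithm would be worth considering.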