Different results on CPU and GPU

Hi,
For the code below, I get different results on the CPU and the GPU. The differences appear in the array PFT. Repeated runs change the differences, and occasionally a run produces no difference at all.

!$acc data copy(NWALLS, MAX_PIP, WALLDTSPLIT, WALLCONTACT(1:MAX_PIP,1:NWALLS), PEA(1:MAX_PIP,1:4), &
!$acc&          DES_POS_NEW(1:MAX_PIP,1:DIMN), W_POS_L(1:MAX_PIP,1:NWALLS,1:DIMN), PFT(1:MAX_PIP,1:MAXNEIGHBORS,1:DIMN) )
!$acc parallel
!$acc loop gang, private(LL, NI, IW, PFT_TMP(1:DIMN), DIST(1:DIMN) )

      DO LL = 1, MAX_PIP
         IF(.NOT.PEA(LL,1) .OR. PEA(LL,4) ) CYCLE

         DO IW = 1, NWALLS

               IF(.NOT.WALLDTSPLIT .OR. PEA(LL,2) .OR. PEA(LL,3) .OR. WALLCONTACT(LL,IW).NE.1 ) GOTO 200

               NI=IW !Line added by AJ for debugging
               DIST(:)=ZERO  !Line added by AJ for debugging
               DIST(:) = w_pos_l(LL,IW,:) - DES_POS_NEW(LL,:)

! Save the tangential displacement history with the correction of Coulomb's law
                  PFT_TMP(:)=DIST(:)
                  IF (PARTICLE_SLIDE) THEN
                  ELSE
                     PFT(LL,NI,:) = PFT_TMP(:)
                  ENDIF

                  PARTICLE_SLIDE = .FALSE.

 200           CONTINUE
            ENDDO ! DO IW = 1, NWALLS
      ENDDO !Loop over particles LL to calculate wall contact
!$acc end parallel
!$acc end data

When I comment out the line

DIST(:) = w_pos_l(LL,IW,:) - DES_POS_NEW(LL,:)

I get identical PFT array from both CPU and GPU runs.

I checked that the arrays DES_POS_NEW and W_POS_L remain same even when PFT differs.

Note that the code pasted above is a stripped down version of a part of the file model/des/calc_force_des.f from the MFIX code.

Best
Anirban

Hi Anirban,

I don’t see anything obvious. Though since you indicate that the issue might be due to DIST, I’d try manually privatizing it, as well as PFT_TMP, by adding an extra dimension indexed by the particle (i.e. DIST(1:MAX_PIP,1:DIMN)). It will also improve performance, since the data will then be accessed contiguously across threads.
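A compact sketch of that idea (a sketch only, reusing the names from your snippet, with the data clauses and inner logic omitted; in Fortran's column-major layout, indexing the first dimension by LL means consecutive iterations touch consecutive memory):

```fortran
! Temporaries expanded by the particle index: iteration LL writes only
! row LL, so gangs never share storage, and DIST(LL,:) for consecutive
! LL values lands in consecutive memory (first dimension is contiguous).
DOUBLE PRECISION, DIMENSION(:,:), ALLOCATABLE :: DIST, PFT_TMP
ALLOCATE( DIST(MAX_PIP,DIMN), PFT_TMP(MAX_PIP,DIMN) )

!$acc parallel loop gang private(NI, IW)
DO LL = 1, MAX_PIP
   DO IW = 1, NWALLS
      DIST(LL,:)    = W_POS_L(LL,IW,:) - DES_POS_NEW(LL,:)
      PFT_TMP(LL,:) = DIST(LL,:)
   ENDDO
ENDDO
```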

FYI, NWALLS and MAX_PIP don’t need to be copied, since scalars are implicitly firstprivate.

  • Mat

Hi Mat,
Thanks very much for the prompt feedback. Per your advice, I made the arrays DIST and PFT_TMP manually private.

 DOUBLE PRECISION, DIMENSION(:,:), ALLOCATABLE :: DIST, PFT_TMP
 ALLOCATE( DIST(MAX_PIP,DIMN) )
 ALLOCATE( PFT_TMP(MAX_PIP,DIMN) )
....
....
!$acc data copy(NWALLS, MAX_PIP, WALLDTSPLIT, WALLCONTACT(1:MAX_PIP,1:NWALLS), PEA(1:MAX_PIP,1:4), &
!$acc&          DES_POS_NEW(1:MAX_PIP,1:DIMN), W_POS_L(1:MAX_PIP,1:NWALLS,1:DIMN), PFT(1:MAX_PIP,1:MAXNEIGHBORS,1:DIMN), &
!$acc&          DIST(1:MAX_PIP,1:DIMN), PFT_TMP(1:MAX_PIP,1:DIMN) )
!$acc parallel
!$acc loop gang, private(LL, NI, IW )

      DO LL = 1, MAX_PIP
         IF(.NOT.PEA(LL,1) .OR. PEA(LL,4) ) CYCLE

         DO IW = 1, NWALLS

               IF(.NOT.WALLDTSPLIT .OR. PEA(LL,2) .OR. PEA(LL,3) .OR. WALLCONTACT(LL,IW).NE.1 ) GOTO 200

               NI=IW !Line added by AJ for debugging
               DIST(LL,:)=ZERO  !Line added by AJ for debugging
               DIST(LL,:) = w_pos_l(LL,IW,:) - DES_POS_NEW(LL,:)

! Save the tangential displacement history with the correction of Coulomb's law
                  PFT_TMP(LL,:)=DIST(LL,:)
                  IF (PARTICLE_SLIDE) THEN
                  ELSE
                     PFT(LL,NI,:) = PFT_TMP(LL,:)
                  ENDIF

                  PARTICLE_SLIDE = .FALSE.

 200           CONTINUE
            ENDDO ! DO IW = 1, NWALLS
      ENDDO !Loop over particles LL to calculate wall contact
!$acc end parallel
!$acc end data

There is still a small difference between the CPU and GPU results, but now the difference stayed the same across 4 repetitions of the exercise.

363566c363566
<    0.000E+00   0.000E+00   0.000E+00
---
>    0.000E+00   0.120E-02   0.000E+00

On MAX_PIP and NWALLS: I originally did not explicitly copy them, but I am doing so now just to be extra cautious. Does it hurt to do so? In fact, I would like them to be shared (a single copy on the GPU) rather than firstprivate.

I can upload my branch of MFIX for your perusal. It will be a great help if it can be tested at PGI.

Best
Anirban

Does it hurt to do so? In fact, I would like them to be shared (a single copy on the GPU) rather than firstprivate.

Putting scalars in a copy clause makes them global. firstprivate creates a local scalar in the kernel and increases the likelihood that it will be put into a register. It may not matter much given there’s only one reference, but it is one less global memory reference.
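The contrast could be sketched like this (a sketch; N and A are placeholder names, not identifiers from the code above):

```fortran
! Implicit: a scalar not named in any data clause is firstprivate --
! each thread gets its own copy of N, likely held in a register.
!$acc parallel loop
DO LL = 1, N
   A(LL) = A(LL) + 1.0D0
ENDDO

! Explicit copy: N lives in global device memory, and every reference
! to it inside the kernel is a global memory access.
!$acc data copy(N)
!$acc parallel loop
DO LL = 1, N
   A(LL) = A(LL) + 1.0D0
ENDDO
!$acc end data
```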

I can upload my branch of MFIX for your perusal. It will be a great help if it can be tested at PGI.

That should be fine. Once you have the port complete, it would be good to put the code into our QA testing. But before that, I can manually test the code. I’ll need to know the name of the CVS server, since the one listed in the docs is either wrong or only accessible within ORNL. Though let’s take any connection questions I may have offline.

  • Mat

Thanks much Mat.

Knowing the default treatment of scalars will definitely come in handy during the performance-tuning phase. I now recollect reading about it in another post on this forum.

It’s best to FTP you the branch of MFIX I am working with. You had sent me the following FTP site in an earlier post, and I was thinking of uploading there. Is that OK?

Best
Anirban

Hi Mat,
I uploaded a tarball named anirban_jana-pgi-test-2013-11-21.tgz to the pgroup FTP site. It has the code, a test case, and a readme file with my email address and an explanation of how to reproduce the issue.

Note that my compiler version is 13.3.

Thanks very much and looking forward to some insight from you
Best
Anirban

Hi Mat,
Just wondering if you have had a chance yet to look at the tarball I uploaded.

Of course, now it’s Thanksgiving. Have a great holiday.

Best
Anirban

Hi Anirban,

Sorry, I missed your post. I just sent a note to IT to grab the file for me, though with the holiday I probably won’t be able to look at it until next week.

  • Mat

Hi Mat,
Great! Looking forward to your insight next week. Have a great Thanksgiving.

I should also warn you that the tarball is ~150MB, but most of it is due to the restart files for the test case.

Anirban

Hi Anirban,

I was able to recreate the wrong answers when building with 13.3, but I get correct answers when using 13.9. Can you please update to 13.9 or later to see if you get correct answers as well?

Thanks,
Mat

Hi Mat,
You are right. We installed 13.10, and this particular error is resolved when building with it (identical CPU and GPU output in all 5 tries).

When trying to build and run the full code with 13.10, I am hitting a new issue (a cuEventSynchronize 700 error: launch failed). I’ll investigate this further, and start a new post on it if I cannot resolve it.

Thanks very much for your help.

Best
Anirban