No array assignment replaced by call to pgf90_mcopy4 in 10.2

Hello,

I compiled a small accelerator application with PGI 10.2 (Fortran, on Linux, NVIDIA GeForce GT220) and found that it runs slower than when built with PGI 10.1.

I figured out that this is because of a whole-array assignment (inside an accelerator compute region) which PGI 10.1 replaced with an internal routine (“Memory copy idiom, array assignment replaced by call to pgf90_mcopy4”), whereas PGI 10.2 generates an extra kernel with a loop for it (“Loop is parallelizable, Accelerator kernel generated, 70, !$acc do parallel, vector(16)”). This extra kernel takes 20% of my application’s total GPU time, while the internal routine used only 3-5%.
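Schematically, the assignment in question is just a whole-array copy inside the compute region, roughly like this (uold and afU are the arrays in our code, which is posted in full further down):

!$acc region
                ! whole-array assignment: 10.1 turned this into a call to
                ! pgf90_mcopy4, while 10.2 generates its own copy kernel for it
                uold = afU
                ...
!$acc end region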

I also just flicked through the bug fixes in PGI 10.3, but couldn’t find anything about that issue.

Does anyone know anything about this?

Hi Xray,

While I don’t know specifics, I can guess as to what’s going on.

One complaint we have had about the Accelerator model is that it was too restrictive. If the programmer puts a region of code within the directives, the compiler should make its best attempt to offload that code to the GPU. (Previously it would exclude sections of code that might not benefit from acceleration.) In cases where idiom recognition would inhibit acceleration, the idiom recognition should be disabled. While I’m not positive, this sounds like a change that would have gone in around the 10.2 release.

What I would want to know is why this extra kernel is taking longer. Does the code need to perform an extra copy? If so, can you use data regions to keep the data on the GPU? Should this section of code be left on the host? Can you use the “host” clause to tell the compiler to keep it on the host, or add an “!$acc end region”/“!$acc region” pair before and after this section?
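For example, splitting the region around the assignment would look roughly like this (just a sketch of the idea, with placeholder arrays a and b):

!$acc region
    ! ... loops that benefit from the GPU ...
!$acc end region

    ! this assignment now runs on the host again
    a = b

!$acc region
    ! ... remaining loops ...
!$acc end region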

Granted, it could just be a bug. Feel free to send in a report to PGI Customer Service (trs@pgroup.com) and include sample code.

Hope this helps,
Mat

Hi,

I’m working with Xray on this problem. The code is a simple Jacobi solver example. We built versions with 10.1 and 10.2 and used NVIDIA’s CUDA profiler to see what’s going on. The main compute kernel (jacobi_72_gpu) takes much longer with 10.2, for reasons we don’t know, and 10.2 also generates the additional kernel plus additional memcpy calls. Here are the GPU times as reported by the profiler:

10.1:

Method             #calls    GPU usecs

jacobi_72_gpu          20       716924
jacobi_72_gpu_red      20      398.529
memcpyHtoD             62      80201.9
memcpyDtoH             21      39726.9

10.2:

Method             #calls    GPU usecs

jacobi_72_gpu          20  1.34378e+06
jacobi_66_gpu          20       258345
jacobi_72_gpu_red      20      398.432
memcpyHtoD             82      81118.1
memcpyDtoH             21      39487.6

This is the code:


!$acc data region local(uold) copyin(afF) copy(afU)
            do while (iIterCount < iIterMax .and. residual > fTolerance)
                residual = 0.0d0

!$acc region        

                ! Copy new solution into old
                uold = afU

!$acc do parallel private(j)
                  ! Compute stencil, residual, & update
                   do j = 1, iRows - 2
!$acc do private (i,fLRes) vector(256)
                       do i = 1, iCols - 2
                           ! Evaluate residual 
                           fLRes = (ax * (uold(i-1, j) + uold(i+1, j)) &
                                  + ay * (uold(i, j-1) + uold(i, j+1)) &
                                  + b * uold(i, j) - afF(i, j)) / b
                    
                           ! Update solution 
                           afU(i, j) = uold(i, j) - fRelax * fLRes
                    
                           ! Accumulate residual error
                           residual = residual + fLRes * fLRes
                       end do
                   end do
!$acc end region        

                 ! Error check 
                 iIterCount = iIterCount + 1      
                 residual = SQRT(residual) / REAL(iCols * iRows)
             
            ! End iteration loop 
            end do
!$acc end data region

So we are losing performance here when going from 10.1 to 10.2. Are we doing anything too strange for the compiler?

Thanks for your help!
Boris

Hi Boris,

The difference between the code generated by 10.1 and 10.2 is that in 10.1 the copy statement “uold = afU” was performed on the host, while in 10.2 it has been moved to the GPU. Also, in order not to get wrong answers, the 10.1 compiler needed to copy uold and afU between host and device for each iteration of the loop.

In other words, 10.2 is correctly matching what you have written. It just happens that the 10.1 method of performing the mcopy on the host was better for your code (I’m assuming the arrays are fairly small). To recreate the 10.1 behavior, try removing the “data region” and moving the copy out of the “acc region”:

!acc data region local(uold) copyin(afF) copy(afU)
            do while (iIterCount < iIterMax .and. residual > fTolerance)
                residual = 0.0d0

                ! Copy new solution into old
                uold = afU

!$acc region       
                  ! Compute stencil, residual, & update
                   do j = 1, iRows - 2
!$acc do vector(256)
                       do i = 1, iCols - 2
                           ! Evaluate residual
                           fLRes = (ax * (uold(i-1, j) + uold(i+1, j)) &
                                  + ay * (uold(i, j-1) + uold(i, j+1)) &
                                  + b * uold(i, j) - afF(i, j)) / b
                   
                           ! Update solution
                           afU(i, j) = uold(i, j) - fRelax * fLRes
                   
                           ! Accumulate residual error
                           residual = residual + fLRes * fLRes
                       end do
                   end do
!$acc end region       

                 ! Error check
                 iIterCount = iIterCount + 1     
                 residual = SQRT(residual) / REAL(iCols * iRows)
             
            ! End iteration loop
            end do
!acc end data region

Side note: scalar variables are private by default, so you don’t need the private clauses. They don’t hurt, but they’re not necessary.
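That is, the loop directives can keep the same schedule but drop the private clauses:

!$acc do parallel
                   do j = 1, iRows - 2
!$acc do vector(256)
                       do i = 1, iCols - 2
                          ...
                       end do
                   end do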

Hope this helps,
Mat

Hi Mat,

thanks for your help. I think I now understand what’s going on here, but I still can’t get the same performance with 10.2 that I had with 10.1. The matrices are 5000x5000 single precision, which makes them ~95 MiB each (5000 * 5000 * 4 bytes ≈ 95.4 MiB). I tried out your suggestions:

The original code reaches 3500 MFlops with 10.2 (4200 MFlops with 10.1). Removing the data region lowers the performance to 1300 MFlops. The profiler clearly shows the reason: the program now spends almost 70% of its time copying data between host and device memory, which is expected, because it now has to copy the matrices in every loop iteration. This is why we put the data region around the outer do-while loop in the first place.

Additionally moving the copy out of the compute region, which gives the code you posted, raises the performance slightly to ~1400 MFlops. This can be attributed to the additional copy kernel being removed and to fewer data copies between host and device (why is that?).

Leaving the data region in but moving the copy out of the compute region gives us around 3000 MFlops. As far as I understand, this should make 10.2 do the same thing 10.1 did with the original code, but we still see significantly lower performance. Comparing the profiles of this version under 10.2 and the original version under 10.1, the graphical “GPU time height” plot looks similar, but with 10.2 the main compute kernel (not the one created for the reduction) takes much longer than the data movement operations, whereas with 10.1 it is the other way around. The 10.1 kernel runs in around 720000 usecs, while the 10.2 build needs about twice as long. I don’t understand why: the compiler messages look identical, with the same loop schedules and the same size of cached references. Do you have an explanation for this?

Thanks again!
Boris

Short addendum: We’ve just installed 10.3 and I’ve used it to compile our original version and the version with the copy outside the compute region. The performance is no different than with 10.2.

Boris

Hi Boris,

Can you please send the full source to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me? I’ll need to see the full context to determine what else could be going on.

Thanks,
Mat

Done.

Hi xray and Boris,

It appears to me that the reduction code for ‘residual’ is taking twice as long and is what is causing the slowdown. I have sent a report (TPR#16728) to our engineers for further investigation.
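If you want to double-check this on your side, one quick experiment (timing-only, since it changes the results) is to comment out the accumulation in the inner loop and compare the kernel times:

                           ! Accumulate residual error
                           ! commented out only to time the kernel without the
                           ! reduction; the numerical results are wrong this way
                           !residual = residual + fLRes * fLRes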

Note that the 10.2 code for the “uold = afU” copy does appear to be much faster than 10.1’s. I see a significant speed-up with 10.1 when I change the code to match what 10.2 does:

!acc data region local(uold) copyin(afF) copy(afU)
            do while (iIterCount < iIterMax .and. residual > fTolerance)
                residual = 0.0d0

                ! Copy new solution into old
                !uold = afU 
!$acc region
!$acc do parallel
                  ! Compute stencil, residual, & update
                   do j = 0, iRows
!$acc do vector(256)
                       do i = 0, iCols
                          uold(i,j) = afU(i,j)
                       enddo
                    enddo
...

Thanks,
Mat

Hi xray and Boris,

Sorry for the late update on this one. In 10.4 we added back the relaxed divide when using the flag “-ta=nvidia,fastmath”. Using this flag should regain the lost performance. By default, we decided to keep the slower but more accurate division.
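For example (the source file name here is just a placeholder):

pgf90 -ta=nvidia,fastmath jacobi.f90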

Mat