PGI compiler much slower for a subroutine

So I have a subroutine that goes something like this.

do tmp_index = 1, mesh%Ninterior_faces/chunksize

   ! Gather the left/right solution flux values for this chunk of faces
   do c_indx = 1, chunksize
      do var = 1, Nvar
         do p_indx = 1, P1

            tmpi_left(p_indx, var, c_indx) = &
                data%uflux(left_indices_1 + (p_indx - 1), var, left)
            tmpi_rght(p_indx, var, c_indx) = &
                data%uflux(rght_indices_1 + (p_indx - 1), var, rght)

         end do
      end do
   end do

   ! Reorder the face points using the index map mesh%sfi
   do c_indx = 1, chunksize
      do var = 1, Nvar
         do p_indx = 1, P1

            sfi_left = mesh%sfi(p_indx, left_face_index, left)
            sfi_rght = mesh%sfi(p_indx, rght_face_index, rght)

            uflux_left(p_indx, var, c_indx) = tmpi_left(sfi_left, var, c_indx)
            uflux_rght(p_indx, var, c_indx) = tmpi_rght(sfi_rght, var, c_indx)

         end do
      end do
   end do

   call acc_evaluate_interaction_flux(uflux_left, uflux_rght, Fi, Fv)

   ! Scatter the interaction flux back into the global array
   do c_indx = 1, chunksize
      do var = 1, Nvar
         do p_indx = 1, P1

            data%Fi(left_indices_1 + (sfi_left - 1), var, left) = &
                Fi(p_indx, var, c_indx) + (Fv(p_indx, var, c_indx) + &
                ibeta_viscous*nx*Fv_dot_n)

         end do
      end do
   end do

end do

There are multiple subroutines in the code, and I have converted two of them to run on the GPU. The other converted subroutine runs much faster on the GPU than on the CPU, which is why I then tried to convert this one.

Now, the first problem. This particular subroutine runs much slower after compiling with gfortran than with PGI. It goes as much as 5 times slower.

The net result is that, unlike in the gfortran binary, this particular subroutine becomes the costliest subroutine in the PGI binary.

Therefore, even after putting the other subroutine on the GPU, the code as a whole is much slower because of this subroutine.

The other problem is that, although I can get this subroutine to go faster on the GPU after some tuning, it is only faster relative to the PGI run time; it is still much slower than the gfortran run time.
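To give an idea of what the GPU version looks like, a nest like the first gather loop is offloaded roughly as below. This is only a minimal OpenACC sketch: a plain uflux array stands in for data%uflux, and the data management in the actual code is more involved.

!$acc parallel loop collapse(3) copyin(uflux) copyout(tmpi_left, tmpi_rght)
do c_indx = 1, chunksize
   do var = 1, Nvar
      do p_indx = 1, P1
         ! same gather as the CPU version, run as one collapsed GPU kernel
         tmpi_left(p_indx, var, c_indx) = &
             uflux(left_indices_1 + (p_indx - 1), var, left)
         tmpi_rght(p_indx, var, c_indx) = &
             uflux(rght_indices_1 + (p_indx - 1), var, rght)
      end do
   end do
end do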

Hi vsingh96824,

This particular subroutine runs much slower after compiling with gfortran than with PGI. It goes as much as 5 times slower

As written this line indicates that gfortran is slower, but given the rest of the post, I’m assuming you meant that PGI is 5x slower than gfortran?

Typically PGI is faster than gfortran, and even in the cases where gfortran is faster, it is only by around 10%, not 5x. Hence, something seems off.

What flags are you using to compile?
Are you comparing CPU to CPU performance or CPU to GPU performance?
Have you profiled the code to see where the performance difference occurs?
Are both compilers executing the same code path?

  • Mat

Hi Mat,

Sorry for the typo. Yes, gfortran is faster by about 5 times for this particular subroutine only.

This comparison is CPU to CPU, using 20 OpenMP threads.

The exact same code is being compiled, one with PGI, one with gfortran.

I am using both compilers with debug mode, so

-g -mp for PGI

and

-g -fno-range-check -O1 -fbounds-check -fbacktrace -Wuninitialized -Wunused -ffpe-trap=invalid -finit-real=nan -fopenmp

for gfortran

I timed the code, so I know the performance difference for this subroutine. How would I profile it to determine why there’s such a large difference between the compilers?


If you want to check it out, the subroutine is get_interaction_flux at the following page.

https://bitbucket.org/vsingh001/deepfry/src/b744c86fd98aa8a72336fa0533cdc73b273713c3/src/cns/system.f90?at=accel&fileviewer=file-view-default

What’s the performance using optimization, such as “-fast -mp” for PGI and “-O3 -fopenmp” for GNU?

I’d also recommend setting the environment variable “MP_BIND=yes”.

Using “-g” without an optimization (-On) flag uses -O0, or no optimization at all. Even with -O2, -g will disable some optimization. If you do need debugging symbols in the binary but still want the code to be optimized, add the “-gopt” flag.
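For example, assuming your Makefile passes the flags straight through to the compilers, the optimized builds would use something like

-fast -mp -gopt for PGI

and

-O3 -fopenmp -g for gfortran

with “MP_BIND=yes” exported in the environment before the run.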

Changing the flags makes a negligible difference in either case, specifically for this subroutine. The other subroutines do get slightly faster.

Here’s what the timings look like for the PGI binary. The first number is the time and the second is the percentage of total time.

 ... Timings
        RHS Eval                            :  0.1912391186E+00
            Interpolation                   :  0.5394935608E-02     2.821
            Ucommon                         :  0.1595973969E-02     0.835
               Boundary Ucommon             :  0.4498958588E-03     0.235
            Gradients                       :  0.3165102005E-01    16.550
            Discontinuous Flux              :  0.8643150330E-02     4.520
            Discontinuous Flux              :  0.8138895035E-02     4.256
            Interaction Flux                :  0.8494186401E-01    44.417
               Boundary interaction flux    :  0.1007699966E-01     5.269
            Flux Divergence                 :  0.4641699791E-01    24.272

Here’s what they look like for gfortran

 ... Timings
        RHS Eval                            :  0.7886579633E-01
            Interpolation                   :  0.2599807456E-02     3.296
            Ucommon                         :  0.1370556653E-02     1.738
               Boundary Ucommon             :  0.1436527818E-03     0.182
            Gradients                       :  0.2036497183E-01    25.822
            Discontinuous Flux              :  0.3988858312E-02     5.058
            Discontinuous Flux              :  0.5238423124E-02     6.642
            Interaction Flux                :  0.1297307201E-01    16.450
               Boundary interaction flux    :  0.1285634935E-02     1.630
            Flux Divergence                 :  0.2927677147E-01    37.122

As you can see, the Interaction flux timing is way slower. Just to be sure this was not happening because the problem size was too small, I tried a much bigger problem size, and the difference remains.

I was able to build your application but am having trouble running it. It seems like I’m missing an HDF5 file (see below).

I noticed that you are compiling with “-pg”, which instruments your code with profiling counters and can have a performance impact. It could also account for why you’re not seeing any performance gains with optimization, if the time is dominated by the profiling overhead.

Here’s the runtime error I’m getting:

examples/CNS/euler_vortex% main.Linux.pgfortran.gomp.atlas.exe vortex.in
 OPENING MESHFILE irreg.hdf5
HDF5-DIAG: Error detected in HDF5 (1.8.14) thread 0:
  #000: H5F.c line 604 in H5Fopen(): unable to open file
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 990 in H5F_open(): unable to open file: time = Tue Aug 23 14:52:13 2016
, name = 'irreg.hdf5', tent_flags = 1
    major: File accessibilty
    minor: Unable to open file
  #002: H5FD.c line 992 in H5FD_open(): open failed
    major: Virtual File Layer
    minor: Unable to initialize object
  #003: H5FDsec2.c line 343 in H5FD_sec2_open(): unable to open file: name = 'irreg.hdf5', errno = 2, error message = 'No such file or directory', flags = 1, o_flags = 2
    major: File accessibilty
    minor: Unable to open file
main.Linux.pgfortran.gomp.atlas.exe: H5I.c:1111: H5I_get_type: Assertion `ret_value >= H5I_BADID && ret_value < H5I_next_type' failed.
Abort

Hi Mat,

Sorry about that. I created a new branch with the HDF5 file.

The source is now at

https://bitbucket.org/vsingh001/deepfry/src/4e0b3659273a?at=gTest

The folder is

examples/CNS/tgv

Compilation and running are done through

make COMP=PGI OMP=t DIM3=t
./main.Linux.PGI.debug.gomp.3d.exe input_16

I don’t get the -pg thing. I am not using it with the PGI compiler at all.

Ok, I was able to check out your code and run it on my local server.

It looks like your run time is dominated by the calls to DGEMM, which explains why compiler optimization, or even the choice of compiler, doesn’t matter. The BLAS library we ship is based on OpenBLAS and is quite fast, but maybe not as fast as the one you’re using with gfortran?

Can you try linking the PGI binary with the same BLAS library that you use with gfortran and see if that makes a difference?
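For example, if the gfortran build is using MKL, linking the PGI build against MKL’s single dynamic library would mean adding something like

-L${MKLROOT}/lib/intel64 -lmkl_rt -lpthread -lm -ldl

to the link line in place of the current BLAS flags (the path and library names here are only an illustration; adjust them for your install).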

  • Mat

Hi Mat,

Thanks for the comments.

The total time is indeed dominated by BLAS calls, but that is not the case for the Interaction flux subroutine, which is the one taking a significant chunk of time with the PGI build. There are no BLAS calls in this subroutine.

Also, I tried linking MKL with PGI for the BLAS part. There was almost no difference compared to the default BLAS. This is surprising, since the BLAS calls are much faster with gfortran and MKL.

In general, though, I am not concerned about the BLAS calls right now, because cuBLAS is much faster than MKL, so I can speed that part up on the GPU. But the Interaction flux subroutine is already much too slow to start with, despite having no BLAS calls.