PGI compiler much slower for a subroutine

So I have a subroutine that goes something like this.

do tmp_index = 1, mesh%Ninterior_faces/chunksize

   ! Gather the left/right solution flux values for this chunk of faces
   do c_indx = 1, chunksize
      do var = 1, Nvar
         do p_indx = 1, P1

            tmpi_left(p_indx, var, c_indx) = &
                data%uflux(left_indices_1 + (p_indx - 1), var, left)
            tmpi_rght(p_indx, var, c_indx) = &
                data%uflux(rght_indices_1 + (p_indx - 1), var, rght)

         end do
      end do
   end do

   ! Reorder the face points using the index map mesh%sfi
   do c_indx = 1, chunksize
      do var = 1, Nvar
         do p_indx = 1, P1

            sfi_left = mesh%sfi(p_indx, left_face_index, left)
            sfi_rght = mesh%sfi(p_indx, rght_face_index, rght)

            uflux_left(p_indx, var, c_indx) = tmpi_left(sfi_left, var, c_indx)
            uflux_rght(p_indx, var, c_indx) = tmpi_rght(sfi_rght, var, c_indx)

         end do
      end do
   end do

   call acc_evaluate_interaction_flux(uflux_left, uflux_rght, Fi, Fv)

   ! Scatter the interaction flux back into the global array
   do c_indx = 1, chunksize
      do var = 1, Nvar
         do p_indx = 1, P1

            data%Fi(left_indices_1 + (sfi_left - 1), var, left) = &
                Fi(p_indx, var, c_indx) + (Fv(p_indx, var, c_indx) + &
                ibeta_viscous*nx*Fv_dot_n)

         end do
      end do
   end do

end do

There are multiple subroutines in the code, and I have converted two of them to run on the GPU. The other converted subroutine runs much faster on the GPU than on the CPU, which is why I then tried to convert this one.

Now, the first problem. This particular subroutine runs much slower after compiling with gfortran than with PGI. It goes as much as 5 times slower.

The net result is that, unlike in the gfortran binary, this particular subroutine becomes the costliest subroutine in the PGI binary.

Therefore, even after putting the other subroutine on the GPU, the code as a whole is much slower because of this subroutine.

The other problem is that, although I can get this subroutine to go faster on the GPU after some tuning, it is only faster relative to the PGI run time; it is still much slower than the gfortran run time.
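To give an idea of what the GPU version looks like, a nest like the first gather loop is offloaded roughly as below. This is only a minimal OpenACC sketch: a plain uflux array stands in for data%uflux, and the data management in the actual code is more involved.

!$acc parallel loop collapse(3) copyin(uflux) copyout(tmpi_left, tmpi_rght)
do c_indx = 1, chunksize
   do var = 1, Nvar
      do p_indx = 1, P1
         ! same gather as the CPU version, run as one collapsed GPU kernel
         tmpi_left(p_indx, var, c_indx) = &
             uflux(left_indices_1 + (p_indx - 1), var, left)
         tmpi_rght(p_indx, var, c_indx) = &
             uflux(rght_indices_1 + (p_indx - 1), var, rght)
      end do
   end do
end do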

Hi vsingh96824,

This particular subroutine runs much slower after compiling with gfortran than with PGI. It goes as much as 5 times slower

As written this line indicates that gfortran is slower, but given the rest of the post, I’m assuming you meant that PGI is 5x slower than gfortran?

Typically PGI is faster than gfortran, and even in the cases where gfortran is faster, it is only by around 10%, not 5x. Hence, something seems off.

What flags are you using to compile?
Are you comparing CPU to CPU performance or CPU to GPU performance?
Have you profiled the code to see where the performance difference occurs?
Are both compilers executing the same code path?

  • Mat

Hi Mat,

Sorry for the typo. Yes, gfortran is faster by about 5 times for this particular subroutine only.

This comparison is CPU to CPU, using 20 OpenMP threads.

The exact same code is being compiled, one with PGI, one with gfortran.

I am using both compilers with debug mode, so

-g -mp for PGI

and

-g -fno-range-check -O1 -fbounds-check -fbacktrace -Wuninitialized -Wunused -ffpe-trap=invalid -finit-real=nan -fopenmp

for gfortran

I timed the code, so I know the performance difference for this subroutine. How would I profile it to determine why there’s such a large difference between the compilers?


If you want to check it out, the subroutine is get_interaction_flux at the following page.

https://bitbucket.org/vsingh001/deepfry/src/b744c86fd98aa8a72336fa0533cdc73b273713c3/src/cns/system.f90?at=accel&fileviewer=file-view-default

What’s the performance using optimization, such as “-fast -mp” for PGI and “-O3 -fopenmp” for GNU?

I’d also recommend setting the environment variable “MP_BIND=yes”.

Using “-g” without an optimization (-On) flag uses -O0, or no optimization at all. Even with -O2, -g will disable some optimization. If you do need debugging symbols in the binary but still want the code to be optimized, add the “-gopt” flag.
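For example, assuming your Makefile passes the flags straight through to the compilers, the optimized builds would use something like

-fast -mp -gopt for PGI

and

-O3 -fopenmp -g for gfortran

with “MP_BIND=yes” exported in the environment before the run.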

Changing the flags makes a negligible difference in either case, specifically for this subroutine. The other subroutines do get slightly faster.

Here’s what the timings look like for the PGI binary. The first number is the time and the second is the percentage of total time.

 ... Timings
        RHS Eval                            :  0.1912391186E+00
            Interpolation                   :  0.5394935608E-02     2.821
            Ucommon                         :  0.1595973969E-02     0.835
               Boundary Ucommon             :  0.4498958588E-03     0.235
            Gradients                       :  0.3165102005E-01    16.550
            Discontinuous Flux              :  0.8643150330E-02     4.520
            Discontinuous Flux              :  0.8138895035E-02     4.256
            Interaction Flux                :  0.8494186401E-01    44.417
               Boundary interaction flux    :  0.1007699966E-01     5.269
            Flux Divergence                 :  0.4641699791E-01    24.272

Here’s what they look like for gfortran

 ... Timings
        RHS Eval                            :  0.7886579633E-01
            Interpolation                   :  0.2599807456E-02     3.296
            Ucommon                         :  0.1370556653E-02     1.738
               Boundary Ucommon             :  0.1436527818E-03     0.182
            Gradients                       :  0.2036497183E-01    25.822
            Discontinuous Flux              :  0.3988858312E-02     5.058
            Discontinuous Flux              :  0.5238423124E-02     6.642
            Interaction Flux                :  0.1297307201E-01    16.450
               Boundary interaction flux    :  0.1285634935E-02     1.630
            Flux Divergence                 :  0.2927677147E-01    37.122

As you can see, the Interaction flux timing is way slower. Just to be sure this was not happening because the problem size was too small, I tried a much bigger problem size, and the difference remains.

I was able to build your application but am having trouble running it. It seems like I’m missing an HDF5 file (see below).

I noticed that you are compiling with “-pg”, which instruments your code with profiling counters and can have a performance impact. It could also account for why you’re not seeing any performance gains with optimization, if the time is dominated by the profiling overhead.

Here’s the runtime error I’m getting:

examples/CNS/euler_vortex% main.Linux.pgfortran.gomp.atlas.exe vortex.in
 OPENING MESHFILE irreg.hdf5
HDF5-DIAG: Error detected in HDF5 (1.8.14) thread 0:
  #000: H5F.c line 604 in H5Fopen(): unable to open file
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 990 in H5F_open(): unable to open file: time = Tue Aug 23 14:52:13 2016
, name = 'irreg.hdf5', tent_flags = 1
    major: File accessibilty
    minor: Unable to open file
  #002: H5FD.c line 992 in H5FD_open(): open failed
    major: Virtual File Layer
    minor: Unable to initialize object
  #003: H5FDsec2.c line 343 in H5FD_sec2_open(): unable to open file: name = 'irreg.hdf5', errno = 2, error message = 'No such file or directory', flags = 1, o_flags = 2
    major: File accessibilty
    minor: Unable to open file
main.Linux.pgfortran.gomp.atlas.exe: H5I.c:1111: H5I_get_type: Assertion `ret_value >= H5I_BADID && ret_value < H5I_next_type' failed.
Abort

Hi Mat,

Sorry about that. I created a new branch with the HDF5 file.

The source is now at

https://bitbucket.org/vsingh001/deepfry/src/4e0b3659273a?at=gTest

The folder is

examples/CNS/tgv

Compilation and running are done through

make COMP=PGI OMP=t DIM3=t
./main.Linux.PGI.debug.gomp.3d.exe input_16

I don’t get the -pg thing. I am not using it with the PGI compiler at all.

Ok, I was able to check out your code and run it on my local server.

It looks like your run time is dominated by the calls to DGEMM, which explains why compiler optimization, or even the choice of compiler, doesn’t matter. The BLAS library we ship is based on OpenBLAS and is quite fast, but maybe not as fast as the one you’re using with gfortran?

Can you try linking the PGI binary with the same BLAS library that you use with gfortran and see if that makes a difference?
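For example, if the gfortran build is using MKL, linking the PGI build against MKL’s single dynamic library would mean adding something like

-L${MKLROOT}/lib/intel64 -lmkl_rt -lpthread -lm -ldl

to the link line in place of the current BLAS flags (the path and library names here are only an illustration; adjust them for your install).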

  • Mat

Hi Mat,

Thanks for the comments.

The total time is indeed dominated by BLAS calls, but that is not the case for the Interaction flux subroutine, which is the one taking a significant chunk of time with the PGI build. There are no BLAS calls in this subroutine.

Also, I tried linking MKL with PGI for the BLAS part. There was almost no difference compared to the default BLAS. This is surprising, since the BLAS calls are much faster with gfortran and MKL.

In general, though, I am not concerned about the BLAS calls right now, because cuBLAS is much faster than MKL, so I can speed that part up on the GPU. But the Interaction flux subroutine is already much too slow to start with, despite having no BLAS calls.