Fortran90 / OpenACC / Multi-GPU Code Time Measurement

Hi everyone,

I’m trying to compare the execution time of my code between the CPU version and the GPU one. The code is written in Fortran 90 and compiled with mpif90 under NVHPC 23.1. The GPU acceleration is done with OpenACC, and it is a multi-GPU code where each CPU process is paired with a single GPU. In total I’m running with 2 CPU processes and 2 GPUs, so my launch command is:

mpirun -np 2 ./program

The full code is a complex CFD solver, so I’m reporting only the part that I’m accelerating: the kinetics section, where I compute the reaction rates of the species involved in the calculation. For now I’m only transferring data back and forth between CPUs and GPUs to estimate the cost of this data movement, so the kernel is deliberately simple: there is almost nothing inside the triple loop. The code is the following:

USE global_mod, ONLY: TFlame, NsMAX, num_zones, zones, MINi, MAXi, MINj, MAXj, MINk, MAXk
USE common_alloc
USE openacc
USE common_mpi!, ONLY: my_gpu, device_type, my_id

INTEGER, VALUE :: B, i, j, k, l, ll, s, icomp
!$acc declare create(B, i, j, k, l, ll, s, icomp)

!$acc declare create(OM(:), Yi_ijk(:), Xir(:))

DOUBLE PRECISION, VALUE :: T_ijk, p_ijk, dens, Rgst,S_y, Wmix_ijk, D_hp,rho_ijk
!$acc declare create(T_ijk, p_ijk, dens, Rgst,S_y, Wmix_ijk, D_hp,rho_ijk)

!$acc declare create(VREAZ(:))


!$acc declare create(Kchem, Kchem_B, PRODF, PRODB, KEQ)

!$acc declare create(NIF, NIB, NII)

!$acc declare create(SOMMAH,SOMMAS,TEM,app,Y(:))

DOUBLE PRECISION, VALUE :: tmpsum,tmpsum2
!$acc declare create(tmpsum,tmpsum2)

!$acc declare copyin(my_gpu)
!$acc declare create(MINi(:),MAXi(:),MINj(:),MAXj(:),MINk(:),MAXk(:))

integer(c_size_t) :: free_mem, total_mem
!$acc declare create(free_mem, total_mem)

DOUBLE PRECISION :: start_time, end_time, elapsed_time

start_time = MPI_WTIME()

omega = 0.
!$acc enter data copyin(omega(:,:,:,:))
!$acc enter data copyin(Yi(:,:,:,:),T(:,:,:),p(:,:,:),rho(:,:,:))

 !$acc update device(MINi,MAXi,MINj,MAXj,MINk,MAXk)
 !$acc update device(omega)
 !$acc loop seq 
 do k= MINk(BBB)-(Ghost-1), MAXk(BBB)+(Ghost-1)
  do j= MINj(BBB)-(Ghost-1), MAXj(BBB)+(Ghost-1)
   do i= MINi(BBB)-(Ghost-1), MAXi(BBB)+(Ghost-1)

    p_ijk  = p(i,j,k)
    T_ijk  = T(i,j,k)
    Yi_ijk = Yi(:,i,j,k)+ 1.0d-20

   end do ! loop over i
  end do ! loop over j
 end do ! loop over k
 !$acc update self(omega)
 !$acc exit data delete(p,T,Yi,rho,omega)

end_time = MPI_WTIME()
elapsed_time = end_time - start_time
if (my_id == 0) then
   print *, "Elapsed time (seconds): ", elapsed_time
end if


When running with GPUs I compile the entire code with the following flags:

mpif90 -r8 -acc=gpu,noautopar -target=gpu -gpu=cc80,managed -Mpreprocess -Mfree -Mextend -Munixlogical -Mbyteswapio -traceback -Mchkstk -Mnostack_arrays -Mnofprelaxed -Mnofpapprox -Minfo=accel

while when I run the code with the CPUs only I compile in this way:

mpif90 -r8 -Mpreprocess -Mfree -Mextend -Munixlogical -Mbyteswapio -traceback -Mchkstk -Mnostack_arrays -Mnofprelaxed -Mnofpapprox

The issue I found is that when I run the code (a 2D case on a 2000x2000 grid), the GPU version is faster than the CPU one! Since there is nothing inside the loop, the GPU version should be slower because of the data movement. The full code needs about 370 seconds per iteration; on the CPU, the part I’m timing lasts about 2.46 seconds, while on the GPU the overall time is still around 370 seconds (actually slightly less) but the triple loop lasts only 0.2 seconds. That shouldn’t be possible. In the code shown above no acceleration is actually performed, since the loop is marked !$acc loop seq, but I also tried !$acc parallel loop gang vector collapse(3) and the result is the same (the time is different, but always less than the CPU execution).

This is very strange, so I’m sure I did something wrong somewhere. Thank you all!


Have you used NVIDIA’s Nsight Systems to profile your code? That free tool lets you do a real deep dive into all the data transfers and OpenACC kernels that are going on. It looks like you’re using managed memory (i.e. -gpu=managed), in which case I would expect most of your data directives to not actually do anything, because managed memory essentially hands the data transfers off to the CUDA driver and runtime, which move data as needed. If you run without managed, I would expect a pretty different performance profile. It could easily be that there are fewer data transfers than you expect because “-gpu=managed” is on, and the arrays you’re interested in page fault quickly and easily to the GPU, possibly earlier in the code’s run.

I’m not sure what else I can say, performance-wise, without a profile or a reproducer. But as some problem-solving hints, I have a few ideas on top of profiling with Nsight Systems. You could also add the NVTX compiler flags and some NVTX ranges to your code, which would let Nsight profiles give you an idea of performance in specific regions for CPU versus GPU, without relying only on manual timers.

You could use NV_ACC_NOTIFY=2 to have the runtime dump out data transfers on the GPU runs. Combined with some print statements, this can also show you where data is actually being transferred, versus where you think it is (i.e. with managed memory there shouldn’t be many explicit data transfers, as we’d expect most things to page fault back and forth, but with explicit data management everything will work the way you think it does). Similarly, NV_ACC_NOTIFY=1 will confirm that the kernels are launching where and how you expect.
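For example, the variable can simply be set on the launch command (a sketch; NV_ACC_NOTIFY is a bitmask, so 3 reports both kernel launches and data transfers):

```
# 1 = kernel launches, 2 = data transfers, 3 = both; output goes to stderr
NV_ACC_NOTIFY=3 mpirun -np 2 ./program 2> acc_notify.log
```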

Also, you might check/verify that you’re getting the right answer in both CPU and GPU (managed and explicit) modes. So maybe in your inner loop, instead of just reading into p_ijk, you do p(i,j,k) = p(i,j,k) + 1 (and something similar for the others) and then check that the values are correct afterwards.
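A minimal sketch of that check, reusing the array names from the snippet above (the +1 would of course have to be undone, or the arrays re-read, before any real computation; the host would also need an update self for p and T):

```fortran
! Inside the triple loop: write to the arrays instead of only reading them,
! so a round trip through the device is actually observable
p(i,j,k) = p(i,j,k) + 1.0d0
T(i,j,k) = T(i,j,k) + 1.0d0

! After the loop and the copy back: every cell should have moved by exactly 1
if (my_id == 0) print *, 'p check:', minval(p), maxval(p)
```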

Someone else may come along, look at this, and know the right answer immediately, but from where I’m sitting I can’t give you a direct answer without more input. If you can carve your program down to a smaller reproducer that shows the issue you think you’re seeing, I can profile it and try to understand and explain the behavior. Absent that, I think I’ve recommended a lot of good problem-solving tools for explaining this yourself, and my highest recommendation is to profile the code and learn how to use Nsight Systems. It will make your life a lot easier with questions like this.


I’m not seeing any optimization flags being applied, so in the CPU case the default “-O” flag, i.e. low optimization, is being used. However, since OpenACC needs a higher level of compiler analysis, -acc implies “-O2”.

Hence what might be happening is that at -O2, the compiler is able to apply dead-code elimination: since the result variables aren’t actually used, it can remove the loop altogether.

Try adding “-O2” or better yet, “-Ofast”, to both compilations to see if the behavior changes.

I’d also use the values from “p_ijk”, “T_ijk”, and “Yi_ijk” someplace after the loop, like in a print statement, so the compiler doesn’t remove the loop.
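Along those lines, a minimal sketch of consuming the results after the loop so the compiler cannot prove them dead (variable names taken from the snippet above):

```fortran
! Use the loop results; a print is enough to defeat dead-code elimination
if (my_id == 0) then
   print *, 'DCE guard:', p_ijk, T_ijk, sum(Yi_ijk)
end if
```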


Hi @scamp1, thanks for all your suggestions. I haven’t profiled the code yet, but it will be the next thing I do; I’m waiting for the ICT support to install Nsight Systems on our cluster. In the meantime, I performed some tests removing the -gpu=managed flag, and now things are much better. Moreover, I realized that I was comparing the CPU code compiled with no optimizations against the GPU code compiled with -acc=gpu, which, as @MatColgrove said, implies -O2 optimization. With your suggestions it all works now, so I consider this issue solved from my point of view.

Just to give you some numbers, for the sake of completeness: on a 2D Cartesian grid of 2000x2000 cells, the results I obtain with the complete code (kinetics no longer commented out inside the triple-loop kernel) are the following:

CPU —> 85.3 seconds
GPU —> 91.6 seconds

Therefore, about 6.3 seconds are spent by the GPU version moving data (copyin + copyout).

Thank you again for your great patience, time, and support. I really appreciate it. @MatColgrove and @scamp1, I certainly owe you a special mention in my next peer-reviewed paper on this topic. I hope to engage in a great collaboration with you!!

I had another problem, with the use of the debugger when printing an array. The code is the same as in this topic, which is why I’m not opening another one. For now I’m just working with the CPU version of the code, but later I will do the same for the GPU one. I compile the CPU code (under NVHPC 23.1) with the following flags:

-g -gopt -r8 -acc=gpu -target=gpu,noautopar -gpu=cc80 -Mpreprocess -Mfree -Mextend -Munixlogical -Mbyteswapio -traceback -Mchkstk -Mnostack_arrays -Mnofprelaxed -Mnofpapprox -Minfo=accel

and I run the code with:

mpirun -np 2 xterm -e cuda-gdb ./program

After this, I set a breakpoint at a certain line where there is a call to kinetics. Then I run the code in both terminals and step into the kinetics subroutine. There I break again after Yi_ijk is set and try to print all the variables obtained so far. For p_ijk and T_ijk there is no problem, since they are scalars, but when I try to print an element of the Yi_ijk array the debugger gives me the following error:

No symbol “Yi_ijk” in current context.

@MatColgrove or @scamp1 do you have any idea for this issue?

Thank you again!!

This isn’t uncommon. In order to offload, the compiler must still apply optimization, even with -g, and optimized code can be more difficult to debug. “Yi_ijk” may have been placed in a register, so it no longer has a name in the debug information.

Therefore, there are 6.3 seconds that are used from the GPU to pass data (copyin + copyout).

Yes, copying data back and forth with each entry into the function can be expensive. If you can, you’ll want to move the data regions to a higher point in the code, try to move all computation on these arrays to the device, and then use the “update” directive to synchronize the data but only when necessary. The ideal is to create/copy the data to the device only once at the beginning of the program, and then only copy back the results at the end.
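A rough shape of that pattern, using the array names from the snippet above (the loop bounds imin/imax etc. stand in for the MINi/MAXi expressions, and where exactly the enter/exit data and update directives go depends on the rest of the solver):

```fortran
! Once, at start-up: place the fields on the device
!$acc enter data copyin(Yi, T, p, rho) create(omega)

! Every iteration: compute entirely on the device, no host copies
!$acc parallel loop gang vector collapse(3) default(present)
do k = kmin, kmax
 do j = jmin, jmax
  do i = imin, imax
   ! ... kinetics ...
  end do
 end do
end do

! Only when the host actually needs the result:
!$acc update self(omega)

! Once, at shutdown:
!$acc exit data delete(Yi, T, p, rho, omega)
```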


Building off what Mat said: when you profile, profile in both non-managed and -gpu=managed mode. The non-managed profile will let you see where you might hoist data transfers up the stack to optimize performance, while seeing and understanding the difference between the managed and non-managed approaches and programming models can be helpful to your future work.

In managed mode, a lot of the data transfers and synchronizations that you have clearly spent time programming into your code become unnecessary. Also, we have recently introduced -gpu=unified for compatible systems, a further upgrade over -gpu=managed that makes it even easier to use (you need NVHPC 23.11 or later for unified, plus a compatible system). It may not be something your code or infrastructure is ready for now, but it could provide a big boost later on if you move to a compatible system, and it could greatly simplify future GPU porting efforts.

With regards to the debugging call - I defer to Mat on that. I’m not very experienced in that topic.
