Launch of the kernel

Dear all,
I’m a new user of OpenACC and I ran into some confusion while accelerating my code. The code was originally written in Fortran 90 with OpenMPI. To test the behavior of OpenACC, I added OpenACC directives to just one loop. Here is the relevant part of my code:

!$acc data copyin(Betaexp,DeltaT,gravity,phi1,xsize,xstart) &
!$acc copy(tb1)
!$acc kernels
      do k=1,xsize(3)
      do j=1,xsize(2)
      do i=1,xsize(1)
         tb1(i,j,k)=tb1(i,j,k)+Betaexp*DeltaT*gravity*phi1(xstart(1)-1+i,xstart(2)-1+j,xstart(3)-1+k)
      enddo
      enddo
      enddo
!$acc end kernels
!$acc end data

The outer loop is the time-step loop: I run the code for 1000 time steps, so the loop shown above is executed 1000 times. I used the PGI compiler, and here is the compilation output:

ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c decomp_2d.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c glassman.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c fft_generic.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c module_param.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c io.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c variables.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c poisson.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c schemes.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c convdiff.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c incompact3d.f90
PGF90-W-0155-Constant or Parameter used in data clause - betaexp (incompact3d.f90: 149)
PGF90-W-0155-Constant or Parameter used in data clause - deltat (incompact3d.f90: 149)
PGF90-W-0155-Constant or Parameter used in data clause - gravity (incompact3d.f90: 149)
0 inform, 3 warnings, 0 severes, 0 fatal for incompact3d
incompact3d:
149, Generating copy(tb1(:,:,:))
Generating copyin(xstart(:),xsize(:),phi1(:,:,:))
152, Loop is parallelizable
153, Loop is parallelizable
154, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
152, !$acc loop gang ! blockidx%y
153, !$acc loop gang, vector(4) ! blockidx%z threadidx%y
154, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c navier.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c filter.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c derive.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c parameters.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c tools.f90
ftn -acc -Minfo=accel -ta=tesla:cuda8.0 -cpp -DDOUBLE_PREC -c visu.f90
ftn -O3 -acc -ta=tesla:cuda8.0 -o incompact3d decomp_2d.o glassman.o fft_generic.o module_param.o io.o variables.o poisson.o schemes.o convdiff.o incompact3d.o navier.o filter.o derive.o parameters.o tools.o visu.o

When I submit the code to the cluster, I set PGI_ACC_TIME=1 and I got runtime summarization. Here is a part of it:

Accelerator Kernel Timing data
/scratch/snx3000/guow/Incompact3d_GPU/test3_copyin/source_code/incompact3d.f90
incompact3d NVIDIA devicenum=0
time(us): 1,000,835
149: data region reached 6000 times
149: data copyin transfers: 18000
device time(us): total=677,027 max=401 min=4 avg=37
160: data copyout transfers: 3000
device time(us): total=323,808 max=743 min=61 avg=107
151: compute region reached 3000 times
154: kernel launched 3000 times
grid: [4x21x8] block: [32x4]
elapsed time(us): total=410,046 max=2,449 min=47 avg=136

So my question is:

  1. Is this loop actually calculated on the GPU? In the runtime summary I don’t see any device time after the line “154: kernel launched 3000 times”; I only see the data being copied in and out, so I suspect the loop is never executed on the device.
  2. In the data directive I added (!$acc data copyin(Betaexp,DeltaT,gravity,phi1,xsize,xstart)), I also copy in Betaexp, DeltaT and gravity, which are constants I defined. However, the compiler seems to ignore those constants and only copies in xstart(:), xsize(:) and phi1(:,:,:) to the GPU (see the compilation output: Generating copyin(xstart(:),xsize(:),phi1(:,:,:))).
    So is it unnecessary to copy constants to the GPU?
  3. Without OpenACC directives, running on CPUs only, each time step takes 0.28 seconds. With the OpenACC directives added to the loop shown above, each time step takes 0.61 seconds. Since the time loop cannot be parallelized on the GPU, the loop above runs on the GPU once per time step, and the data movement costs far more than the gain from acceleration. So parallelizing such simple loops is actually a waste of time, and if I want to optimize my code I should use a profiler to find the most computationally heavy loops and parallelize those. Do I understand correctly?
    Any suggestions will be appreciated! Thanks in advance!
    Best regards,
    Wentao Guo

Hi Wentao Guo,

  1. Is this loop calculated on GPU?

Yes. In the PGI_ACC_TIME output you can see the kernel being launched 3000 times, with a CUDA launch configuration of a 4x21x8 grid and a 32x4 thread block. Each launch averages about 136 microseconds, for a total of roughly 0.41 seconds.

154: kernel launched 3000 times
grid: [4x21x8] block: [32x4]
elapsed time(us): total=410,046 max=2,449 min=47 avg=136

When you launched your application, did you see a warning that “libcupti.so” couldn’t be found? CUPTI is the device-side profiling library (it supplies the “device time” figures); without it, the profile only shows the host-side “elapsed time”. I’m not sure whether you simply didn’t copy the device time into your post or whether it didn’t appear in the profile at all.

So is it not necessary to copy constants to the GPU?

Correct. The front-end compiler replaces named constants with their literal values, so putting them in a data clause is unnecessary.

Note that in most cases, you do not need to copy any scalar to the GPU. Scalars are firstprivate by default, so they are copied by value to the GPU as part of the kernel arguments and stored in registers. By putting these scalars into a data region, you have made them shared and placed them in the GPU’s global memory. This can slow the code down a bit, since a pointer to each scalar is now passed to the kernel and every thread must fetch the scalar’s value from global memory.

There are a few cases where you’ll want to put scalars in a “private” clause, but unless you have a specific reason to make a scalar shared, it’s best practice to leave scalars out of data clauses.
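As a minimal sketch of this (reusing the loop from the question; the only change is dropping the scalars from the clauses), list just the arrays in the data directive and let Betaexp, DeltaT, and gravity default to firstprivate:

```fortran
! The scalars are intentionally absent from the data clauses below.
! OpenACC treats them as firstprivate: each is passed by value as a
! kernel argument and can live in a register, rather than being
! fetched from global memory by every thread.
!$acc data copyin(phi1,xsize,xstart) copy(tb1)
!$acc kernels
      do k=1,xsize(3)
      do j=1,xsize(2)
      do i=1,xsize(1)
         tb1(i,j,k)=tb1(i,j,k)+Betaexp*DeltaT*gravity*phi1(xstart(1)-1+i,xstart(2)-1+j,xstart(3)-1+k)
      enddo
      enddo
      enddo
!$acc end kernels
!$acc end data
```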

Do I understand correctly?

You’re off a bit. An initial slowdown when you first add compute directives is not uncommon; the culprit is the high cost of data movement. Here you’re creating and moving the data every time you enter this routine.

What I suggest users do is take a top-down approach with their data, and a bottom-up iterative approach with compute. In other words, use unstructured data regions to move your arrays to the device early in the program. Then start iteratively adding compute regions around your loops and use “update” directives to ensure memory coherence between the host and device. Don’t worry about performance at this point. Once you have offloaded your compute, you can start removing the unneeded update directives and minimize data movement. I’d do this incrementally making sure you get correct answers. If you do start getting incorrect results, you’ll know that the update was needed or that you have more compute to offload.

For example:

! someplace at the beginning of your program
!$acc enter data create(xsize,xstart,tb1,phi1)
...
! after xsize and xstart are initialized
!$acc update device(xsize,xstart)
...

! in your routine
!$acc update device(tb1,phi1)
!$acc kernels present(tb1,phi1,xsize,xstart)
      do k=1,xsize(3)
      do j=1,xsize(2)
      do i=1,xsize(1)
         tb1(i,j,k)=tb1(i,j,k)+Betaexp*DeltaT*gravity*phi1(xstart(1)-1+i,xstart(2)-1+j,xstart(3)-1+k)
      enddo
      enddo
      enddo
!$acc end kernels
!$acc update self(tb1)
...

! Someplace at the end of your program
!$acc exit data delete(xsize,xstart,tb1,phi1)

As you start offloading additional compute regions, you’ll be able to start removing or minimizing the updates.
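For instance, once the loops that produce phi1 and consume tb1 have also been offloaded (an assumption for this sketch), the per-step update pair around this loop can simply be deleted:

```fortran
! Hypothetical end state: the neighboring compute has been offloaded,
! so tb1 and phi1 stay resident on the device between time steps and
! no "update device" is needed before the loop.
!$acc kernels present(tb1,phi1,xsize,xstart)
      do k=1,xsize(3)
      do j=1,xsize(2)
      do i=1,xsize(1)
         tb1(i,j,k)=tb1(i,j,k)+Betaexp*DeltaT*gravity*phi1(xstart(1)-1+i,xstart(2)-1+j,xstart(3)-1+k)
      enddo
      enddo
      enddo
!$acc end kernels
! copy tb1 back only when the host actually needs it, e.g. before I/O
!$acc update self(tb1)
```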

Hope this helps,
Mat

Hi Mat,
Thanks for your patient reply!
The device time for launching the kernel didn’t appear. And I didn’t see the warning about “libcupti.so”.
Your answer helps a lot. Thanks again.
Best regards,
Wentao Guo

The device time for launching the kernel didn’t appear. And I didn’t see the warning about “libcupti.so”.

Hmm. The only time “device time” isn’t shown is when libcupti.so didn’t get loaded.

Can you try profiling using PGPROF or NVPROF instead of PGI_ACC_TIME?

Hi Mat. PGPROF and NVPROF cannot show the GPU details. I will check libcupti.so. Thanks!