Runtime problem with pgfortran and OpenACC

We have an HPC server (Apollo XL190r Gen9) equipped with one Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz and two NVIDIA K40 accelerators, which we use for parallel programming.
We compile with pgfortran.

We compiled two versions of a Fortran code, one with acc directives and one without, but the sequential version runs faster than the parallel one (see below for details).

Without acc directives:

[instm@localhost step1]$ pgfortran -acc -ta=nvidia -Minfo=accel laplace2d.f90 -o lpc
[instm@localhost step1]$ time ./lpc
Jacobi relaxation Calculation: 4096 x 4096 mesh
0 0.250000
100 0.002397
200 0.001204
300 0.000804
400 0.000603
500 0.000483
600 0.000403
700 0.000345
800 0.000302
900 0.000269
completed in 53.059 seconds

real 0m53.094s
user 0m53.035s
sys 0m0.053s

With acc directives:

[instm@localhost step1]$ pgfortran -acc -ta=nvidia -Minfo=accel laplace2d.f90 -o lpc
laplace:
75, Generating implicit copyout(anew(1:4094,1:4094))
Generating implicit copyin(a(0:4095,0:4095))
76, Loop is parallelizable
77, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
76, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
77, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
80, Generating implicit reduction(max:error)
90, Generating implicit copyin(anew(1:4094,1:4094))
Generating implicit copyout(a(1:4094,1:4094))
91, Loop is parallelizable
92, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
91, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
92, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
[instm@localhost step1]$ time ./lpc
Jacobi relaxation Calculation: 4096 x 4096 mesh
0 0.250000
100 0.002397
200 0.001204
300 0.000804
400 0.000603
500 0.000483
600 0.000403
700 0.000345
800 0.000302
900 0.000269
completed in 84.195 seconds

real 1m24.346s
user 1m17.734s
sys 0m6.616s
[instm@localhost step1]$

You appear to be working through a pretty standard educational code sequence that I am familiar with.

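This is the expected outcome for the initial port of the Jacobi iteration loop. As your -Minfo=accel output shows, the compiler generates implicit copyin/copyout of a and anew around each of the two kernels (source lines 75 and 90), so both arrays cross between host and device on every iteration of the while loop. That transfer time outweighs what the GPU saves on the computation.

You need to continue the exercise and add the data directives, so that the arrays stay on the device for the whole while loop instead of being copied on every iteration.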
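
A minimal sketch of what that next step might look like, assuming the two loop nests are wrapped in kernels regions as the compiler messages suggest. This is not your laplace2d.f90: the program name, the smaller grid size, the iteration limit, and the boundary condition are all assumptions chosen for illustration. The point is the `!$acc data` region placed around the while loop, which moves `a` to the device once before the loop and back once after it.

```fortran
! Illustrative sketch only: names, grid size and boundary values are
! assumptions, not taken from the original laplace2d.f90.
program laplace_data_sketch
  implicit none
  integer, parameter :: n = 1024, m = 1024, iter_max = 1000
  real, parameter :: tol = 1.0e-6
  real, allocatable :: a(:,:), anew(:,:)
  real :: error
  integer :: i, j, iter

  allocate(a(0:n-1,0:m-1), anew(0:n-1,0:m-1))
  a = 0.0
  anew = 0.0
  a(0,:) = 1.0        ! assumed boundary condition on one edge
  anew(0,:) = 1.0

  error = 1.0
  iter = 0

  ! Data region: a is copied to the device here and back at "end data";
  ! anew is only allocated on the device and never transferred.
  !$acc data copy(a) create(anew)
  do while (error > tol .and. iter < iter_max)
    error = 0.0

    ! Jacobi update: the arrays are already present on the device,
    ! so no per-iteration array copies are needed for this region.
    !$acc kernels
    do j = 1, m-2
      do i = 1, n-2
        anew(i,j) = 0.25 * (a(i+1,j) + a(i-1,j) + a(i,j-1) + a(i,j+1))
        error = max(error, abs(anew(i,j) - a(i,j)))
      end do
    end do
    !$acc end kernels

    ! Copy the new interior values back into a, still on the device.
    !$acc kernels
    do j = 1, m-2
      do i = 1, n-2
        a(i,j) = anew(i,j)
      end do
    end do
    !$acc end kernels

    if (mod(iter,100) == 0) write(*,'(i5,f10.6)') iter, error
    iter = iter + 1
  end do
  !$acc end data

  deallocate(a, anew)
end program laplace_data_sketch
```

If you rebuild with the same flags as before (`-acc -ta=nvidia -Minfo=accel`), the implicit copyin/copyout messages for the two kernels should no longer appear, since the arrays are covered by the enclosing data region. Only the scalar `error` still has to come back to the host each iteration for the while-loop test.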