Hi,
I tried to recreate the Jacobi iteration example by running the code from the NVIDIA Parallel Forall blog.
I used the Fortran file from the step 2 folder, but when I compile and run the code I get essentially the same time for the serial, OpenMP, and OpenACC versions.
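For context, I don't have the file open right now, but the core of laplace2d.f90 looks roughly like the minimal sketch below. This is written from memory, so the array bounds, boundary condition, and line numbers will not match the real file exactly (the real file also carries matching !$omp directives for the OpenMP build):

program jacobi_sketch
  implicit none
  integer, parameter :: n = 4096, m = 4096, iter_max = 1000
  real,    parameter :: tol = 1.0e-6
  real, allocatable  :: A(:,:), Anew(:,:)
  real    :: error
  integer :: i, j, iter

  allocate( A(0:n-1,0:m-1), Anew(0:n-1,0:m-1) )
  A    = 0.0
  Anew = 0.0
  A(0,:)    = 1.0   ! boundary condition along one edge (sketch only)
  Anew(0,:) = 1.0

  error = 1.0
  iter  = 0
  do while ( error > tol .and. iter < iter_max )
     error = 0.0

!$acc kernels
     do j = 1, m-2
        do i = 1, n-2
           ! Jacobi stencil: average of the four neighbours
           Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + &
                                A(i,j-1) + A(i,j+1) )
           error = max( error, abs(Anew(i,j) - A(i,j)) )
        end do
     end do
!$acc end kernels

!$acc kernels
     do j = 1, m-2
        do i = 1, n-2
           A(i,j) = Anew(i,j)   ! write the updated interior back into A
        end do
     end do
!$acc end kernels

     if ( mod(iter,100) == 0 ) print '(i5,f10.6)', iter, error
     iter = iter + 1
  end do

  deallocate( A, Anew )
end program jacobi_sketch

Here are the three builds and runs: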
-Serial
$ pgf90 -fast -Mpreprocess -o laplace2d_f90_cpu laplace2d.f90
$ ./laplace2d_f90_cpu
Jacobi relaxation Calculation: 4096 x 4096 mesh
0 0.250000
100 0.002397
200 0.001204
300 0.000804
400 0.000603
500 0.000483
600 0.000402
700 0.000345
800 0.000302
900 0.000269
completed in 26.306 seconds
-OpenMP
$ pgf90 -fast -mp -Minfo -Mpreprocess -o laplace2d_f90_omp laplace2d.f90
laplace:
43, Memory zero idiom, array assignment replaced by call to pgf90_mzero4
46, Loop not fused: dependence chain to sibling loop
4 loops fused
Generated vector simd code for the loop
Generated a prefetch instruction for the loop
50, Array assignment / Forall at line 51 fused
Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
Generated a prefetch instruction for the loop
63, Parallel region activated
64, Parallel loop activated with static block schedule
Loop not vectorized: may not be beneficial
Unrolled inner loop 8 times
Generated 8 prefetches in scalar loop
Generated 1 prefetches in scalar loop
67, Parallel region terminated
70, Parallel region activated
71, Parallel loop activated with static block schedule
Generated vector simd code for the loop
Generated a prefetch instruction for the loop
74, Parallel region terminated
81, Parallel region activated
83, Parallel loop activated with static block schedule
84, Generated vector simd code for the loop containing reductions
Generated 3 prefetch instructions for the loop
89, Begin critical section
End critical section
Parallel region terminated
96, Parallel region activated
98, Parallel loop activated with static block schedule
99, Memory copy idiom, loop replaced by call to __c_mcopy4
102, Parallel region terminated
$ ./laplace2d_f90_omp
Jacobi relaxation Calculation: 4096 x 4096 mesh
0 0.250000
100 0.002397
200 0.001204
300 0.000804
400 0.000603
500 0.000483
600 0.000402
700 0.000345
800 0.000302
900 0.000269
completed in 25.705 seconds
-OpenACC
$ pgf90 -acc -ta=nvidia -Minfo=accel -Mpreprocess -o laplace2d_f90_acc laplace2d.f90
laplace:
77, Generating copy(anew(:,:),a(:,:))
83, Loop is parallelizable
84, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
83, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
84, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
87, Generating implicit reduction(max:error)
98, Loop is parallelizable
99, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
98, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
99, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
$ ./laplace2d_f90_acc
Jacobi relaxation Calculation: 4096 x 4096 mesh
0 0.250000
100 0.002397
200 0.001204
300 0.000804
400 0.000603
500 0.000483
600 0.000402
700 0.000345
800 0.000302
900 0.000269
completed in 26.066 seconds
I can see in the NVIDIA X Server Settings that GPU utilization goes up to 100% while the OpenACC version is running.
According to the blog, the speedup should look something like this:
                         Execution Time (s)   Speedup vs. 1 CPU Thread   Speedup vs. 4 CPU Threads
CPU 1 thread                     34.14                  —                          —
CPU 4 threads (OpenMP)           21.16                  1.61x                      1.0x
GPU (OpenACC)                     9.02                  3.78x                      2.35x
I understand that the setup in the blog is different (and it was compiled with OpenACC 1.0), but I expected a comparable speedup.
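To put numbers on it: the blog gets 34.14 s / 9.02 s ≈ 3.78x for OpenACC versus one CPU thread, whereas my runs give 26.306 s / 26.066 s ≈ 1.01x, i.e. effectively no speedup at all.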
My system is Ubuntu 16.04 Linux with an Intel® Core™ i5-5250U CPU @ 1.60GHz × 4 and a GeForce 920M/PCIe/SSE2.
Thanks,
Alex