Check performance

Hi,

I tried to recreate the Jacobi iteration example by running the code from the NVIDIA Parallel Forall blog:
https://github.com/parallel-forall/code-samples/tree/master/posts/002-openacc-example

I used the Fortran file from the step 2 folder, but when I compiled and ran the code I got essentially the same time for the OpenMP, OpenACC, and serial versions.
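
For reference, the core of the iteration looks roughly like this (my paraphrase of the blog code, not the exact contents of laplace2d.f90; the boundary setup and constants are simplified):

program jacobi_sketch
  ! Minimal sketch of the Jacobi relaxation from the blog post.
  implicit none
  integer, parameter :: n = 4096, m = 4096, iter_max = 1000
  real,    parameter :: tol = 1.0e-6
  real, allocatable  :: A(:,:), Anew(:,:)
  real    :: error
  integer :: i, j, iter

  allocate(A(0:n-1,0:m-1), Anew(0:n-1,0:m-1))
  A      = 0.0
  Anew   = 0.0
  A(0,:) = 1.0          ! simplified boundary condition, for illustration only
  error  = 1.0
  iter   = 0

  ! Keep A and Anew resident on the GPU for the whole iteration loop.
  !$acc data copy(A, Anew)
  do while ( error > tol .and. iter < iter_max )
     error = 0.0

     ! Stencil update; the compiler generates an implicit max reduction on error.
     !$acc kernels
     do j = 1, m-2
        do i = 1, n-2
           Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
           error = max( error, abs(Anew(i,j) - A(i,j)) )
        end do
     end do
     !$acc end kernels

     ! Copy the new values back into A for the next sweep.
     !$acc kernels
     do j = 1, m-2
        do i = 1, n-2
           A(i,j) = Anew(i,j)
        end do
     end do
     !$acc end kernels

     if (mod(iter,100) == 0) print '(i5,f10.6)', iter, error
     iter = iter + 1
  end do
  !$acc end data

  deallocate(A, Anew)
end program jacobi_sketch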
-Serial

$ pgf90 -fast -Mpreprocess -o laplace2d_f90_cpu laplace2d.f90
$ ./laplace2d_f90_cpu 
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0  0.250000
  100  0.002397
  200  0.001204
  300  0.000804
  400  0.000603
  500  0.000483
  600  0.000402
  700  0.000345
  800  0.000302
  900  0.000269
 completed in     26.306 seconds

-OpenMP

$ pgf90  -fast -mp -Minfo -Mpreprocess -o laplace2d_f90_omp laplace2d.f90
laplace:
     43, Memory zero idiom, array assignment replaced by call to pgf90_mzero4
     46, Loop not fused: dependence chain to sibling loop
         4 loops fused
         Generated vector simd code for the loop
         Generated a prefetch instruction for the loop
     50, Array assignment / Forall at line 51 fused
         Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop
         Generated a prefetch instruction for the loop
     63, Parallel region activated
     64, Parallel loop activated with static block schedule
         Loop not vectorized: may not be beneficial
         Unrolled inner loop 8 times
         Generated 8 prefetches in scalar loop
         Generated 1 prefetches in scalar loop
     67, Parallel region terminated
     70, Parallel region activated
     71, Parallel loop activated with static block schedule
         Generated vector simd code for the loop
         Generated a prefetch instruction for the loop
     74, Parallel region terminated
     81, Parallel region activated
     83, Parallel loop activated with static block schedule
     84, Generated vector simd code for the loop containing reductions
         Generated 3 prefetch instructions for the loop
     89, Begin critical section
         End critical section
         Parallel region terminated
     96, Parallel region activated
     98, Parallel loop activated with static block schedule
     99, Memory copy idiom, loop replaced by call to __c_mcopy4
    102, Parallel region terminated
$ ./laplace2d_f90_omp
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0  0.250000
  100  0.002397
  200  0.001204
  300  0.000804
  400  0.000603
  500  0.000483
  600  0.000402
  700  0.000345
  800  0.000302
  900  0.000269
 completed in     25.705 seconds

-OpenACC

$ pgf90 -acc -ta=nvidia -Minfo=accel -Mpreprocess -o laplace2d_f90_acc laplace2d.f90
laplace:
     77, Generating copy(anew(:,:),a(:,:))
     83, Loop is parallelizable
     84, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         83, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
         84, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
         87, Generating implicit reduction(max:error)
     98, Loop is parallelizable
     99, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         98, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
         99, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
$ ./laplace2d_f90_acc 
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0  0.250000
  100  0.002397
  200  0.001204
  300  0.000804
  400  0.000603
  500  0.000483
  600  0.000402
  700  0.000345
  800  0.000302
  900  0.000269
 completed in     26.066 seconds

I can see in NVIDIA X Server Settings that GPU utilization goes up to 100% while the OpenACC version is running.
The speedup according to the blog should be something like this:
                          Execution Time (s)   Speedup vs. 1 CPU Thread   Speedup vs. 4 CPU Threads
CPU 1 thread                    34.14                     —                           —
CPU 4 threads (OpenMP)          21.16                   1.61x                       1.0x
GPU (OpenACC)                    9.02                   3.78x                       2.35x

I understand that the setup in the blog is different (it was also compiled with OpenACC 1.0), but I thought I should see a comparable speedup.

I use Ubuntu 16.04 Linux, an Intel® Core™ i5-5250U CPU @ 1.60GHz × 4, and a GeForce 920M/PCIe/SSE2.

Thanks,
Alex

Hi Alex,

I’m thinking that the OpenACC version isn’t actually running on the GPU. I just tried the code on a P100 and it runs in under a second versus 25 seconds on the CPU.

What type of device do you have and which version of the compilers are you using?

With the more recent PGI 2017 compiler releases, “-ta=tesla” (the new name for “-ta=nvidia”, though “-ta=nvidia” is still fine to use) defaults to CUDA 7.5 and targets the Fermi, Kepler, and Maxwell architectures. Since Pascal requires CUDA 8.0, it is not included unless you specify “-ta=tesla:cuda8.0” and/or “-ta=tesla:cc60”. So if you have a Pascal device, this would explain the issue.

You can tell if the binary actually ran on the GPU by setting the environment variable PGI_ACC_NOTIFY=1, which will show you all the kernel launches, or PGI_ACC_TIME=1, which will give a simple profile of the GPU code at the end of the run. If these do not produce any output, you’ll know that you’re not running on the GPU.
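
For example, from a bash shell (this is just the usual way to set the variables; csh/tcsh users would use setenv instead):

$ export PGI_ACC_NOTIFY=1
$ ./laplace2d_f90_acc        # one line is printed per kernel launch if the GPU is used
$ unset PGI_ACC_NOTIFY
$ export PGI_ACC_TIME=1
$ ./laplace2d_f90_acc        # prints an accelerator timing summary when the run finishes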

Of course, if PGI_ACC_TIME does show that you’re running on the GPU and you still get the poor time, please post the output so I can see where the poor performance is coming from.

-Mat

% pgfortran -fast laplace2d.F90 -o laplace2d_f90_cpu
% pgfortran -fast -ta=tesla:cc60 laplace2d.F90 -o laplace2d_f90_acc
% laplace2d_f90_acc
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0  0.250000
  100  0.002397
  200  0.001204
  300  0.000804
  400  0.000603
  500  0.000483
  600  0.000402
  700  0.000345
  800  0.000302
  900  0.000269
 completed in      0.839 seconds
% laplace2d_f90_cpu
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0  0.250000
  100  0.002397
  200  0.001204
  300  0.000804
  400  0.000603
  500  0.000483
  600  0.000402
  700  0.000345
  800  0.000302
  900  0.000269
 completed in     24.544 seconds

Hi Mat,

I use a Dell 5558 laptop and the PGI 17.4 Community Edition compiler.
The NVIDIA GeForce 920M has the Kepler architecture; below is the info from pgaccelinfo:

CUDA Driver Version:           8000
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  375.66  Mon May  1 15:29:16 PDT 2017

Device Number:                 0
Device Name:                   GeForce 920M
Device Revision Number:        3.5
Global Memory Size:            2101542912
Number of Multiprocessors:     2
Number of SP Cores:            384
Number of DP Cores:            128
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    954 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             900 MHz
Memory Bus Width:              64 bits
L2 Cache Size:                 524288 bytes
Max Threads Per SMP:           2048
Async Engines:                 1
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc35

This is the output with PGI_ACC_TIME set:

./gpu
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0  0.250000
  100  0.002397
  200  0.001204
  300  0.000804
  400  0.000603
  500  0.000483
  600  0.000402
  700  0.000345
  800  0.000302
  900  0.000269
 completed in     26.706 seconds

Accelerator Kernel Timing data
/home/alex/Desktop/samples/laplace/laplace2d.f90
  laplace  NVIDIA  devicenum=0
    time(us): 26,194,081
    77: data region reached 2 times
        77: data copyin transfers: 8
             device time(us): total=86,127 max=10,791 min=10,713 avg=10,765
        107: data copyout transfers: 10
             device time(us): total=80,317 max=10,056 min=17 avg=8,031
    82: compute region reached 1000 times
        82: data copyin transfers: 1000
             device time(us): total=7,518 max=83 min=3 avg=7
        84: kernel launched 1000 times
            grid: [128x1024]  block: [32x4]
             device time(us): total=14,574,062 max=25,688 min=13,893 avg=14,574
            elapsed time(us): total=14,689,060 max=30,389 min=13,912 avg=14,689
        84: reduction kernel launched 1000 times
            grid: [1]  block: [256]
             device time(us): total=292,993 max=308 min=290 avg=292
            elapsed time(us): total=385,000 max=4,279 min=307 avg=385
        84: data copyout transfers: 1000
             device time(us): total=256,249 max=2,648 min=13 avg=256
    97: compute region reached 1000 times
        99: kernel launched 1000 times
            grid: [128x1024]  block: [32x4]
             device time(us): total=10,896,815 max=21,066 min=10,399 avg=10,896
            elapsed time(us): total=10,985,253 max=23,325 min=10,418 avg=10,985

Could the NVIDIA Optimus technology be the reason for the delay?
Thank you very much,
Alex

While it’s possible that it has something to do with Optimus, I’ve never used it, so I don’t know what impact it would have.

More likely, I would guess it’s your card. It only has 2 multiprocessors and a slower clock rate (954 MHz), so it is going to take considerably longer to process. In contrast, while Mark’s device in the blog was an older architecture (he used a Tesla M2090), it has 16 multiprocessors at a clock rate of 1301 MHz. My P100 has 56 at 1328 MHz. Also, the code is compute bound, so it will run faster with more resources.

The closest device that I have to yours is a GTX 970 with 8 multiprocessors running at 1019 MHz which runs the laplace acc example in 3.31 seconds.

Granted, the number of multiprocessors and the clock speed don’t account for all of the difference, since a simple estimate would put your time at around 16 seconds, not 26, but the card is definitely a factor.
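
(As a very rough back-of-the-envelope scaling, assuming you simply scale by CUDA cores times clock rate using the blog’s M2090 and your 920M: 9.02 s × (512 cores × 1301 MHz) / (384 cores × 954 MHz) ≈ 16 s. This ignores memory bandwidth and architectural differences, so treat it only as a ballpark.)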

The good news is that, while not fast, the code does still run, so you should be able to develop on this system. Though when it comes time for production use, I’d definitely recommend investing in a higher-end card such as a Tesla P100.

-Mat

Okay, I’m going to try the Laplace example on a desktop computer with a GTX 1050 tomorrow. If I find anything new about the 920M GPU, I will post again.

Thanks for your help, Mat.
Alex