simple multi-gpu test program not working

I wrote a short Fortran program using OMP + ACC. All it does is set a(i)=i in parallel, so the array should read 1, 2, 3, 4…

The entire code and my compile command:
pgfortran -Minfo -mp -acc -o test test.f

      program test
      use OMP_LIB

      integer myid,i,N,chunk
      integer a(1:100)

      N = size(a)
      chunk=N/2       ! hardcoded for 2 OMP threads

      call omp_set_num_threads(2)

!$OMP PARALLEL PRIVATE(myid) SHARED(a)
      myid = OMP_GET_THREAD_NUM()
      call acc_set_device_num(myid,acc_device_nvidia)

!$acc kernels do 
      do i=myid*chunk+1,myid*chunk+chunk   ! 0th thread does first half
         a(i)=i
      enddo
!$OMP END PARALLEL

      end

At first I thought it was working correctly, because the array a has the expected values and the compiler output seemed okay. However, setting PGI_ACC_TIME=1 shows:

Accelerator Kernel Timing data
/home/ben/scratch/test.f
  test  thread=0  NVIDIA  devicenum=0
    time(us): 84
    16: compute region reached 1 time
        17: kernel launched 2 times
            grid: [1]  block: [64]
             device time(us): total=22 max=16 min=6 avg=11
            elapsed time(us): total=350 max=327 min=23 avg=175
        20: data copyout reached 2 times
             device time(us): total=62 max=43 min=0 avg=31
/home/ben/scratch/test.f
  test  thread=1  NVIDIA  devicenum=0
    time(us): 0
    16: compute region reached 1 time

or occasionally:

Accelerator Kernel Timing data
/home/ben/scratch/test.f
  test  thread=0  NVIDIA  devicenum=0
    time(us): 55
    16: compute region reached 1 time
        17: kernel launched 1 time
            grid: [1]  block: [64]
             device time(us): total=32 max=32 min=32 avg=32
            elapsed time(us): total=49 max=49 min=49 avg=49
        20: data copyout reached 1 time
             device time(us): total=23 max=23 min=23 avg=23
/home/ben/scratch/test.f
  test  thread=1  NVIDIA  devicenum=0
    time(us): 22
    16: compute region reached 1 time
        17: kernel launched 1 time
            grid: [1]  block: [64]
             device time(us): total=11 max=11 min=11 avg=11
            elapsed time(us): total=18 max=18 min=18 avg=18
        20: data copyout reached 1 time
             device time(us): total=11 max=11 min=11 avg=11

So it seems I am only using one of the two GPUs, since both threads show devicenum=0.

Compiler accelerator info:

test:
     12, Parallel region activated
     16, Generating present_or_copyout(a(myid*50+1:myid*50+50))
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     17, Loop is parallelizable
         Accelerator kernel generated
         17, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
     20, Parallel region terminated

Any idea why I am (apparently) using only one GPU?

The answer is simple, though it is odd to me that it didn’t throw an error or warning. Try adding “use openacc” at the top of your code.

Without it (and with PGI_ACC_NOTIFY=3):

$ ./test
launch CUDA kernel  file=/home/mathomp4/F90Files/OMP-ACC/test.f function=test line=18 device=0 grid=1 block=64
download CUDA data  file=/home/mathomp4/F90Files/OMP-ACC/test.f function=test line=21 device=0 variable=a bytes=400
launch CUDA kernel  file=/home/mathomp4/F90Files/OMP-ACC/test.f function=test line=18 device=0 grid=1 block=64
download CUDA data  file=/home/mathomp4/F90Files/OMP-ACC/test.f function=test line=21 device=0 variable=a bytes=400

Accelerator Kernel Timing data
/home/mathomp4/F90Files/OMP-ACC/test.f
  test  thread=0  NVIDIA  devicenum=0
    time(us): 99
    17: compute region reached 1 time
        18: kernel launched 1 time
            grid: [1]  block: [64]
             device time(us): total=43 max=43 min=43 avg=43
            elapsed time(us): total=60 max=60 min=60 avg=60
        21: data copyout reached 1 time
             device time(us): total=56 max=56 min=56 avg=56
/home/mathomp4/F90Files/OMP-ACC/test.f
  test  thread=1  NVIDIA  devicenum=0
    time(us): 110
    17: compute region reached 1 time
        18: kernel launched 1 time
            grid: [1]  block: [64]
             device time(us): total=76 max=76 min=76 avg=76
            elapsed time(us): total=93 max=93 min=93 avg=93
        21: data copyout reached 1 time
             device time(us): total=34 max=34 min=34 avg=34

With ‘use openacc’:

$ ./test
launch CUDA kernel  file=/home/mathomp4/F90Files/OMP-ACC/test.f function=test line=18 device=1 grid=1 block=64
download CUDA data  file=/home/mathomp4/F90Files/OMP-ACC/test.f function=test line=21 device=1 variable=a bytes=400
launch CUDA kernel  file=/home/mathomp4/F90Files/OMP-ACC/test.f function=test line=18 device=0 grid=1 block=64
download CUDA data  file=/home/mathomp4/F90Files/OMP-ACC/test.f function=test line=21 device=0 variable=a bytes=400

Accelerator Kernel Timing data
/home/mathomp4/F90Files/OMP-ACC/test.f
  test  thread=0  NVIDIA  devicenum=0
    time(us): 89
    17: compute region reached 1 time
        18: kernel launched 1 time
            grid: [1]  block: [64]
             device time(us): total=46 max=46 min=46 avg=46
            elapsed time(us): total=64 max=64 min=64 avg=64
        21: data copyout reached 1 time
             device time(us): total=43 max=43 min=43 avg=43
/home/mathomp4/F90Files/OMP-ACC/test.f
  test  thread=1  NVIDIA  devicenum=1
    time(us): 80
    17: compute region reached 1 time
        18: kernel launched 1 time
            grid: [1]  block: [64]
             device time(us): total=45 max=45 min=45 avg=45
            elapsed time(us): total=62 max=62 min=62 avg=62
        21: data copyout reached 1 time
             device time(us): total=35 max=35 min=35 avg=35

I guess my question now is, what is the “correct” behavior of a program like this? Without ‘use openacc’ it definitely compiled and ran, just not as expected. If ‘use openacc’ is necessary for the program to run correctly, shouldn’t the compiler warn/error? Or is it running “correctly” in each case and is it caveat programmer?

Matt

Hi Matt, Ben,

Or is it running “correctly” in each case and is it caveat programmer?

Blame Fortran implicit typing. Without “use openacc”, the variable “acc_device_nvidia” is implicitly declared as a REAL but has an undefined value, so acc_set_device_num receives garbage. Perfectly legal Fortran code, just wrong. Adding “implicit none” would have found this problem:
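As a minimal illustration of the rule (a standalone sketch, not from Ben’s program): without “implicit none”, any undeclared name whose first letter is in the range i–n is typed INTEGER, and anything else is typed REAL, with an undefined initial value.

```fortran
      program implicit_demo
c     No "implicit none" here: undeclared names get a type from
c     their first letter (i-n -> INTEGER, otherwise REAL).
      ival = 3              ! implicitly INTEGER
      rval = 2.5            ! implicitly REAL
      print *, ival, rval
c     "acc_device_nvidia" starts with "a", so without "use openacc"
c     it is an undefined REAL, not the named constant the OpenACC
c     runtime expects.
      end
```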

% pgf90 test1.f90
PGF90-S-0038-Symbol, acc_device_nvidia, has not been explicitly declared (test1.f90)
  0 inform,   0 warnings,   1 severes, 0 fatal for test
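For reference, a sketch of the corrected program with both fixes applied (“use openacc” plus “implicit none”; the final print line is an addition for sanity-checking the result, and the code assumes two NVIDIA GPUs are visible):

```fortran
      program test
      use OMP_LIB
      use openacc          ! declares acc_device_num_t / acc_device_nvidia
      implicit none        ! would have caught the missing declaration

      integer myid,i,N,chunk
      integer a(1:100)

      N = size(a)
      chunk=N/2            ! hardcoded for 2 OMP threads

      call omp_set_num_threads(2)

!$OMP PARALLEL PRIVATE(myid) SHARED(a)
      myid = OMP_GET_THREAD_NUM()
      call acc_set_device_num(myid,acc_device_nvidia)

!$acc kernels do
      do i=myid*chunk+1,myid*chunk+chunk   ! thread 0 does first half
         a(i)=i
      enddo
!$OMP END PARALLEL

      print *, a(1), a(50), a(51), a(100)  ! expect 1 50 51 100
      end
```

With this version, PGI_ACC_TIME=1 should report thread 0 on devicenum=0 and thread 1 on devicenum=1.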
Mat

Ah. Of course. I don’t deal with .f’s that often and my fingers type ‘implicit none’ by default now.

Thanks, Mat.

Mat

Ha! I’ve converted you to spelling your name with one T!