I wrote a short Fortran program using OpenMP + OpenACC. All it does is set a(i)=i in parallel, so the array should read 1, 2, 3, 4, …
My compile command, followed by the entire code:
pgfortran -Minfo -mp -acc -o test test.f
      program test
      use OMP_LIB
      integer myid,i,N,chunk
      integer a(1:100)
      N = size(a)
      chunk=N/2 ! hardcoded for 2 OMP threads
      call omp_set_num_threads(2)
!$OMP PARALLEL PRIVATE(myid) SHARED(a)
      myid = OMP_GET_THREAD_NUM()
      call acc_set_device_num(myid,acc_device_nvidia)
!$acc kernels do
      do i=myid*chunk+1,myid*chunk+chunk ! 0th thread does first half
        a(i)=i
      enddo
!$OMP END PARALLEL
      end
At first I thought it was working correctly, because the array a has the expected values and the compiler output seemed okay. However, setting PGI_ACC_TIME=1 shows:
Accelerator Kernel Timing data
/home/ben/scratch/test.f
test thread=0 NVIDIA devicenum=0
time(us): 84
16: compute region reached 1 time
17: kernel launched 2 times
grid: [1] block: [64]
device time(us): total=22 max=16 min=6 avg=11
elapsed time(us): total=350 max=327 min=23 avg=175
20: data copyout reached 2 times
device time(us): total=62 max=43 min=0 avg=31
/home/ben/scratch/test.f
test thread=1 NVIDIA devicenum=0
time(us): 0
16: compute region reached 1 time
or occasionally:
Accelerator Kernel Timing data
/home/ben/scratch/test.f
test thread=0 NVIDIA devicenum=0
time(us): 55
16: compute region reached 1 time
17: kernel launched 1 time
grid: [1] block: [64]
device time(us): total=32 max=32 min=32 avg=32
elapsed time(us): total=49 max=49 min=49 avg=49
20: data copyout reached 1 time
device time(us): total=23 max=23 min=23 avg=23
/home/ben/scratch/test.f
test thread=1 NVIDIA devicenum=0
time(us): 22
16: compute region reached 1 time
17: kernel launched 1 time
grid: [1] block: [64]
device time(us): total=11 max=11 min=11 avg=11
elapsed time(us): total=18 max=18 min=18 avg=18
20: data copyout reached 1 time
device time(us): total=11 max=11 min=11 avg=11
So it seems I am only using one of the two GPUs, since both threads show devicenum=0.
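One way to check this directly, independent of the timing report, would be to ask the OpenACC runtime which device each thread is attached to. The following is only a minimal sketch, not part of the original program: the program name devcheck is hypothetical, and it adds use openacc and implicit none, neither of which appears in test.f. It assumes the runtime numbers devices from 0, so that thread 0 maps to device 0 and thread 1 to device 1; that convention is implementation-dependent.

```fortran
      program devcheck
      use omp_lib
      use openacc          ! assumption: not present in the original test.f
      implicit none
      integer :: myid, ndev

!     How many NVIDIA devices does the OpenACC runtime see?
      ndev = acc_get_num_devices(acc_device_nvidia)
      print *, 'visible NVIDIA devices: ', ndev
      if (ndev .lt. 2) print *, 'fewer than 2 devices visible'

      call omp_set_num_threads(2)
!$omp parallel private(myid)
      myid = omp_get_thread_num()
!     Same call as in test.f: one device per OpenMP thread
      call acc_set_device_num(myid, acc_device_nvidia)
!     Report the device this thread is now attached to
      print *, 'thread ', myid, ' bound to device ',
     &         acc_get_device_num(acc_device_nvidia)
!$omp end parallel
      end program devcheck
```

It can be built with the same flags as above, for example pgfortran -mp -acc -o devcheck devcheck.f. If both threads report the same device number even though two devices are visible, the device selection itself is not taking effect, rather than the timing output being misleading.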
Compiler accelerator info:
test:
12, Parallel region activated
16, Generating present_or_copyout(a(myid*50+1:myid*50+50))
Generating NVIDIA code
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
Generating compute capability 3.0 binary
17, Loop is parallelizable
Accelerator kernel generated
17, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
20, Parallel region terminated
Any idea why I am apparently using only one GPU?