Multi-GPU Fortran OpenACC and OpenMP

Hello Mat,
Hope you are doing well…
I have a problem where I wish to run two GPUs using OpenACC in Fortran while using OpenMP. Here is my code:

Program TEST
use openacc
use omp_lib
IMPLICIT NONE
INTEGER :: ngpus
integer, value :: devicenum
integer(acc_device_kind), value :: devicetype
integer(acc_device_property):: property
character*(1000) :: string
REAL*8, Allocatable :: A(:,:),B(:,:),C(:,:)
INTEGER :: i,j,n
devicetype = acc_get_device_type()
ngpus = acc_get_num_devices(devicetype)
Write(*,*) 'Number of ACC devices are: ', ngpus
do devicenum=1,ngpus
   property=acc_property_driver
   call acc_get_property_string(devicenum,devicetype,property,string)
   Write(*,*) trim(string)
   property=acc_property_name
   call acc_get_property_string(devicenum,devicetype,property,string)
   Write(*,*) trim(string)
enddo
N=1000
Allocate(A(n,n),B(n,n),C(n,n))
A=0
B=0
C=0
call omp_set_num_threads(2)
!$omp parallel private(i,j)
!$omp sections
!$omp section
call acc_set_device_num(1,acc_device_nvidia)
!$acc data copyout(A)
!$acc parallel loop async
Do i = 1 , n
Do j = 1 , n
A(i,j)=(2*i+3*j)/6
end do
end do
!$acc wait
!$acc end data
!$omp section
call acc_set_device_num(2,acc_device_nvidia)
!$acc data copyout(B)
!$acc parallel loop async
Do i = 1 , n
Do j = 1 , n
B(i,j)=(2*i+3*j)/12
end do
end do
!$acc wait
!$acc end data
!$omp end sections
!$omp end parallel
call acc_set_device_num(1,acc_device_nvidia)
!$acc data copyin(A,B) copyout(C)
!$acc parallel loop
Do i = 1 , n
Do j = 1 , n
C(i,j)= A(i,j) - B(i,j) 
end do
end do
!$acc wait
!$acc end data
WRITE(*,*) C(50,50)
End Program

And I am compiling it with:

!      set ACC_DEVICE_TYPE=nvidia
!  % pgf90 -mp -ta=tesla:cc35,cc50 test2.f90 -Minfo=all

There problem is, it appears that I have no idea how to use OpenMP in Fortran. I am using the omp parallel and sections, however, what I get is that the parts where I wish for array [A] to be solved on GPU1 and array to be solved on GPU2 …end up all being solved on GPU2…
I have checked that is no parallelism through PGPROF:
https://www.photobox.co.uk/my/photo/full?photo_id=501190025887
https://www.photobox.co.uk/my/photo/full?photo_id=501190025121
I wish to calculate both [A] and at the same time, each on a different GPU. Kindly give me an idea where I went wrong…
Thank you.
Ahmed

Hi Ahmed,

I couldn’t view your screen shots but was able to run and profile your code. It’s working as expected but because the device initialization time for the seconds GPU is longer than the first kernel’s compute time, it makes it so they aren’t running concurrently.

To fix, add the device initialization before you enter the compute loop. i.e. something like:

 do devicenum=1,ngpus
   property=acc_property_driver
   call acc_get_property_string(devicenum,devicetype,property,string)
   Write(*,*) trim(string)
   property=acc_property_name
   call acc_get_property_string(devicenum,devicetype,property,string)
   Write(*,*) trim(string)
   call acc_set_device_num(devicenum,acc_device_nvidia)
enddo

Also, using OpenMP isn’t really necessary. You can use pure OpenACC. For example:

Program TEST
use openacc
use omp_lib
IMPLICIT NONE
INTEGER :: ngpus
integer, value :: devicenum
integer(acc_device_kind), value :: devicetype
integer(acc_device_property):: property
character*(1000) :: string
REAL*8, Allocatable :: A(:,:),B(:,:),C(:,:)
INTEGER :: i,j,n
devicetype = acc_get_device_type()
ngpus = acc_get_num_devices(devicetype)
Write(*,*) 'Number of ACC devices are: ', ngpus
ngpus=2
do devicenum=1,ngpus
   property=acc_property_driver
   call acc_get_property_string(devicenum,devicetype,property,string)
   Write(*,*) trim(string)
   property=acc_property_name
   call acc_get_property_string(devicenum,devicetype,property,string)
   Write(*,*) trim(string)
   call acc_set_device_num(devicenum,acc_device_nvidia)
enddo
N=10000
Allocate(A(n,n),B(n,n),C(n,n))
call acc_set_device_num(1,acc_device_nvidia)
!$acc enter data create(A,C) async
!$acc parallel loop present(A) async
Do i = 1 , n
Do j = 1 , n
A(i,j)=(2*i+3*j)/6
end do
end do

call acc_set_device_num(2,acc_device_nvidia)
!$acc enter data create(B) async
!$acc parallel loop present(B) async
Do i = 1 , n
Do j = 1 , n
B(i,j)=(2*i+3*j)/12
end do
end do
!$acc wait
!$acc exit data copyout(B)
call acc_set_device_num(1,acc_device_nvidia)
!$acc enter data copyin(B)
!$acc parallel loop present(A,B,C)
Do i = 1 , n
Do j = 1 , n
C(i,j)= A(i,j) - B(i,j)
end do
end do
!$acc exit data copyout(C) delete(A,B)
WRITE(*,*) C(50,50)
End Program

Hope this helps,
Mat

Thank you so much!
This is perfect.
Ahmed :D