Combining OpenMP with OpenACC

I wrote a simple code to do matrix multiplication. For further study, I added a loop outside the multiplication kernel. In this test I would like to use OpenMP to parallelize that outer loop and OpenACC to parallelize the kernel loops, but there are some errors in the code. Can someone help me solve this problem?
The code follows. Thank you very much.

program main
    use accel_lib
    integer :: n        ! size of the matrices
    real,dimension(:,:),allocatable :: a ,b,c,c1
    real,dimension(:),allocatable :: csum  
    integer :: i,j,k,kk,mk
    integer :: t1,t2,thn
    real :: diff,st,pt,speedup 
!$ integer:: omp_get_num_threads
!$ integer:: omp_get_num_procs
!$ thn=omp_get_num_procs()
!$ write(*,*) "The number of available processors/threads in the system: ",thn
thn=1    ! when OpenMP is not used
!$ write(*,*) "Enter the number of threads"
!$ read(*,*) thn
!$ call omp_set_num_threads(thn)             ! set the number of threads
!$ call acc_init( acc_device_nvidia )
    n =512
    mk=16
    allocate(a(n,n),b(n,n),c(n,n),c1(n,n),csum(mk))
    do i=1,n
        do j=1,n
             a(i,j)=(i+j)/(i)
             b(i,j)=2*(i+j)/(i)
        end do
    end do   
    call system_clock( count=t1 )   
!$omp parallel do  shared(n,a,b),private(c,kk)
    do kk=1,mk                          !CPU processing 
        c=0.0d0    
        do i=1,n
            do j=1,n
                do k=1,n
                    c(i,j)=c(i,j)+a(i,k)*b(k,j)*kk
                end do 
            end do
        end do
        csum(kk)=sum(c)
    end do
!$omp end parallel do            
    write(*,*)  csum    
    csum=0.0d0
    call system_clock( count=t2 )
    st= (t2-t1)/1.0d6  
    print *, 'CPU time: ', st,  ' seconds'     
    call system_clock( count=t1 ) 
    
    call acc_init( acc_device_nvidia )        
!$omp parallel do  shared(n,a,b),private(c1,kk)
    do kk=1,mk                            !GPU processing    
        c1=0.0d0 
        call obj(n,a,b,c1,kk)
        csum(kk)=sum(c1)
    end do  
!$omp end parallel do     
    write(*,*)  csum
    call system_clock( count=t2 )
    pt=(t2-t1)/1.0d6   
    print *, 'GPU time: ', pt,  ' seconds'    
    speedup=st/pt
    print *, 'speedup: ', speedup  
end program    

subroutine obj(n,a,b,c1,kk)
    implicit none
    integer, intent(in)::n,kk
    real, intent(in)::a(n,n),b(n,n)
    real, intent(inout)::c1(n,n)
    integer::i,j,k

!$acc parallel loop 
        do j=1,n
            do i=1,n
                do k=1,n
                    c1(i,j)=c1(i,j)+a(i,k)*b(k,j)*kk
                end do 
            end do
        end do
!$acc end parallel loop

    return
end subroutine obj

What’s the error you’re seeing? There is a performance problem in that you initialize the accelerator twice, but removing the second call fixes that. Also, since all the threads use the same accelerator, your relative speed-up will decrease as there’s more contention on the device:


% pgf90 -mp -acc mp.f90 -V14.3 -Minfo=accel ; a.out
obj:
     70, Accelerator kernel generated
         71, !$acc loop gang ! blockidx%x
         72, !$acc loop vector(256) ! threadidx%x
     70, Generating present_or_copyin(b(:n,:n))
         Generating present_or_copyin(a(:n,:n))
         Generating present_or_copy(c1(:n,:n))
         Generating NVIDIA code
     72, Loop is parallelizable
     73, Complex loop carried dependence of 'c1' prevents parallelization
         Loop carried dependence of 'c1' prevents parallelization
         Loop carried backward dependence of 'c1' prevents vectorization
 The number of available processors/threads in the system:            12
 Enter the number of threads
2
   2.4596063E+09   4.9192125E+09   7.3787965E+09   9.8384251E+09
   1.2298042E+10   1.4757593E+10   1.7217190E+10   1.9676850E+10
   2.2136357E+10   2.4596085E+10   2.7055665E+10   2.9515186E+10
   3.1974867E+10   3.4434380E+10   3.6893897E+10   3.9353700E+10
 CPU time:     2.648998      seconds
   2.4596063E+09   4.9192125E+09   7.3787965E+09   9.8384251E+09
   1.2298042E+10   1.4757593E+10   1.7217190E+10   1.9676850E+10
   2.2136357E+10   2.4596085E+10   2.7055665E+10   2.9515186E+10
   3.1974867E+10   3.4434380E+10   3.6893897E+10   3.9353700E+10
 GPU time:    8.2851999E-02  seconds
 speedup:     31.97265
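
For what it’s worth, here is a rough sketch of what I mean by initializing the device only once, before any of the timed regions. The program name and the stubbed-out loop body are placeholders rather than your actual code; the kernel call and timing from your program would go where the placeholder comment is.

program init_once                          ! hypothetical stand-in for the main program above
    use accel_lib
    implicit none
    integer :: kk, mk
    real, allocatable :: csum(:)
    mk = 16
    allocate(csum(mk))
    call acc_init( acc_device_nvidia )     ! single initialization, before any timed region
!$omp parallel do shared(csum) private(kk)
    do kk = 1, mk
        csum(kk) = real(kk)                ! placeholder for the call to obj plus the sum
    end do
!$omp end parallel do
    print *, csum
end program init_once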

Thanks for your help. I used the compile command you gave me, and the compilation output is the same. However, when I run the executable I get the following error:

 The number of available processors/threads in the system:             8
 Enter the number of threads
2
   2.4596063E+09   4.9192125E+09   7.3787965E+09   9.8384251E+09
   1.2298042E+10   1.4757593E+10   1.7217190E+10   1.9676850E+10
   2.2136357E+10   2.4596085E+10   2.7055665E+10   2.9515186E+10
   3.1974867E+10   3.4434380E+10   3.6893897E+10   3.9353700E+10
 CPU time:     2.853000      seconds
call to cuModuleLoadData returned error 201: Invalid context

Can you tell me why? I appreciate your help.

What device do you have? Older cards don’t support multiple host context creation.
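
As a quick check, you could run something like the probe below, where every OpenMP thread launches a trivial OpenACC kernel. If each host thread ends up with its own context on the device (which is what the cuModuleLoadData error suggests), then a card or driver that only allows one context should make this small test fail right away instead of partway through your real run. The program name is made up and this is only a diagnostic sketch, not a fix.

program context_check                      ! hypothetical probe, not part of your code
    use omp_lib
    implicit none
    integer :: tid, i
    real :: x(16)
!$omp parallel private(tid, i, x)
    tid = omp_get_thread_num()
!$acc parallel loop copy(x)
    do i = 1, 16
        x(i) = real(i)                     ! trivial work, just enough to force a kernel launch
    end do
!$acc end parallel loop
!$omp critical
    print *, 'thread', tid, 'ran its kernel, x(16) =', x(16)
!$omp end critical
!$omp end parallel
end program context_check

Build it the same way (pgf90 -mp -acc) and run with two or more threads.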

  • Mat

Hi Mat,
I use a GeForce GTX 780 Ti.

I don’t know if the GTX 780 Ti allows multiple contexts, given that it’s a graphics card and not a compute card.

If it does, then check to see if the compute mode is set to “Exclusive” ( i.e. run “nvidia-smi -a” and look for “Compute Mode”).

If you’re running on Windows, then it might be the WDDM driver that’s inhibiting multiple context creation. You may need to use the Tesla Compute Cluster (TCC) driver, which on Windows is only available for Tesla cards.

  • Mat