Combining OpenMP with OpenACC

I wrote a simple code to do matrix multiplication. For further study, I added a loop outside the multiplication kernel. In this test I would like to use OpenMP to parallelize that outer loop and OpenACC to parallelize the kernel loops, but there are some errors in the code. Can someone help me solve this problem?
The code follows. Thank you very much.

program main
    use accel_lib
    integer :: n        ! size of the matrices
    real,dimension(:,:),allocatable :: a ,b,c,c1
    real,dimension(:),allocatable :: csum  
    integer :: i,j,k,kk,mk
    integer :: t1,t2,thn
    real :: diff,st,pt,speedup 
!$ integer:: omp_get_num_threads
!$ integer:: omp_get_num_procs
!$ thn=omp_get_num_procs()
!$ write(*,*) "The number of available processors/threads in the system: ",thn
thn=1    ! when OpenMP is not used
!$ write(*,*) "Enter the number of threads"
!$ read(*,*) thn
!$ call omp_set_num_threads(thn)             ! set the number of threads
!$ call acc_init( acc_device_nvidia )
    n =512
    mk=16
    allocate(a(n,n),b(n,n),c(n,n),c1(n,n),csum(mk))
    do i=1,n
        do j=1,n
             a(i,j)=(i+j)/(i)
             b(i,j)=2*(i+j)/(i)
        end do
    end do   
    call system_clock( count=t1 )   
!$omp parallel do  shared(n,a,b),private(c,kk)
    do kk=1,mk                          !CPU processing 
        c=0.0d0    
        do i=1,n
            do j=1,n
                do k=1,n
                    c(i,j)=c(i,j)+a(i,k)*b(k,j)*kk
                end do 
            end do
        end do
        csum(kk)=sum(c)
    end do
!$omp end parallel do            
    write(*,*)  csum    
    csum=0.0d0
    call system_clock( count=t2 )
    st= (t2-t1)/1.0d6  
    print *, 'CPU time: ', st,  ' seconds'     
    call system_clock( count=t1 ) 
    
    call acc_init( acc_device_nvidia )        
!$omp parallel do  shared(n,a,b),private(c1,kk)
    do kk=1,mk                            !GPU processing    
        c1=0.0d0 
        call obj(n,a,b,c1,kk)
        csum(kk)=sum(c1)
    end do  
!$omp end parallel do     
    write(*,*)  csum
    call system_clock( count=t2 )
    pt=(t2-t1)/1.0d6   
    print *, 'GPU time: ', pt,  ' seconds'    
    speedup=st/pt
    print *, 'speedup: ', speedup  
end program    

subroutine obj(n,a,b,c1,kk)
    implicit none
    integer, intent(in)::n,kk
    real, intent(in)::a(n,n),b(n,n)
    real, intent(inout)::c1(n,n)
    integer::i,j,k

!$acc parallel loop 
        do j=1,n
            do i=1,n
                do k=1,n
                    c1(i,j)=c1(i,j)+a(i,k)*b(k,j)*kk
                end do 
            end do
        end do
!$acc end parallel loop

    return
end subroutine obj

What’s the error you’re seeing? There is a performance problem in that you initialize the accelerator twice, but removing the second call fixes that. Also, since all the threads use the same accelerator, your relative speed-up will decrease as there’s more contention on the device:


% pgf90 -mp -acc mp.f90 -V14.3 -Minfo=accel ; a.out
obj:
     70, Accelerator kernel generated
         71, !$acc loop gang ! blockidx%x
         72, !$acc loop vector(256) ! threadidx%x
     70, Generating present_or_copyin(b(:n,:n))
         Generating present_or_copyin(a(:n,:n))
         Generating present_or_copy(c1(:n,:n))
         Generating NVIDIA code
     72, Loop is parallelizable
     73, Complex loop carried dependence of 'c1' prevents parallelization
         Loop carried dependence of 'c1' prevents parallelization
         Loop carried backward dependence of 'c1' prevents vectorization
 The number of available processors/threads in the system:            12
 Enter the number of threads
2
   2.4596063E+09   4.9192125E+09   7.3787965E+09   9.8384251E+09
   1.2298042E+10   1.4757593E+10   1.7217190E+10   1.9676850E+10
   2.2136357E+10   2.4596085E+10   2.7055665E+10   2.9515186E+10
   3.1974867E+10   3.4434380E+10   3.6893897E+10   3.9353700E+10
 CPU time:     2.648998      seconds
   2.4596063E+09   4.9192125E+09   7.3787965E+09   9.8384251E+09
   1.2298042E+10   1.4757593E+10   1.7217190E+10   1.9676850E+10
   2.2136357E+10   2.4596085E+10   2.7055665E+10   2.9515186E+10
   3.1974867E+10   3.4434380E+10   3.6893897E+10   3.9353700E+10
 GPU time:    8.2851999E-02  seconds
 speedup:     31.97265
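
For what it’s worth, here is a rough sketch of what I mean by initializing the device only once, before any of the timed regions. The program name and the stubbed-out loop body are placeholders rather than your actual code; the kernel call and timing from your program would go where the placeholder comment is.

program init_once                          ! hypothetical stand-in for the main program above
    use accel_lib
    implicit none
    integer :: kk, mk
    real, allocatable :: csum(:)
    mk = 16
    allocate(csum(mk))
    call acc_init( acc_device_nvidia )     ! single initialization, before any timed region
!$omp parallel do shared(csum) private(kk)
    do kk = 1, mk
        csum(kk) = real(kk)                ! placeholder for the call to obj plus the sum
    end do
!$omp end parallel do
    print *, csum
end program init_once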

Thanks for your help. I used the compile command you gave me, and the compilation output is the same. However, when I run the executable I get the following error:

 The number of available processors/threads in the system:             8
 Enter the number of threads
2
   2.4596063E+09   4.9192125E+09   7.3787965E+09   9.8384251E+09
   1.2298042E+10   1.4757593E+10   1.7217190E+10   1.9676850E+10
   2.2136357E+10   2.4596085E+10   2.7055665E+10   2.9515186E+10
   3.1974867E+10   3.4434380E+10   3.6893897E+10   3.9353700E+10
 CPU time:     2.853000      seconds
call to cuModuleLoadData returned error 201: Invalid context

Can you tell me why? I appreciate your help.

What device do you have? Older cards don’t support multiple host context creation.
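
As a quick check, you could run something like the probe below, where every OpenMP thread launches a trivial OpenACC kernel. If each host thread ends up with its own context on the device (which is what the cuModuleLoadData error suggests), then a card or driver that only allows one context should make this small test fail right away instead of partway through your real run. The program name is made up and this is only a diagnostic sketch, not a fix.

program context_check                      ! hypothetical probe, not part of your code
    use omp_lib
    implicit none
    integer :: tid, i
    real :: x(16)
!$omp parallel private(tid, i, x)
    tid = omp_get_thread_num()
!$acc parallel loop copy(x)
    do i = 1, 16
        x(i) = real(i)                     ! trivial work, just enough to force a kernel launch
    end do
!$acc end parallel loop
!$omp critical
    print *, 'thread', tid, 'ran its kernel, x(16) =', x(16)
!$omp end critical
!$omp end parallel
end program context_check

Build it the same way (pgf90 -mp -acc) and run with two or more threads.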

  • Mat

Hi Mat,
I use a GeForce GTX 780 Ti.

I don’t know if the GTX 780 Ti allows multiple contexts, given that it’s a graphics card and not a compute card.

If it does, then check to see if the compute mode is set to “Exclusive” ( i.e. run “nvidia-smi -a” and look for “Compute Mode”).

If you’re running on Windows, then it might be the WDDM driver that’s inhibiting multiple context creation. You may need to use the Tesla Compute Cluster (TCC) driver, which on Windows is only available for Tesla cards.

  • Mat