Trying to parallelize the inverse of a single matrix with OpenACC

Hi, everyone,

I really need some advice, please help!
I'm trying to port an F90 code to the GPU with OpenACC, but the serial code contains a matrix-inversion step. If I leave that step on the CPU, the extra data movement makes the parallelized code much slower. So I wonder whether this matrix inversion can be parallelized properly, or whether there is a well-developed parallel algorithm for matrix inversion.
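To show what I mean about the data movement: the rest of the computation stays on the GPU, so running the inversion on the CPU forces a copy out and back around every call. A rough sketch of the situation (the surrounding routines here are made up for illustration, not from my real program):

!$acc data copy(a)
call other_kernels_before(a,n)   ! hypothetical OpenACC work on a
!$acc update self(a)             ! copy a back to the host ...
call invam(a,n)                  ! ... invert on the CPU ...
!$acc update device(a)           ! ... and copy the inverse to the GPU again
call other_kernels_after(a,n)    ! hypothetical OpenACC work continues
!$acc end data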
Here is the code:

subroutine invam(a,n)
use head

dimension a(n,n),is(n),js(n)
double precision a,dd,t

l=1

! Gauss-Jordan elimination with full pivoting; the inverse overwrites a
do 100 k=1,n
! search the remaining submatrix for the largest pivot
dd=0.
do 10 i=k,n
do 10 j=k,n
if(abs(a(i,j)).gt.dd) then
	dd=abs(a(i,j))
	is(k)=i
	js(k)=j
endif
10 continue
if(dd+1.0.eq.1.0) then
	l=0
	print *, 'err, not inv'
!	return
endif
! swap row k with the pivot row
do 30 j=1,n
t=a(k,j)
a(k,j)=a(is(k),j)
a(is(k),j)=t
30 continue
! swap column k with the pivot column
do 40 i=1,n
t=a(i,k)
a(i,k)=a(i,js(k))
a(i,js(k))=t
40 continue
! invert the pivot, then eliminate the other rows and columns
a(k,k)=1./a(k,k)
do 50 j=1,n
if(j.ne.k) then
	a(k,j)=a(k,j)*a(k,k)
endif
50 continue
do 70 i=1,n
if(i.ne.k) then
	do 60 j=1,n
	if(j.ne.k) then
		a(i,j)=a(i,j)-a(i,k)*a(k,j)
	endif
60	continue
endif
70 continue
do 80 i=1,n
if(i.ne.k) then
	a(i,k)=-a(i,k)*a(k,k)
endif
80 continue
100 continue
! undo the row/column swaps in reverse order
do 130 k=n,1,-1
do 110 j=1,n
t=a(k,j)
a(k,j)=a(js(k),j)
a(js(k),j)=t
110 continue
do 120 i=1,n
t=a(i,k)
a(i,k)=a(i,is(k))
a(i,is(k))=t
120 continue
130 continue

return
end subroutine invam

Please, any advice on parallelizing this code, or any document on parallel algorithms for matrix inversion, would be appreciated.
Thanks!

Hi Guo shuhao,

Do you have a complete example? I tried to write a small driver for your routine but got NaNs. I'm not sure if there's an error in your code or if it's a problem with what I did.
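For reference, the driver was along these lines (a minimal sketch; the diagonally dominant test matrix is made up, and I'm assuming your head module compiles alongside):

program test_invam
implicit none
integer, parameter :: n = 8
double precision :: a(n,n), a0(n,n), c(n,n)
integer :: i
! made-up test matrix, diagonally dominant so it is safely invertible
call random_number(a)
do i=1,n
a(i,i)=a(i,i)+dble(n)
end do
a0=a
call invam(a,n)
! a should now hold inv(a0), so a0*a - I should be close to zero
c=matmul(a0,a)
do i=1,n
c(i,i)=c(i,i)-1.d0
end do
print *, 'max residual =', maxval(abs(c))
end program test_invam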

-Mat

Thanks for your quick reply! I uploaded the code example and the wholefield.dat file for initialization to GitHub; you can check it at https://github.com/Dovahkiinsk/code-example

Thanks for your help!

Ok, I gave it a try.

Basically, I don't think you can parallelize the "k" loops due to the dependencies: each iteration of k reads values that the previous iteration wrote. I looked at a CUDA matrix-inversion code, and they didn't parallelize the outer loop either, so I think this is on the right track. Though, I'm not an expert in this algorithm, so this is only a suggestion.

I did move the loop that computes the is and js values out of the main k loop, which may or may not be correct, though the output values are the same as the CPU version's. I'll leave it to you whether this is correct. If not, then these loops can't be parallelized due to the dependency on "dd".

subroutine invam(a,n)
use head

dimension a(n,n),is(n),js(n)
double precision a,dd,t

l=1

!$acc data copy(a) create(is,js)

! pivot search for every k up front, hoisted out of the elimination loop
!$acc kernels loop
do 101 k=1,n
dd=0.
do 10 i=k,n
do 10 j=k,n
if(abs(a(i,j)).gt.dd) then
        dd=abs(a(i,j))
        is(k)=i
        js(k)=j
endif
10 continue

if(dd+1.0.eq.1.0) then
        l=0
        print *, 'err, not inv'
!       return
endif
101 continue


! main elimination loop; the k iterations stay sequential
do 100 k=1,n
!$acc kernels loop
do 30 j=1,n
t=a(k,j)
a(k,j)=a(is(k),j)
a(is(k),j)=t
30 continue

!$acc kernels loop
do 40 i=1,n
t=a(i,k)
a(i,k)=a(i,js(k))
a(i,js(k))=t
40 continue

! single-threaded update on the device, avoiding a host round trip
!$acc serial
a(k,k)=1./a(k,k)
!$acc end serial

!$acc parallel loop
do 50 j=1,n
if(j.ne.k) then
        a(k,j)=a(k,j)*a(k,k)
endif
50 continue

!$acc parallel loop
do 70 i=1,n
if(i.ne.k) then
!$acc loop
        do 60 j=1,n
        if(j.ne.k) then
                a(i,j)=a(i,j)-a(i,k)*a(k,j)
        endif
60      continue
endif
70 continue

!$acc parallel loop
do 80 i=1,n
if(i.ne.k) then
        a(i,k)=-a(i,k)*a(k,k)
endif
80 continue
100 continue

do 130 k=n,1,-1
!$acc parallel loop
do 110 j=1,n
t=a(k,j)
a(k,j)=a(js(k),j)
a(js(k),j)=t
110 continue

!$acc parallel loop
do 120 i=1,n
t=a(i,k)
a(i,k)=a(i,is(k))
a(i,is(k))=t
120 continue
130 continue

!$acc end data

return
end subroutine invam

-Mat

Thanks, Mat.
I did some tests; the routine can be ported to the GPU this way, though the speedup was not so satisfying. But still, thanks!
And I came across a new, interesting thing, and I wonder how it works.
Here is the code; it's a triple loop:

!$acc kernels 
do 21 ip=1,ncy

do i=iac_cy-nij_cy-5,iac_cy+nij_cy+5!
do j=jac_cy-nij_cy*2-5,jac_cy+nij_cy*2+5!
drx=(x(i)-xycy(ip,1))/msp
dry=(y(j)-xycy(ip,2))/msp
if(abs(drx).le.dh.and.abs(dry).le.dh) then
	if(abs(drx).le.0.5) then
		detax=3./8.+pi/32.-drx**2/4.
	elseif(abs(drx).le.1.5) then
		detax=1./4.+(1.-abs(drx))*sqrt(-2.+8.*abs(drx)&
     		 -4.*drx**2)/8.-asin(sqrt(2.)*(abs(drx)-1.))/8.
	else
		detax=17./16.-pi/64.-3.*abs(drx)/4.+drx**2/8.&
     		 +(abs(drx)-2.)*sqrt(-14.+16*abs(drx)-4.*drx**2)/16.&
     		 +asin(sqrt(2.)*(abs(drx)-2.))/16.
	endif
	if(abs(dry).le.0.5) then
		detay=3./8.+pi/32.-dry**2/4.
	elseif(abs(dry).le.1.5) then
		detay=1./4.+(1.-abs(dry))*sqrt(-2.+8.*abs(dry)&
     		 -4.*dry**2)/8.-asin(sqrt(2.)*(abs(dry)-1.))/8.
	else
		detay=17./16.-pi/64.-3.*abs(dry)/4.+dry**2/8.&
     		 +(abs(dry)-2.)*sqrt(-14.+16*abs(dry)-4.*dry**2)/16.&
     		 +asin(sqrt(2.)*(abs(dry)-2.))/16.
	endif
	ux(i,j)=ux(i,j)-0.5*fbcy(ip,1)*detax*detay*dt/msp/msp
	uy(i,j)=uy(i,j)-0.5*fbcy(ip,2)*detax*detay*dt/msp/msp

	ffx(i,j)=ffx(i,j)-fbcy(ip,1)*detax*detay*dt/msp/msp
	ffy(i,j)=ffy(i,j)-fbcy(ip,2)*detax*detay*dt/msp/msp
end if
end do
end do

21 continue
!$acc end kernels

And I got compiler feedback like this:

    324, Loop is parallelizable
    325, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        322, !$acc loop seq
        324, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
        325, !$acc loop gang ! blockidx%y

It seems the outer loop runs serially, so I moved the kernels directive down to the inner doubly nested loop, like this:

do 21 ip=1,ncy
!$acc kernels
do i=iac_cy-nij_cy-5,iac_cy+nij_cy+5!
do j=jac_cy-nij_cy*2-5,jac_cy+nij_cy*2+5!
......

This is just one part of my program, and it's the only place I modified; the rest stayed the same. But after the modification, it's slower!
With the first version, the whole program took 3067 s.
With the second version, it took 3670 s.
I wonder what the difference between these two versions is. It seems the outer loop runs serially in both, but the running times differ. By the way, the serial version of the program takes 31881 s.

Try running each version with the environment variable "PGI_ACC_TIME=1" set to get a quick profile. That will give a better idea of what's going on. Also, if you have a smaller data set that runs more quickly, that may be helpful for the profiling runs.

I think the compiler might be interchanging the loops in the first case, i.e. changing the nest to:

!$acc kernels 
do i=iac_cy-nij_cy-5,iac_cy+nij_cy+5! 
do j=jac_cy-nij_cy*2-5,jac_cy+nij_cy*2+5! 
do 21 ip=1,ncy

This way, the loops can be launched as one kernel instead of "ncy" separate kernels.
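Written out, the interchanged nest would look something like this (a sketch only; whether it is legal depends on the rest of your code, but each (i,j) point still accumulates its ip contributions one at a time within a single thread, so the sums should match):

!$acc kernels
!$acc loop independent collapse(2)
do i=iac_cy-nij_cy-5,iac_cy+nij_cy+5
do j=jac_cy-nij_cy*2-5,jac_cy+nij_cy*2+5
!$acc loop seq
do ip=1,ncy
! same body as your loop above: compute drx, dry, detax, detay,
! then accumulate into ux(i,j), uy(i,j), ffx(i,j), ffy(i,j)
......
end do
end do
end do
!$acc end kernels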

The output from profiling will help to confirm this.

-Mat