# Trying to parallelize the inverse of one matrix with OpenACC

Hi, everyone,

I'm trying to port an F90 code to the GPU with OpenACC, but the serial code contains a matrix inversion. If I leave that on the CPU, it costs extra time in data movement, making the parallelized code much slower. So I wonder whether matrix inversion can be parallelized properly, or whether there are any well-developed parallel algorithms for it.
Here is the code:

```fortran
subroutine invam(a,n)

dimension a(n,n),is(n),js(n)
double precision a,dd,t

l=1

do 100 k=1,n
dd=0.
do 10 i=k,n
do 10 j=k,n
if(abs(a(i,j)).gt.dd) then
dd=abs(a(i,j))
is(k)=i
js(k)=j
endif
10 continue
if(dd+1.0.eq.1.0) then
l=0
print *, 'err, not inv'
!	return
endif
do 30 j=1,n
t=a(k,j)
a(k,j)=a(is(k),j)
a(is(k),j)=t
30 continue
do 40 i=1,n
t=a(i,k)
a(i,k)=a(i,js(k))
a(i,js(k))=t
40 continue
a(k,k)=1./a(k,k)
do 50 j=1,n
if(j.ne.k) then
a(k,j)=a(k,j)*a(k,k)
endif
50 continue
do 70 i=1,n
if(i.ne.k) then
do 60 j=1,n
if(j.ne.k) then
a(i,j)=a(i,j)-a(i,k)*a(k,j)
endif
60	continue
endif
70 continue
do 80 i=1,n
if(i.ne.k) then
a(i,k)=-a(i,k)*a(k,k)
endif
80 continue
100 continue
do 130 k=n,1,-1
do 110 j=1,n
t=a(k,j)
a(k,j)=a(js(k),j)
a(js(k),j)=t
110 continue
do 120 i=1,n
t=a(i,k)
a(i,k)=a(i,is(k))
a(i,is(k))=t
120 continue
130 continue

return
end subroutine invam
```

Please, any advice on parallelizing this code, or pointers to documents on parallel algorithms for matrix inversion, would be appreciated.
Thanks!

Hi Guo shuhao,

Do you have a complete example? I tried to write a small driver for your routine but got NaNs. Not sure if there's an error in your code or a problem with what I did.

-Mat

Thanks for your quick reply! I uploaded the code example and the wholefield.dat file for initialization to GitHub. You can check it here: https://github.com/Dovahkiinsk/code-example

Ok, I gave it a try.

Basically, I don't think you can parallelize the "k" loops due to the dependencies. I looked at a CUDA matrix inversion code, and they didn't parallelize the outer loop either, so I think this is on the right track. Though, I'm not an expert in this algorithm, so this is only a suggestion.

I did move the loop that computes the is and js values out of the main k loop, which may or may not be correct. The output values are the same as the CPU version's, though. I'll leave it to you whether this is correct; if not, then these loops can't be parallelized due to the dependency on "dd".

```fortran
subroutine invam(a,n)

dimension a(n,n),is(n),js(n)
double precision a,dd,t

l=1

!$acc data copy(a) create(is,js)

!$acc kernels loop
do 101 k=1,n
dd=0.
do 10 i=k,n
do 10 j=k,n
if(abs(a(i,j)).gt.dd) then
dd=abs(a(i,j))
is(k)=i
js(k)=j
endif
10 continue

if(dd+1.0.eq.1.0) then
l=0
print *, 'err, not inv'
!       return
endif
101 continue

do 100 k=1,n
!$acc kernels loop
do 30 j=1,n
t=a(k,j)
a(k,j)=a(is(k),j)
a(is(k),j)=t
30 continue

!$acc kernels loop
do 40 i=1,n
t=a(i,k)
a(i,k)=a(i,js(k))
a(i,js(k))=t
40 continue

!$acc serial
a(k,k)=1./a(k,k)
!$acc end serial

!$acc parallel loop
do 50 j=1,n
if(j.ne.k) then
a(k,j)=a(k,j)*a(k,k)
endif
50 continue

!$acc parallel loop
do 70 i=1,n
if(i.ne.k) then
!$acc loop
do 60 j=1,n
if(j.ne.k) then
a(i,j)=a(i,j)-a(i,k)*a(k,j)
endif
60      continue
endif
70 continue

!$acc parallel loop
do 80 i=1,n
if(i.ne.k) then
a(i,k)=-a(i,k)*a(k,k)
endif
80 continue
100 continue

do 130 k=n,1,-1
!$acc parallel loop
do 110 j=1,n
t=a(k,j)
a(k,j)=a(js(k),j)
a(js(k),j)=t
110 continue

!$acc parallel loop
do 120 i=1,n
t=a(i,k)
a(i,k)=a(i,is(k))
a(i,is(k))=t
120 continue
130 continue

!$acc end data

return
end subroutine invam
```

-Mat

Thanks, Mat.
I did some tests; it can be ported to the GPU, though the speedup was not so satisfying. But still, thanks!
And I found a new, interesting thing, and I wonder how it works.
Here is the code; it's a triple loop:

```fortran
!$acc kernels
do 21 ip=1,ncy

do i=iac_cy-nij_cy-5,iac_cy+nij_cy+5!
do j=jac_cy-nij_cy*2-5,jac_cy+nij_cy*2+5!
drx=(x(i)-xycy(ip,1))/msp
dry=(y(j)-xycy(ip,2))/msp
if(abs(drx).le.dh.and.abs(dry).le.dh) then
if(abs(drx).le.0.5) then
detax=3./8.+pi/32.-drx**2/4.
elseif(abs(drx).le.1.5) then
detax=1./4.+(1.-abs(drx))*sqrt(-2.+8.*abs(drx)&
-4.*drx**2)/8.-asin(sqrt(2.)*(abs(drx)-1.))/8.
else
detax=17./16.-pi/64.-3.*abs(drx)/4.+drx**2/8.&
+(abs(drx)-2.)*sqrt(-14.+16*abs(drx)-4.*drx**2)/16.&
+asin(sqrt(2.)*(abs(drx)-2.))/16.
endif
if(abs(dry).le.0.5) then
detay=3./8.+pi/32.-dry**2/4.
elseif(abs(dry).le.1.5) then
detay=1./4.+(1.-abs(dry))*sqrt(-2.+8.*abs(dry)&
-4.*dry**2)/8.-asin(sqrt(2.)*(abs(dry)-1.))/8.
else
detay=17./16.-pi/64.-3.*abs(dry)/4.+dry**2/8.&
+(abs(dry)-2.)*sqrt(-14.+16*abs(dry)-4.*dry**2)/16.&
+asin(sqrt(2.)*(abs(dry)-2.))/16.
endif
ux(i,j)=ux(i,j)-0.5*fbcy(ip,1)*detax*detay*dt/msp/msp
uy(i,j)=uy(i,j)-0.5*fbcy(ip,2)*detax*detay*dt/msp/msp

ffx(i,j)=ffx(i,j)-fbcy(ip,1)*detax*detay*dt/msp/msp
ffy(i,j)=ffy(i,j)-fbcy(ip,2)*detax*detay*dt/msp/msp
end if
end do
end do

21 continue
!$acc end kernels
```

And I got compiler feedback like this:

```text
324, Loop is parallelizable
325, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
322, !$acc loop seq
324, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
325, !$acc loop gang ! blockidx%y
```

It seems the outer loop runs serially, so I moved the kernels directive to the inner doubly nested loop, like this:

```fortran
do 21 ip=1,ncy
!$acc kernels
do i=iac_cy-nij_cy-5,iac_cy+nij_cy+5!
do j=jac_cy-nij_cy*2-5,jac_cy+nij_cy*2+5!
......
```

It's a part of my program, and I only modified this place; the rest remained the same. But after the modification, it's slower!
In the first version, the whole program took 3067 s.
In the second version, it took 3670 s.
I wonder what the difference is between these two versions of the code. It seems both outer loops run serially, but the running times differ. By the way, the serial version of the program takes 31881 s.

Try running with "PGI_ACC_TIME=1" to get a quick profile of each version. This will give a better idea of what's going on. Also, if you have a data set that runs more quickly, that may be helpful.
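For example (where "a.out" stands in for your own executable):

```shell
# Enable the PGI runtime's built-in accelerator profile for one run
export PGI_ACC_TIME=1
./a.out
```

At program exit the runtime prints, per accelerator region, the number of kernel launches and the time spent in each, so you can compare how many kernels the two versions actually launch.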

I think the compiler might be interchanging the loops in the first case, i.e. changing it to:

```fortran
!$acc kernels
do i=iac_cy-nij_cy-5,iac_cy+nij_cy+5!
do j=jac_cy-nij_cy*2-5,jac_cy+nij_cy*2+5!
do 21 ip=1,ncy
```

This way, the loops can be launched as one kernel versus "ncy" kernels.

The output from profiling will help to confirm this.

-Mat