Hi, I have the following piece of code, in which I parallelized the DAXPY part with OpenMP. However, no matter how many processors I use, there is almost no speedup. I have checked the code many times but cannot find the cause of the problem. Can anyone help? Thank you so much!
Here is the code:
      program xgemmtest
      implicit real*8 (A-H,O-Z)
      integer i,j,k,l
      integer n,m
      parameter (n=3432,m=14)
      integer omp_get_num_threads
      integer NP,NChunk
      Real*8 XA(n,n),XCa(n,n),XMA(n,n),Dia(n)
      Real XList((n/2)*(m*(m-1)/2))
      Integer List(2,(n/2)*(m*(m-1)/2))
      character DayTim*24
      real*8 Zero, One, Ten
      data Zero/0.0d0/, One/1.0d0/, Ten/10.0d0/
C
C     Zero the accumulators XA, XMA and the diagonal Dia.
      do i=1,n
         Dia(i)=Zero
         do j=1,n
            XA(j,i)=Zero
            XMA(j,i)=Zero
         end do
      end do
C
C     Fill XCa with a constant value.
      do i=1,n
         do j=1,n
            XCa(j,i)=One/(Ten**4)
         end do
      end do
C
C     Query the number of threads (stays 1 without OpenMP).
      NP=1
C$OMP PARALLEL
C$OMP SINGLE
C$    NP=omp_get_num_threads()
C$OMP END SINGLE
C$OMP END PARALLEL
      call FDate(DayTim)
      write(*,*) DayTim
C
C     Repeat the whole update 50 times for timing.
      do iijj=1,50
         icont=0
         itmp=0
         itmp2=0
C        Build the diagonal values and the list of (j,i) pairs.
C        Pairs are generated only for i = 1, 69, 137, ...
         do i=1,n
            Dia(i)=real(((i*(i-1))/2)+i)
            if(i.eq.(itmp2*68+1))then
               itmp2=itmp2+1
               do j=i+1,n
                  itmp=itmp+1
                  l=((j*(j-1))/2)+i
                  icont=icont+1
                  List(1,icont)=j
                  List(2,icont)=i
                  XList(icont)=real(l)
               end do
            endif
         end do
C
C        Diagonal part: one DAXPY per column IDial.
         NChunk=(n+NP-1)/NP
C$OMP Parallel Do Schedule(dynamic,NChunk) Default(Shared)
C$OMP+ Private(IDial,val,Icol)
         do IDial=1,n
            val=Dia(IDial)
            do Icol=1,n
               XMA(Icol,IDial)=XMA(Icol,IDial)+val*XCa(Icol,IDial)
               XA(Icol,IDial)=XA(Icol,IDial)+val*XCa(Icol,IDial)
            end do
         end do
C$OMP end parallel do
C        Off-diagonal part: each (Ktmp,Ltmp) pair updates two
C        columns of XMA and XA; several pairs share a column.
         NChunk=(icont+NP-1)/NP
C$OMP Parallel Do Schedule(dynamic,NChunk) Default(Shared)
C$OMP+ Private(IND,Ktmp,Ltmp,val,icol)
         do IND=1,icont
            Ktmp=List(1,IND)
            Ltmp=List(2,IND)
            val=XList(IND)
            do icol=1,n
               XMA(icol,Ktmp)=XMA(icol,Ktmp)+val*XCa(icol,Ltmp)
               XMA(icol,Ltmp)=XMA(icol,Ltmp)+val*XCa(icol,Ktmp)
               XA(icol,Ktmp)=XA(icol,Ktmp)+val*XCa(icol,Ltmp)
               XA(icol,Ltmp)=XA(icol,Ltmp)+val*XCa(icol,Ktmp)
            end do
         end do
C$OMP end parallel do
      end do
      call FDate(DayTim)
      write(*,*) DayTim
      stop
      end
The compiler and library I used are:
pgf77 -i8 -mp -O2 -tp p7-64 -time -fast -o xvect.exe xvect.F -lacml
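Since FDate only resolves whole seconds, a small difference between runs could be hidden by the timing itself. Below is a minimal sketch of how a single DAXPY-style loop could be timed with omp_get_wtime instead. The program name timedaxpy, the array size, and the fill values are placeholders made up for illustration and are not taken from the code above. It has to be built with OpenMP enabled (-mp with pgf77), otherwise omp_get_wtime will not resolve at link time.

```fortran
      program timedaxpy
      implicit none
      integer n
C     Hypothetical problem size, chosen only for illustration.
      parameter (n=1000)
      real*8 x(n,n), y(n,n), a
      real*8 t0, t1
      real*8 omp_get_wtime
      integer i, j
C
C     Fill the arrays with simple placeholder values.
      a = 2.0d0
      do i=1,n
         do j=1,n
            x(j,i) = 1.0d0
            y(j,i) = 0.0d0
         end do
      end do
C
C     Time only the parallel loop with the OpenMP wall clock.
      t0 = omp_get_wtime()
C$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i,j)
      do i=1,n
         do j=1,n
            y(j,i) = y(j,i) + a*x(j,i)
         end do
      end do
C$OMP END PARALLEL DO
      t1 = omp_get_wtime()
C
      write(*,*) 'elapsed seconds: ', t1-t0
      write(*,*) 'check y(1,1) = ', y(1,1)
      end
```

Running this with OMP_NUM_THREADS set to 1, 2, 4, ... and comparing the elapsed times should show more directly whether the loop itself scales.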
Thank you!