What is wrong with this parallel code?

Hi, I have the following code, in which I parallelized the DAXPY part. However, no matter how many processors I use, there is almost no speedup. I have checked the code many times but can't find the cause of the problem. Can anyone help? Thank you so much!

Here is the code:

      program xgemmtest
      implicit real*8 (A-H,O-Z)
      integer i,j,k,l
      integer n,m
      parameter (n=3432,m=14)
      integer omp_get_num_threads
      integer NP,NChunk
      Real*8 XA(n,n),XCa(n,n),XMA(n,n),Dia(n)
      Real XList((n/2)*(m*(m-1)/2))
      Integer List(2,(n/2)*(m*(m-1)/2))
      character DayTim*24
      real*8 Zero, One, Ten
      data Zero/0.0d0/, One/1.0d0/, Ten/10.0d0/
C
      do i=1,n
        Dia(i)=Zero
        do j=1,n
          XA(j,i)=Zero
          XMA(j,i)=Zero
        end do
      end do
C
      do i=1,n
        do j=1,n
          XCa(j,i)=One/(Ten**4)
        end do
      end do
C
      NP=1
C$OMP PARALLEL 
C$OMP SINGLE
C$      NP=omp_get_num_threads()
C$OMP END SINGLE
C$OMP END PARALLEL
      call FDate(DayTim)
      write(*,*) DayTim
C
      do iijj=1,50
      icont=0
      itmp=0
      itmp2=0
      do i=1,n
        Dia(i)=real(((i*(i-1))/2)+i)
        if(i.eq.(itmp2*68+1))then
          itmp2=itmp2+1
          do j=i+1,n
            itmp=itmp+1
            l=((j*(j-1))/2)+i
            icont=icont+1
            List(1,icont)=j
            List(2,icont)=i
            XList(icont)=real(l)
          end do
        endif
      end do
C
      NChunk=(n+NP-1)/NP
C$OMP Parallel Do Schedule(dynamic,NChunk) Default(Shared)
C$OMP+  Private(IDial,val,Icol)
      do IDial=1,n
        val=Dia(IDial)
        do Icol=1,n
          XMA(Icol,IDial)=XMA(Icol,IDial)+val*XCa(Icol,IDial)
          XA(Icol,IDial)=XA(Icol,IDial)+val*XCa(Icol,IDial)
        end do
      end do
C$OMP end parallel do
      NChunk=(icont+NP-1)/NP
C$OMP Parallel Do Schedule(dynamic,NChunk) Default(Shared)
C$OMP+  Private(IND,Ktmp,Ltmp,val,icol)
      do IND=1,icont
        Ktmp=List(1,IND)
        Ltmp=List(2,IND)
        val=XList(IND)
        do icol=1,n
          XMA(icol,Ktmp)=XMA(icol,Ktmp)+val*XCa(icol,Ltmp)
          XMA(icol,Ltmp)=XMA(icol,Ltmp)+val*XCa(icol,Ktmp)
          XA(icol,Ktmp)=XA(icol,Ktmp)+val*XCa(icol,Ltmp)
          XA(icol,Ltmp)=XA(icol,Ltmp)+val*XCa(icol,Ktmp)
        end do
      end do
C$OMP end parallel do
      end do
      call FDate(DayTim)
      write(*,*) DayTim
      stop
      end

Here is the compile command I used, including the library:

pgf77 -i8  -mp  -O2 -tp p7-64 -time -fast -o xvect.exe xvect.F -lacml

Thank you!

Hi Sharp,

I ran your code on a 2-socket/8-core Penryn system and a 2-socket/8-core Barcelona system, and got the following run times:

Threads / Sockets   Barcelona   Penryn
1 / 1               2:11        1:55
2 / 1               1:28        1:53
2 / 2               1:28        1:16
4 / 1               1:29        1:55
4 / 2               1:14        1:34
8 / 2               1:14        1:14

The code does show speed-up, but seems to have problems when multiple threads are bound to the same socket. Most likely you’re hitting a memory bandwidth limit.

  • Mat
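
One way to check the memory-bandwidth explanation is to time only the first update loop and convert the result to an approximate data rate. The program below is a rough sketch, not a definitive benchmark. The program name bwtest, the smaller size n=2000, and the repeat count nrep are assumptions chosen for illustration and do not come from the original post. The figure of 40 bytes per inner iteration counts three loads and two stores of real*8 values and ignores cache effects.

```fortran
      program bwtest
C     Hypothetical bandwidth check; not part of the original code.
      implicit none
      integer n, nrep
      parameter (n=2000, nrep=20)
      integer i, j, irep, nthr
      real*8 xa(n,n), xma(n,n), xca(n,n), dia(n)
      real*8 t0, t1, gbytes
      real*8 omp_get_wtime
      integer omp_get_max_threads
C
      nthr=omp_get_max_threads()
C     Initialize with the same static schedule as the timed loop.
C$OMP parallel do default(shared) private(i,j) schedule(static)
      do i=1,n
        dia(i)=dble(i)
        do j=1,n
          xa(j,i)=0.0d0
          xma(j,i)=0.0d0
          xca(j,i)=1.0d-4
        end do
      end do
C$OMP end parallel do
C
C     Time nrep passes of the column-wise update.
      t0=omp_get_wtime()
      do irep=1,nrep
C$OMP parallel do default(shared) private(i,j) schedule(static)
        do i=1,n
          do j=1,n
            xma(j,i)=xma(j,i)+dia(i)*xca(j,i)
            xa(j,i)=xa(j,i)+dia(i)*xca(j,i)
          end do
        end do
C$OMP end parallel do
      end do
      t1=omp_get_wtime()
C
C     Assumed traffic: 3 loads + 2 stores of real*8 = 40 bytes
C     per inner iteration.
      gbytes=40.0d0*dble(n)*dble(n)*dble(nrep)/1.0d9
      write(*,*) 'threads      =', nthr
      write(*,*) 'seconds      =', t1-t0
      write(*,*) 'approx GB/s  =', gbytes/(t1-t0)
C     Print one element so the compiler cannot discard the loops.
      write(*,*) 'check xa(n,n)=', xa(n,n)
      end
```

Build it with OpenMP enabled (for example, the -mp flag from the command above) and run it with OMP_NUM_THREADS set to 1, 2, 4, and so on. If the reported GB/s stops increasing as threads are added on one socket, that is consistent with a memory bandwidth limit rather than a problem in the parallel directives.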