MPI with 1 thread is faster than 8 threads in my code

Here is my code. In the first version I don’t use MPI send or receive; I just allocate some matrices and use them … The hardware I use is an Intel® Core™ i7-4790 CPU @ 4.00GHz with 16Mb RAM …

After initializing MPI in Fortran, the time-consuming part of my code runs in parallel, as shown below:

do k=1,Nt

do i=2,(N_z)-1
do j=begin_col,end_col

     p(i,j)=-m(i,j)+c(i,1)*((-4._fp_kind+2._fp_kind/c(i,1))*t(i,j) &
                           +t(i-1,j) & 
                           +t(i+1,j) & 
          +(1._fp_kind-ds/(2._fp_kind*y(1,j)))*t(i,j-1) &
          +(1._fp_kind+ds/(2._fp_kind*y(1,j)))*t(i,j+1))

end do
end do

end do

It runs 4 times faster than with 1 thread. When I change my code by adding some code after that part, which is not time-consuming, as shown below, 1 thread takes less time to execute than 8 threads … With 8 threads, my CPU usage is 100% …

do k=1,Nt

do i=2,(N_z)-1
do j=begin_col,end_col

     p(i,j)=-m(i,j)+c(i,1)*((-4._fp_kind+2._fp_kind/c(i,1))*t(i,j) &
                           +t(i-1,j) & 
                           +t(i+1,j) & 
          +(1._fp_kind-ds/(2._fp_kind*y(1,j)))*t(i,j-1) &
          +(1._fp_kind+ds/(2._fp_kind*y(1,j)))*t(i,j+1))

end do
end do

if (taskid == 0) then

   do i = 2,(N_z)-1
      p(i,1)=-(m(i,1))+(c(i,1))*&
             ((-6._fp_kind+(2._fp_kind/(c(i,1))))*(t(i,1)) &
             +(t(i+1,1)) &
             +(t(i-1,1)) &
             +4._fp_kind*(t(i,2)))
   end do

end if

end do

I’d appreciate it if someone could tell me the reason …

Best regards

Hi @@ali@@,

It’s difficult to say exactly since there’s not enough information.

For the MPI code, did you decompose the domain across multiple ranks or are you running this section redundantly?
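To make the decomposition question concrete, here is a minimal sketch of how a column range could be split across ranks. It is only an illustration, not the code from the original post: `begin_col`, `end_col` and `taskid` are the names used in the post, while `split_columns`, `numtasks`, `n_cols` and the choice of splitting columns 2 to `n_cols-1` are assumptions made for the example. It runs without MPI by looping over the rank numbers; in a real program `taskid` and `numtasks` would come from `MPI_Comm_rank` and `MPI_Comm_size`.

```fortran
module col_split
  implicit none
contains
  ! Split columns first_col..last_col into numtasks contiguous blocks.
  ! The first mod(ncols, numtasks) ranks each get one extra column.
  subroutine split_columns(first_col, last_col, taskid, numtasks, &
                           begin_col, end_col)
    integer, intent(in)  :: first_col, last_col, taskid, numtasks
    integer, intent(out) :: begin_col, end_col
    integer :: ncols, base, extra

    ncols = last_col - first_col + 1
    base  = ncols / numtasks        ! columns every rank gets
    extra = mod(ncols, numtasks)    ! leftover columns to hand out

    begin_col = first_col + taskid*base + min(taskid, extra)
    end_col   = begin_col + base - 1
    if (taskid < extra) end_col = end_col + 1
  end subroutine split_columns
end module col_split

program demo
  use col_split
  implicit none
  integer, parameter :: numtasks = 4   ! assumed: one rank per physical core
  integer, parameter :: n_cols   = 12  ! hypothetical grid width
  integer :: taskid, begin_col, end_col

  ! Stand-in for the MPI ranks: show which columns each rank would own.
  do taskid = 0, numtasks-1
     call split_columns(2, n_cols-1, taskid, numtasks, begin_col, end_col)
     print '(a,i0,a,i0,a,i0)', 'rank ', taskid, ': columns ', &
           begin_col, ' to ', end_col
  end do
end program demo
```

If every rank reports the full column range instead of its own slice, the loop is being run redundantly on all ranks, and adding ranks cannot speed it up.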

You say that the CPU utilization was 100%. Is this per rank or total? If it’s total, then you may not be running multiple ranks, or they may all be bound to the same core. Note that an i7-4790 has 4 physical cores, so you should probably limit your program to running 4 ranks. Using the extra 4 hyper-threads will typically hurt performance.

You might consider using the PGI profiler, pgprof, to see where your time is being spent. It may give clues as to what’s going on. See: http://www.pgroup.com/doc/pgprofug.pdf

  • Mat

Thank you for your reply, Mat … I decomposed the domain across multiple ranks, and my CPU usage is 100% in total for 8 threads …