Here is my code, shown below. In the first version I don't use MPI send or receive; I just allocate some matrices and then use them. The hardware I use is an Intel® Core™ i7-4790 CPU @ 4.00GHz with 16Mb RAM.

After initializing MPI in Fortran, the time-consuming part of my code is executed in parallel, as shown below:

```
! Outer loop over k (Nt iterations); each rank updates only its own
! columns begin_col..end_col of p
do k=1,Nt
   do i=2,(N_z)-1
      do j=begin_col,end_col
         p(i,j)=-m(i,j)+c(i,1)*((-4._fp_kind+2._fp_kind/c(i,1))*t(i,j) &
               +t(i-1,j) &
               +t(i+1,j) &
               +(1._fp_kind-ds/(2._fp_kind*y(1,j)))*t(i,j-1) &
               +(1._fp_kind+ds/(2._fp_kind*y(1,j)))*t(i,j+1))
      end do
   end do
end do
```
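The post does not show how `begin_col` and `end_col` are computed. As a minimal sketch only, one common way to split a contiguous range of columns across ranks looks like the following. The module `col_split`, the subroutine `column_range`, and the column range 2 to 11 are hypothetical names and values chosen for illustration; they are not taken from the code above.

```fortran
module col_split
  implicit none
contains
  ! Hypothetical helper (not from the original code): split the columns
  ! first_col..last_col into numtasks contiguous blocks and return the
  ! block owned by rank taskid (ranks are numbered from 0, as in MPI).
  subroutine column_range(first_col, last_col, numtasks, taskid, begin_col, end_col)
    integer, intent(in)  :: first_col, last_col, numtasks, taskid
    integer, intent(out) :: begin_col, end_col
    integer :: ncols, base, extra

    ncols = last_col - first_col + 1
    base  = ncols / numtasks        ! columns every rank gets
    extra = mod(ncols, numtasks)    ! leftover columns, one each to the first ranks

    begin_col = first_col + taskid*base + min(taskid, extra)
    end_col   = begin_col + base - 1
    if (taskid < extra) end_col = end_col + 1
  end subroutine column_range
end module col_split

program demo_col_split
  use col_split
  implicit none
  integer :: taskid, b, e

  ! Assumed example: columns 2..11 shared by 4 ranks
  do taskid = 0, 3
     call column_range(2, 11, 4, taskid, b, e)
     print '(a,i0,a,i0,a,i0)', 'rank ', taskid, ': columns ', b, ' to ', e
  end do
end program demo_col_split
```

With a split like this, the blocks differ in size by at most one column, so the main loop itself should give every rank nearly the same amount of work.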

This runs 4 times faster than with 1 thread. When I change my code and add some code after that part, which is not time-consuming, as shown below, 1 thread takes less time to execute than 8 threads. When running with 8 threads, my CPU usage is 100%.

```
do k=1,Nt
   ! Main update: each rank handles its own columns begin_col..end_col
   do i=2,(N_z)-1
      do j=begin_col,end_col
         p(i,j)=-m(i,j)+c(i,1)*((-4._fp_kind+2._fp_kind/c(i,1))*t(i,j) &
               +t(i-1,j) &
               +t(i+1,j) &
               +(1._fp_kind-ds/(2._fp_kind*y(1,j)))*t(i,j-1) &
               +(1._fp_kind+ds/(2._fp_kind*y(1,j)))*t(i,j+1))
      end do
   end do
   ! Added part: only rank 0 updates the first column (j = 1)
   if (taskid == 0) then
      do i = 2,(N_z)-1
         p(i,1)=-(m(i,1))+(c(i,1))*&
               ((-6._fp_kind+(2._fp_kind/(c(i,1))))*(t(i,1)) &
               +(t(i+1,1)) &
               +(t(i-1,1)) &
               +4._fp_kind*(t(i,2)))
      end do
   end if
end do
```
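A minimal sketch of how the time spent in such a loop could be measured on every rank, so that the 1-process and 8-process runs can be compared with numbers. It uses the standard MPI routines `MPI_Wtime`, `MPI_Barrier` and `MPI_Reduce`. The program name, the assumption that `fp_kind` is double precision, and the placeholder workload are illustrative only; the placeholder stands in for the `do k=1,Nt` loop and is not the actual computation.

```fortran
program time_per_rank
  use mpi
  implicit none
  integer, parameter :: fp_kind = kind(1.0d0)   ! assumption: double precision
  integer :: ierr, taskid, numtasks, k
  real(fp_kind) :: work
  double precision :: t_start, local_time, min_time, max_time

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, taskid, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, numtasks, ierr)

  ! Start all ranks together so the timings are comparable
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t_start = MPI_Wtime()

  ! Placeholder workload; the do k=1,Nt loop would go here
  work = 0._fp_kind
  do k = 1, 1000000
     work = work + sqrt(real(k, fp_kind))
  end do

  local_time = MPI_Wtime() - t_start

  ! Collect the fastest and slowest rank on rank 0
  call MPI_Reduce(local_time, min_time, 1, MPI_DOUBLE_PRECISION, MPI_MIN, &
                  0, MPI_COMM_WORLD, ierr)
  call MPI_Reduce(local_time, max_time, 1, MPI_DOUBLE_PRECISION, MPI_MAX, &
                  0, MPI_COMM_WORLD, ierr)

  if (taskid == 0) then
     print '(a,i0,a,f10.6,a,f10.6)', 'ranks=', numtasks, &
           '  min time (s)=', min_time, '  max time (s)=', max_time
     print '(a,es12.5)', 'checksum=', work   ! keeps the placeholder loop from being optimized away
  end if

  call MPI_Finalize(ierr)
end program time_per_rank
```

A large gap between the minimum and maximum time would suggest that some ranks are doing more work or waiting on others; similar times on all ranks would point elsewhere.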

I'd appreciate it if someone could tell me the reason.

Best regards