Hi all,
I am trying to tune a numerical computing program with OpenMP on a multi-core AMD machine. I found that, when run with one thread, the program compiled with -mp is much slower than the one compiled without -mp. Here is a simple test:
!$OMP PARALLEL
!$OMP DO PRIVATE(i,j,k,tmp)
      do i=1,nx
        do j=1,ny
          do k=1,nz
            tmp = c0*(a(k-4,j,i)+a(k+4,j,i))
     &          + c1*(a(k-3,j,i)+a(k+3,j,i))
     &          + c2*(a(k-2,j,i)+a(k+2,j,i))
     &          + c3*(a(k-1,j,i)+a(k+1,j,i))
     &          + c4*a(k,j,i)
            b(k,j,i) = b(k,j,i)+c5*tmp
          enddo
        enddo
      enddo
!$OMP END PARALLEL
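To reproduce the timing, the kernel can be wrapped in a complete program along the lines of the sketch below. This is only an illustrative driver: the array sizes, the halo bounds of a, the coefficient values and the repeat count nrep are assumptions, not taken from the original program. It times with the standard system_clock intrinsic rather than omp_get_wtime, so the same fixed-form source builds both with and without -mp, and it checks every element of b against the value expected when a is 1 everywhere.

```fortran
      program my_test
!     Hypothetical driver: sizes, coefficients and nrep are assumptions
!     chosen only to make the kernel runnable and timeable.
      implicit none
      integer, parameter :: nx=128, ny=128, nz=128, nrep=20
      integer, parameter :: i8 = selected_int_kind(18)
      real, allocatable :: a(:,:,:), b(:,:,:)
      real :: c0, c1, c2, c3, c4, c5, tmp, expected
      integer :: i, j, k, it
      integer(i8) :: t0, t1, rate

!     a needs 4 halo points in k for the a(k-4)...a(k+4) accesses
      allocate(a(-3:nz+4,ny,nx), b(nz,ny,nx))
      a = 1.0
      b = 0.0
!     Power-of-two coefficients keep every result exactly representable
      c0 = 0.0625
      c1 = 0.125
      c2 = 0.25
      c3 = 0.5
      c4 = 1.0
      c5 = 0.5

!     system_clock instead of omp_get_wtime, so no OpenMP library
!     routine is needed when compiling without -mp
      call system_clock(t0, rate)
      do it=1,nrep
!$OMP PARALLEL
!$OMP DO PRIVATE(i,j,k,tmp)
        do i=1,nx
          do j=1,ny
            do k=1,nz
              tmp = c0*(a(k-4,j,i)+a(k+4,j,i))
     &            + c1*(a(k-3,j,i)+a(k+3,j,i))
     &            + c2*(a(k-2,j,i)+a(k+2,j,i))
     &            + c3*(a(k-1,j,i)+a(k+1,j,i))
     &            + c4*a(k,j,i)
              b(k,j,i) = b(k,j,i)+c5*tmp
            enddo
          enddo
        enddo
!$OMP END PARALLEL
      enddo
      call system_clock(t1)

!     With a = 1 everywhere, every element of b should equal this value
      expected = nrep*c5*(2.0*(c0+c1+c2+c3)+c4)
      if (maxval(abs(b-expected)) > 1.0e-4) then
        print *, 'FAIL: unexpected result'
        stop 1
      endif
      print *, 'elapsed seconds:', real(t1-t0)/real(rate)
      end program my_test
```

Building this once with and once without -mp (same other flags) and running both executables with OMP_NUM_THREADS=1 gives a direct single-thread comparison.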
I used the -Minfo option to display the compile-time optimization listings. It seems that -Mvect=sse (enabled by -fastsse) conflicts with -mp. The difference is shown below:
pgf90 -tp k8-64 -fastsse -Minfo -Mneginfo -c -o test.o test.f
my_test:
     19, Generated 3 alternate loops for the inner loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
pgf90 -tp k8-64 -fastsse -mp -Minfo -Mneginfo -c -o test.o test.f
my_test:
     15, Parallel region activated
     17, Parallel loop activated; static block iteration allocation
     19, Unrolled inner loop 8 times
         Generated 2 prefetch instructions for this loop
     29, Barrier
         Parallel region terminated
How can I make them work together? Any suggestions are welcome.
Thanks!