How can I make -Mvect=sse and -mp work together?

Hi all,

I am trying to tune a numerical computing program with OpenMP on a multi-core AMD machine. I found that the program built with the -mp option is much slower than the one built without -mp when both run with one thread. Here is a simple test case:

!$OMP PARALLEL
!$OMP DO PRIVATE(i,j,k)
      do i=1,nx
         do j=1,ny
            do k=1,nz
               tmp = c0*(a(k-4,j,i)+a(k+4,j,i))
     &             + c1*(a(k-3,j,i)+a(k+3,j,i))
     &             + c2*(a(k-2,j,i)+a(k+2,j,i))
     &             + c3*(a(k-1,j,i)+a(k+1,j,i))
     &             + c4*a(k,j,i)
               b(k,j,i) = b(k,j,i)+c5*tmp
            enddo
         enddo
      enddo
!$OMP END PARALLEL

I used the -Minfo option to display the compile-time optimization listings. It seems that the -Mvect=sse option conflicts with -mp. The difference is shown below:

pgf90 -tp k8-64 -fastsse -Minfo -Mneginfo  -c -o test.o test.f
my_test:
    19, Generated 3 alternate loops for the inner loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop

pgf90 -tp k8-64 -fastsse -mp -Minfo -Mneginfo  -c -o test.o test.f
my_test:
    15, Parallel region activated
    17, Parallel loop activated; static block iteration allocation
    19, Unrolled inner loop 8 times
        Generated 2 prefetch instructions for this loop
    29, Barrier
        Parallel region terminated

How can I make them work together? Any suggestions are welcome.

Thanks!

You might need to declare tmp to be private.
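A minimal sketch of what that change might look like, assuming the loop from the original post. The only change to the directives is adding tmp to the PRIVATE clause (plus an explicit END DO for readability). The array bounds, problem sizes, and coefficient values are made up here just so the example compiles on its own; they are not from the original program.

```fortran
      program stencil_private
      implicit none
! Hypothetical sizes, chosen only for illustration
      integer, parameter :: nx = 8, ny = 8, nz = 16
! a carries a halo of 4 points on each side in k so that
! a(k-4,...) and a(k+4,...) stay in bounds
      real a(-3:nz+4,ny,nx), b(nz,ny,nx)
      real c0, c1, c2, c3, c4, c5, tmp
      integer i, j, k

! Hypothetical coefficients and data
      c0 = 0.1
      c1 = 0.2
      c2 = 0.3
      c3 = 0.4
      c4 = 0.5
      c5 = 2.0
      a = 1.0
      b = 0.0

! tmp is written by every iteration, so each thread needs its
! own copy; without PRIVATE(tmp) it is shared by default
!$OMP PARALLEL
!$OMP DO PRIVATE(i,j,k,tmp)
      do i=1,nx
         do j=1,ny
            do k=1,nz
               tmp = c0*(a(k-4,j,i)+a(k+4,j,i))
     &             + c1*(a(k-3,j,i)+a(k+3,j,i))
     &             + c2*(a(k-2,j,i)+a(k+2,j,i))
     &             + c3*(a(k-1,j,i)+a(k+1,j,i))
     &             + c4*a(k,j,i)
               b(k,j,i) = b(k,j,i)+c5*tmp
            enddo
         enddo
      enddo
!$OMP END DO
!$OMP END PARALLEL

      print *, b(1,1,1)
      end
```

Building this with the same flags as in the original post (-fastsse -mp -Minfo -Mneginfo) should show whether the inner loop is now reported as vectorized.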

Yes, I made a mistake here. Thanks, brentl.

After I declared tmp as private, the optimization information is still a little different from what I get without the -mp flag.

    15, Parallel region activated
    17, Parallel loop activated; static block iteration allocation
    19, Generated an alternate loop for the inner loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
    29, Barrier
        Parallel region terminated

Any suggestions?

Our altcode generator makes decisions based on a number of factors, and being in a parallel region is among them. That is why the output differs. If you find that it makes a big performance difference, you should let us know. Since the code now vectorizes in both cases, it should be running fairly well.

Thanks, the two versions run at almost the same speed.