How can I make -Mvect=sse and -mp work together?

tangoman · August 23, 2007, 9:33am

Hi all,

I am trying to tune a numerical computing program with OpenMP on a multi-core AMD machine. I found that the program built with the -mp option is much slower than the one built without -mp when it runs with one thread. A simple test case follows:

!$OMP PARALLEL
!$OMP DO PRIVATE(i,j,k)
      do i=1,nx
         do j=1,ny
            do k=1,nz
               tmp = c0*(a(k-4,j,i)+a(k+4,j,i))
     &             + c1*(a(k-3,j,i)+a(k+3,j,i))
     &             + c2*(a(k-2,j,i)+a(k+2,j,i))
     &             + c3*(a(k-1,j,i)+a(k+1,j,i))
     &             + c4*a(k,j,i)
               b(k,j,i) = b(k,j,i)+c5*tmp
            enddo
         enddo
      enddo
!$OMP END PARALLEL

I use the -Minfo option to display compile-time optimization listings. It seems that the option -Mvect=sse conflicts with -mp. The difference is shown below:

pgf90 -tp k8-64 -fastsse -Minfo -Mneginfo  -c -o test.o test.f
my_test:
    19, Generated 3 alternate loops for the inner loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop

pgf90 -tp k8-64 -fastsse -mp -Minfo -Mneginfo  -c -o test.o test.f
my_test:
    15, Parallel region activated
    17, Parallel loop activated; static block iteration allocation
    19, Unrolled inner loop 8 times
        Generated 2 prefetch instructions for this loop
    29, Barrier
        Parallel region terminated

How can I make them work together? Any suggestions are welcome.

Thanks!

brentl · August 25, 2007, 12:44am

You might need to declare tmp to be private.
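
A minimal sketch of what that change might look like, written as a self-contained fixed-form test program. The array sizes, coefficient values, and initial data are hypothetical and chosen only so the example compiles and runs; the loop body and directives follow the original post, with tmp added to the PRIVATE clause. Without that clause tmp is shared by default, so every thread writes to the same scalar, which is a data race and may also be what kept the compiler from vectorizing the inner loop in the -mp listing above.

```fortran
      program my_test
      implicit none
!     Hypothetical problem size, for illustration only
      integer nx, ny, nz
      parameter (nx=16, ny=16, nz=128)
!     a carries a halo of 4 points on each side in k
      real a(-3:nz+4, ny, nx), b(nz, ny, nx)
      real c0, c1, c2, c3, c4, c5, tmp
      integer i, j, k

!     Hypothetical coefficients and data
      c0 = 0.1
      c1 = 0.2
      c2 = 0.3
      c3 = 0.4
      c4 = 0.5
      c5 = 2.0
      a = 1.0
      b = 0.0

!$OMP PARALLEL
!     tmp is now private, so each thread has its own copy
!$OMP DO PRIVATE(i,j,k,tmp)
      do i=1,nx
         do j=1,ny
            do k=1,nz
               tmp = c0*(a(k-4,j,i)+a(k+4,j,i))
     &             + c1*(a(k-3,j,i)+a(k+3,j,i))
     &             + c2*(a(k-2,j,i)+a(k+2,j,i))
     &             + c3*(a(k-1,j,i)+a(k+1,j,i))
     &             + c4*a(k,j,i)
               b(k,j,i) = b(k,j,i)+c5*tmp
            enddo
         enddo
      enddo
!$OMP END DO
!$OMP END PARALLEL

      print *, b(1,1,1), b(nz,ny,nx)
      end
```

Compiling this with the same command line as in the original post (with -mp and -Minfo) should show whether the inner loop is now reported as vectorized.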

tangoman · August 27, 2007, 5:56am

Yes, I made a mistake here. Thanks, brentl.

After I declared tmp as private, the optimization information is still a little different from what I get without the -mp flag.

    15, Parallel region activated
    17, Parallel loop activated; static block iteration allocation
    19, Generated an alternate loop for the inner loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
    29, Barrier
        Parallel region terminated

Any suggestions?

brentl · September 13, 2007, 6:02pm

Our altcode generator makes decisions based on a number of factors, and being in a parallel region is among them. That explains the differences. If you find it makes a big performance difference, you should let us know. Since the code now vectorizes in both cases, it should be running fairly well.

tangoman · September 19, 2007, 7:32am

Thanks, these two versions run at almost the same speed.