SSE option in pgf77 7.0 version

xl5 · November 4, 2008, 4:21pm

We recently encountered a problem when developing our code. I will post my code example to illustrate the problem. The basis problem is that giving two double precision complex vector, we want to get the product of the individual elements of the two vectors and we have two versions of the code, one is done through direct assignment and the other one is done through dummy variables. In the examples given below, they are really basic programs, this kind of problem is related to a big program which is developed from Gaussian 03. So please be patient if you see anything very simple and maybe naive, you will find out that this will not be so simple in the end. Here is the code
Version 1
Program ZMULtst
Implicit Real8 (a-h,o-z)
Parameter(Nmax=1000)
Complex16 A(Nmax),B(Nmax),C(Nmax)
Read*,N
Do I=1,N
Read*,Dummy1,Dummy2
A(I)=Dcmplx(Dummy1,Dummy2)
Enddo
Do I=1,N
Read*,Dummy1,Dummy2
B(I)=Dcmplx(Dummy1,Dummy2)
Enddo
Write(6,) ‘Vectors A B’
Do I=1,N
Write(6,) A(I),B(I)
Enddo
Call ZMul(A,B,A,N)
Write(6,) ‘After multiplication: Result’
Do I=1,N
Write(6,) A(I)
Enddo
End
Subroutine ZMul(A,B,C,N)
Implicit Real8 (a-h,o-z)
Complex16 A(),B(),C()
Do I=1,N
C(I)=A(I)B(I)
Enddo
Return
Basically, two vectors A and B are assigned values through read statements and passed to a subroutine which will calculate the product of A and B. Notice, that A vectors will be updated and contain the product of A and B finally… I compiled with pgf77 version 7 with option -fast (In version 7, fast will automatically contain the SSE vectorization). And here is the compiling information I got
zmultst:
7, Loop not vectorized: contains call
11, Loop not vectorized: contains call
16, Loop not vectorized: contains call
21, Loop not vectorized: contains call
zmul:
31, Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
I tried this code and no surprise, it works.
Here is the other version
Version 2
Program ZMULtst
Implicit Real8 (a-h,o-z)
Parameter(Nmax=1000)
Complex16 A(Nmax),B(Nmax),C(Nmax)

Read*,N
Do I=1,N
Read*,Dummy1,Dummy2
A(I)=Dcmplx(Dummy1,Dummy2)
Enddo
Do I=1,N
Read*,Dummy1,Dummy2
B(I)=Dcmplx(Dummy1,Dummy2)
Enddo
Write(6,) ‘Vectors A B’
Do I=1,N
Write(6,) A(I),B(I)
Enddo
Call ZMul(A,B,A,N)
Write(6,) ‘After multiplication: Result’
Do I=1,N
Write(6,) A(I)
Enddo
End

Subroutine ZMul(A,B,C,N)
Implicit Real8 (a-h,o-z)
Complex16 A(),B(),C(*),Cdummy

Do I=1,N
cdummy=A(I)*B(I)
C(I)=cdummy
Enddo
Return
End

In this version, a cdummy is defined in the subroutine zmul to assigne the value. compile with -fast(SSE turned on), here is the information,
zmultst:
7, Loop not vectorized: contains call
11, Loop not vectorized: contains call
16, Loop not vectorized: contains call
21, Loop not vectorized: contains call
zmul:
31, Unrolled inner loop 4 times
Generated 2 prefetch instructions for this loop

I tried a few different inputs, and I will give two results. First, I give the program two identical vectors with 4 elements (A and B the same, notice they are all complex vectors). i.e,
A B
1.0d0 1.0d0 1.0d0 1.0d0
2.0d0 2.0d0 2.0d0 2.0d0
3.0d0 3.0d0 3.0d0 3.0d0
4.0d0 4.0d0 4.0d0 4.0d0
the correct answer should be
(0.0d0,2.0d0)
(0.0d0,8.0d0)
(0.0d0,18.0d0)
(0.0d0,32.0d0)
however, the dummy version gives the following,
(-1.000000000000000,2.000000000000000)
(-12.00000000000000,8.000000000000000)
(-45.00000000000000,18.00000000000000)
(0.0000000000000000E+000,32.00000000000000)
notice the real part is messed up!.
I also tried to give two vectors with 8 elements
Vectors A B
(1.000000000000000,1.000000000000000) (1.000000000000000,1.000000000000000)
(2.000000000000000,2.000000000000000) (2.000000000000000,2.000000000000000)
(3.000000000000000,3.000000000000000) (3.000000000000000,3.000000000000000)
(4.000000000000000,4.000000000000000) (4.000000000000000,4.000000000000000)
(5.000000000000000,5.000000000000000) (5.000000000000000,5.000000000000000)
(6.000000000000000,6.000000000000000) (6.000000000000000,6.000000000000000)
(7.000000000000000,7.000000000000000) (7.000000000000000,7.000000000000000)
(8.000000000000000,8.000000000000000) (8.000000000000000,8.000000000000000)
After multiplication: Result
(-1.000000000000000,2.000000000000000)
(-12.00000000000000,8.000000000000000)
(-45.00000000000000,18.00000000000000)
(0.0000000000000000E+000,32.00000000000000)
(-225.0000000000000,50.00000000000000)
(-396.0000000000000,72.00000000000000)
(-637.0000000000000,98.00000000000000)
(0.0000000000000000E+000,128.0000000000000)

Again, the real part is messed up and every 4 element is right. I even tried to translate the two programs to assembly language and I noticed some different in the loop inside zmul but since I’m not so familiar with assembly language, I can’t really figure out what is happening there.
In the two simple examples I gave above,SSE is not necessarily needed, however, we are working on developing our program based on Gaussian 03, and we found out that when compiling our program with -fast option (which is used by default in gaussian 03), the same kind of problem pop out. At first, we thought this is a gaussian bug, but after we tried the two programs separately, we are very sure this is a compiler issue. And if you can give some hint, it will be really helpful.

Xiaohu

MatColgrove · November 4, 2008, 5:09pm

Hi Xiaohu,

It’s illegal to pass an array twice to a subroutine and then have the subroutine modify the array. You call ZMul with A both as an input and output, causing undefined behavior since the compiler is allowed to assume that the two arrays, A and C, are disjoint. To fix, change the ZMul call to use ‘C’ and print ‘C’ as the result.

Example with your original code:

grandcanyon:/tmp% v2.out < in.txt
 Vectors A B
 (1.000000000000000,1.000000000000000) (1.000000000000000,1.000000000000000)
 (2.000000000000000,2.000000000000000) (2.000000000000000,2.000000000000000)
 (3.000000000000000,3.000000000000000) (3.000000000000000,3.000000000000000)
 (4.000000000000000,4.000000000000000) (4.000000000000000,4.000000000000000)
 After multiplication: Result
 (-1.000000000000000,2.000000000000000)
 (-12.00000000000000,8.000000000000000)
 (-45.00000000000000,18.00000000000000)
 (-112.0000000000000,32.00000000000000)

After I update the code to use ‘C’ instead of ‘A’:

grandcanyon:/tmp% cat v2.f
        Program ZMULtst
        Implicit Real*8 (a-h,o-z)
        Parameter(Nmax=1000)
        Complex*16 A(Nmax),B(Nmax),C(Nmax)
        Read*,N
        Do I=1,N
                Read*,Dummy1,Dummy2
                A(I)=Dcmplx(Dummy1,Dummy2)
        Enddo
        Do I=1,N
                Read*,Dummy1,Dummy2
                B(I)=Dcmplx(Dummy1,Dummy2)
        Enddo
        Write(6,*) 'Vectors A B'
        Do I=1,N
                Write(6,*) A(I),B(I)
        Enddo
        Call ZMul(A,B,C,N)
        Write(6,*) 'After multiplication: Result'
        Do I=1,N
                Write(6,*) C(I)
        Enddo
        End

        Subroutine ZMul(A,B,C,N)
        Implicit Real*8 (a-h,o-z)
        Complex*16 A(*),B(*),C(*), Cdummy

        Do I=1,N
                 cdummy=A(I)*B(I)
                C(I)=cdummy
        Enddo
        Return
        END
grandcanyon:/tmp% pgf77 -fast v2.f -Minfo=all -Mneginfo=all -o v2.out
zmultst:
    6, Loop not vectorized/parallelized: contains call
   10, Loop not vectorized/parallelized: contains call
   15, Loop not vectorized/parallelized: contains call
   20, Loop not vectorized/parallelized: contains call
zmul:
   29, Unrolled inner loop 4 times
       Generated 2 prefetch instructions for this loop
grandcanyon:/tmp% v2.out < in.txt
 Vectors A B
 (1.000000000000000,1.000000000000000) (1.000000000000000,1.000000000000000)
 (2.000000000000000,2.000000000000000) (2.000000000000000,2.000000000000000)
 (3.000000000000000,3.000000000000000) (3.000000000000000,3.000000000000000)
 (4.000000000000000,4.000000000000000) (4.000000000000000,4.000000000000000)
 After multiplication: Result
 (0.0000000000000000E+000,2.000000000000000)
 (0.0000000000000000E+000,8.000000000000000)
 (0.0000000000000000E+000,18.00000000000000)
 (0.0000000000000000E+000,32.00000000000000)

Hope this helps,
Mat

xl5 · November 4, 2008, 5:17pm

Hi, Mat,
Passing the same pointer to the subroutine is not illegal. It’s just like a=3, b=4, if you say a=a*b, then a=12, the value will only be updated. In fact, in Gaussian 03, there are a lot of cases where the same pointer is passed to a subroutine, and there are no problem. It also seems that the kind of problem occurs when the arrays are complex, one of my colleague has tried real double precision, with and without dummy variable, they both work. I suspect the flag -fast(SSE) is causing the problem. In the dummy code, if you try to compile without -fast(SSE), the result will be correct.

xl5 · November 4, 2008, 5:28pm

Mat,
I forgot to mention on point. I said -fast will cause the problem in the dummy variable version. Specifically, the problem pops out when -Mvect=sse is specified. Again, regarding to your replace A array with C array, you can see that in the first version with direct assignment, even A and C array are the same pointer, compiling with and without -Mvect=sse will always give right answer. This problem is only present with the dummy variable version and with SSE turned on and with complex arrays. Since as I mentioned in the first post, the compiler gives information that loop is unrolled with -Mvect=sse turned on. But in the direct assignment, with -Mvect=sse turned on, there is a specific information about sse code has been generated.
Hope this helps to clarify the problem.

Xiaohu

MatColgrove · November 4, 2008, 6:11pm

Hi Xiaohu,

It is illegal to associate two dummy arguments with the same actual argument and then have the dummy arguments be “defined” (i.e. assignment) (see http://www.fortran.com/fortran/F77_std/rjcnf0001-sh-15.html#sh-15.9.3.6).

Gaussian 03, as well as other applications such as GAMESS, do violate the Fortran standard in this regard. Gaussian is aware of the issue and hence compiles these subroutines at low optimization (-O1).

Hope this helps,
Mat

xl5 · November 4, 2008, 6:27pm

Mat,
Thanks for your information. In fact, Gaussian is compiled by default using -fast option which is -O2. Before pgi 7.0, -fast means -O2 -Munroll=c:1 -Mnoframe -Mlre. but in pgi version 7.0 or later, -fast adds -Mvect=sse. If this problem is related to referencing the argument A and C to the same argument, but then why real double precision works and also specifying -Mvect=nosse also works?

Xiaohu

MatColgrove · November 4, 2008, 9:32pm

Hi Xiaohu,

There is no guarantee that a real double precision would work either. It happens to be ok in this case.

For complex arithmetic, C = A*B breaks-up into multiple calculations for the real and imaginary portions:

Cr = Ar * Br - Ai * Bi
Ci = Ar * Bi + Ai * Br

But when A and C are the same array, this can turn into

Cr = Ar * Br - Ai * Bi
Ci = Cr * Bi + Ai * Br

depending upon how the intermediary calculations are being stored.

Mat