We recently encountered a problem when developing our code. I will post my code example to illustrate the problem. The basis problem is that giving two double precision complex vector, we want to get the product of the individual elements of the two vectors and we have two versions of the code, one is done through direct assignment and the other one is done through dummy variables. In the examples given below, they are really basic programs, this kind of problem is related to a big program which is developed from Gaussian 03. So please be patient if you see anything very simple and maybe naive, you will find out that this will not be so simple in the end. Here is the code
Version 1
Program ZMULtst
Implicit Real8 (a-h,o-z)
Parameter(Nmax=1000)
Complex16 A(Nmax),B(Nmax),C(Nmax)
Read*,N
Do I=1,N
Read*,Dummy1,Dummy2
A(I)=Dcmplx(Dummy1,Dummy2)
Enddo
Do I=1,N
Read*,Dummy1,Dummy2
B(I)=Dcmplx(Dummy1,Dummy2)
Enddo
Write(6,) ‘Vectors A B’
Do I=1,N
Write(6,) A(I),B(I)
Enddo
Call ZMul(A,B,A,N)
Write(6,) ‘After multiplication: Result’
Do I=1,N
Write(6,) A(I)
Enddo
End
Subroutine ZMul(A,B,C,N)
Implicit Real8 (a-h,o-z)
Complex16 A(),B(),C()
Do I=1,N
C(I)=A(I)B(I)
Enddo
Return
Basically, two vectors A and B are assigned values through read statements and passed to a subroutine which will calculate the product of A and B. Notice, that A vectors will be updated and contain the product of A and B finally… I compiled with pgf77 version 7 with option -fast (In version 7, fast will automatically contain the SSE vectorization). And here is the compiling information I got
zmultst:
7, Loop not vectorized: contains call
11, Loop not vectorized: contains call
16, Loop not vectorized: contains call
21, Loop not vectorized: contains call
zmul:
31, Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
I tried this code and no surprise, it works.
Here is the other version
Version 2
Program ZMULtst
Implicit Real8 (a-h,o-z)
Parameter(Nmax=1000)
Complex16 A(Nmax),B(Nmax),C(Nmax)
Read*,N
Do I=1,N
Read*,Dummy1,Dummy2
A(I)=Dcmplx(Dummy1,Dummy2)
Enddo
Do I=1,N
Read*,Dummy1,Dummy2
B(I)=Dcmplx(Dummy1,Dummy2)
Enddo
Write(6,) ‘Vectors A B’
Do I=1,N
Write(6,) A(I),B(I)
Enddo
Call ZMul(A,B,A,N)
Write(6,) ‘After multiplication: Result’
Do I=1,N
Write(6,) A(I)
Enddo
End
Subroutine ZMul(A,B,C,N)
Implicit Real8 (a-h,o-z)
Complex16 A(),B(),C(*),Cdummy
Do I=1,N
cdummy=A(I)*B(I)
C(I)=cdummy
Enddo
Return
End
In this version, a cdummy is defined in the subroutine zmul to assigne the value. compile with -fast(SSE turned on), here is the information,
zmultst:
7, Loop not vectorized: contains call
11, Loop not vectorized: contains call
16, Loop not vectorized: contains call
21, Loop not vectorized: contains call
zmul:
31, Unrolled inner loop 4 times
Generated 2 prefetch instructions for this loop
I tried a few different inputs, and I will give two results. First, I give the program two identical vectors with 4 elements (A and B the same, notice they are all complex vectors). i.e,
A B
1.0d0 1.0d0 1.0d0 1.0d0
2.0d0 2.0d0 2.0d0 2.0d0
3.0d0 3.0d0 3.0d0 3.0d0
4.0d0 4.0d0 4.0d0 4.0d0
the correct answer should be
(0.0d0,2.0d0)
(0.0d0,8.0d0)
(0.0d0,18.0d0)
(0.0d0,32.0d0)
however, the dummy version gives the following,
(-1.000000000000000,2.000000000000000)
(-12.00000000000000,8.000000000000000)
(-45.00000000000000,18.00000000000000)
(0.0000000000000000E+000,32.00000000000000)
notice the real part is messed up!.
I also tried to give two vectors with 8 elements
Vectors A B
(1.000000000000000,1.000000000000000) (1.000000000000000,1.000000000000000)
(2.000000000000000,2.000000000000000) (2.000000000000000,2.000000000000000)
(3.000000000000000,3.000000000000000) (3.000000000000000,3.000000000000000)
(4.000000000000000,4.000000000000000) (4.000000000000000,4.000000000000000)
(5.000000000000000,5.000000000000000) (5.000000000000000,5.000000000000000)
(6.000000000000000,6.000000000000000) (6.000000000000000,6.000000000000000)
(7.000000000000000,7.000000000000000) (7.000000000000000,7.000000000000000)
(8.000000000000000,8.000000000000000) (8.000000000000000,8.000000000000000)
After multiplication: Result
(-1.000000000000000,2.000000000000000)
(-12.00000000000000,8.000000000000000)
(-45.00000000000000,18.00000000000000)
(0.0000000000000000E+000,32.00000000000000)
(-225.0000000000000,50.00000000000000)
(-396.0000000000000,72.00000000000000)
(-637.0000000000000,98.00000000000000)
(0.0000000000000000E+000,128.0000000000000)
Again, the real part is messed up and every 4 element is right. I even tried to translate the two programs to assembly language and I noticed some different in the loop inside zmul but since I’m not so familiar with assembly language, I can’t really figure out what is happening there.
In the two simple examples I gave above,SSE is not necessarily needed, however, we are working on developing our program based on Gaussian 03, and we found out that when compiling our program with -fast option (which is used by default in gaussian 03), the same kind of problem pop out. At first, we thought this is a gaussian bug, but after we tried the two programs separately, we are very sure this is a compiler issue. And if you can give some hint, it will be really helpful.
Xiaohu