We recently encountered a problem when developing our code. I will post my code example to illustrate the problem. The basis problem is that giving two double precision complex vector, we want to get the product of the individual elements of the two vectors and we have two versions of the code, one is done through direct assignment and the other one is done through dummy variables. In the examples given below, they are really basic programs, this kind of problem is related to a big program which is developed from Gaussian 03. So please be patient if you see anything very simple and maybe naive, you will find out that this will not be so simple in the end. Here is the code

**Version 1**

Program ZMULtst

Implicit Real*8 (a-h,o-z)
Parameter(Nmax=1000)
Complex*16 A(Nmax),B(Nmax),C(Nmax)

Read*,N

Do I=1,N

Read*,Dummy1,Dummy2

A(I)=Dcmplx(Dummy1,Dummy2)

Enddo

Do I=1,N

Read*,Dummy1,Dummy2

B(I)=Dcmplx(Dummy1,Dummy2)

Enddo

Write(6,

*) ‘Vectors A B’*

Do I=1,N

Write(6,) A(I),B(I)

Do I=1,N

Write(6,

Enddo

Call ZMul(A,B,A,N)

Write(6,

*) ‘After multiplication: Result’*

Do I=1,N

Write(6,) A(I)

Do I=1,N

Write(6,

Enddo

End

Subroutine ZMul(A,B,C,N)

Implicit Real

*8 (a-h,o-z)*

Complex16 A(

Complex

*),B(*),C(

*)*

Do I=1,N

C(I)=A(I)

Parameter(Nmax=1000)

Complex16 A(Nmax),B(Nmax),C(Nmax)

Do I=1,N

C(I)=A(I)

*B(I)*

Enddo

Return

Basically, two vectors A and B are assigned values through read statements and passed to a subroutine which will calculate the product of A and B. Notice, that A vectors will be updated and contain the product of A and B finally… I compiled with pgf77 version 7 with option -fast (In version 7, fast will automatically contain the SSE vectorization). And here is the compiling information I got

zmultst:

7, Loop not vectorized: contains call

11, Loop not vectorized: contains call

16, Loop not vectorized: contains call

21, Loop not vectorized: contains call

zmul:

31, Generated vector sse code for inner loop

Generated 2 prefetch instructions for this loop

I tried this code and no surprise, it works.

Here is the other version

Program ZMULtst

Implicit Real8 (a-h,o-z)Enddo

Return

Basically, two vectors A and B are assigned values through read statements and passed to a subroutine which will calculate the product of A and B. Notice, that A vectors will be updated and contain the product of A and B finally… I compiled with pgf77 version 7 with option -fast (In version 7, fast will automatically contain the SSE vectorization). And here is the compiling information I got

zmultst:

7, Loop not vectorized: contains call

11, Loop not vectorized: contains call

16, Loop not vectorized: contains call

21, Loop not vectorized: contains call

zmul:

31, Generated vector sse code for inner loop

Generated 2 prefetch instructions for this loop

I tried this code and no surprise, it works.

Here is the other version

**Version 2**Program ZMULtst

Implicit Real

Parameter(Nmax=1000)

Complex

Read*,N

Do I=1,N

Read*,Dummy1,Dummy2

A(I)=Dcmplx(Dummy1,Dummy2)

Enddo

Do I=1,N

Read*,Dummy1,Dummy2

B(I)=Dcmplx(Dummy1,Dummy2)

Enddo

Write(6,*) ‘Vectors A B’
Do I=1,N
Write(6,*) A(I),B(I)

Enddo

Call ZMul(A,B,A,N)

Write(6,

*) ‘After multiplication: Result’*

Do I=1,N

Write(6,) A(I)

Do I=1,N

Write(6,

Enddo

End

Subroutine ZMul(A,B,C,N)

Implicit Real

*8 (a-h,o-z)*

Complex16 A(

Complex

*),B(*),C(*),Cdummy

Do I=1,N

cdummy=A(I)*B(I)

C(I)=cdummy

Enddo

Return

End

In this version, a cdummy is defined in the subroutine zmul to assigne the value. compile with -fast(SSE turned on), here is the information,

zmultst:

7, Loop not vectorized: contains call

11, Loop not vectorized: contains call

16, Loop not vectorized: contains call

21, Loop not vectorized: contains call

zmul:

31, Unrolled inner loop 4 times

Generated 2 prefetch instructions for this loop

I tried a few different inputs, and I will give two results. First, I give the program two identical vectors with 4 elements (A and B the same, notice they are all complex vectors). i.e,

A B

1.0d0 1.0d0 1.0d0 1.0d0

2.0d0 2.0d0 2.0d0 2.0d0

3.0d0 3.0d0 3.0d0 3.0d0

4.0d0 4.0d0 4.0d0 4.0d0

the correct answer should be

(0.0d0,2.0d0)

(0.0d0,8.0d0)

(0.0d0,18.0d0)

(0.0d0,32.0d0)

however, the dummy version gives the following,

(-1.000000000000000,2.000000000000000)

(-12.00000000000000,8.000000000000000)

(-45.00000000000000,18.00000000000000)

(0.0000000000000000E+000,32.00000000000000)

notice the real part is messed up!.

I also tried to give two vectors with 8 elements

Vectors A B

(1.000000000000000,1.000000000000000) (1.000000000000000,1.000000000000000)

(2.000000000000000,2.000000000000000) (2.000000000000000,2.000000000000000)

(3.000000000000000,3.000000000000000) (3.000000000000000,3.000000000000000)

(4.000000000000000,4.000000000000000) (4.000000000000000,4.000000000000000)

(5.000000000000000,5.000000000000000) (5.000000000000000,5.000000000000000)

(6.000000000000000,6.000000000000000) (6.000000000000000,6.000000000000000)

(7.000000000000000,7.000000000000000) (7.000000000000000,7.000000000000000)

(8.000000000000000,8.000000000000000) (8.000000000000000,8.000000000000000)

After multiplication: Result

(-1.000000000000000,2.000000000000000)

(-12.00000000000000,8.000000000000000)

(-45.00000000000000,18.00000000000000)

(0.0000000000000000E+000,32.00000000000000)

(-225.0000000000000,50.00000000000000)

(-396.0000000000000,72.00000000000000)

(-637.0000000000000,98.00000000000000)

(0.0000000000000000E+000,128.0000000000000)

Again, the real part is messed up and every 4 element is right. I even tried to translate the two programs to assembly language and I noticed some different in the loop inside zmul but since I’m not so familiar with assembly language, I can’t really figure out what is happening there.

In the two simple examples I gave above,SSE is not necessarily needed, however, we are working on developing our program based on Gaussian 03, and we found out that when compiling our program with -fast option (which is used by default in gaussian 03), the same kind of problem pop out. At first, we thought this is a gaussian bug, but after we tried the two programs separately, we are very sure this is a compiler issue. And if you can give some hint, it will be really helpful.

Xiaohu