 # Is there a way to vectorize this routine?

Hi all,

I am tring to optimize the fowllowing routine:

``````      PROGRAM TEST
IMPLICIT NONE

integer nx,ny,nz,nu
parameter(nx=128)
parameter(ny=128)
parameter(nz=128)
parameter(nu=68)

real*4 a,b,c
integer d
dimension a(nz,ny,nx)
dimension b(nz,ny,nx)
dimension c(nu)
dimension d(nz,ny,nx)

integer i,j,k
integer index

do i=1,nx
do j=1,ny
do k=1,nz
index = d(k,j,i)
a(k,j,i) = a(k,j,i)+c(index)*b(k,j,i)
enddo
enddo
enddo

return
end
``````

I tried -Mvect=sse option. The optimization inforamtion says:
% pgf95 -Mvect=sse -Minfo test.f
test:
22, Unrolled inner loop 8 times
Generated 3 prefetch instructions for this loop

Is there a way to vectorize it? Thanks!

tangoman,

Yes and no. I’ll tell you two ways to vectorize it, but I can’t guarantee that either will be practical. I actually spent some time trying to vectorize something similar, and the best I did was break-even. (On an AMD. I haven’t tried it on an Intel. And I’m not convinced they’re equal for SSE stuff.)

That being said, if someone else knows how to do it more effectively, I’d love to find out.

The problem, as I see it, is that `c' is out-of-order relative to `a’ and `b’. Stuff only vectorizes well if it’s in order (so you can interact the 4x32-bit registers usefully). Two ways to vectorize it (both of which, I suspect, will contribute too much overhead):

1.) Invert `d' to create a map from `nu’ to `nz,ny,nx'. Then re-order `a’ and `b' so that they are arranged to correspond with increasing `nu’. If you have to explicitly re-order `a' and `d’ after they’re filled, this is probably hoplessly expensive. However, if you can re-order the initial filling of `a' and `d’ this can actually work.

2.) ``Explode’’ `c' into a nz*ny*nx array, replicating the values as required (`nu’ << nznynx). You’re actually doing that on-the-fly with c(index). But if you spend the extra resources pre-constructing it, you can vectorize trivially because `a' and `b’ and `c’ will all be the same size and the same order. That might even vectorize automatically in the compiler if you did the pre-work.

Option (1) will, quite-likely, require you to use intrinsics or inline assembly. Intrinsics are not too bad, but they’re only included in the C/C++ compilers, so you will have to call a C function to implement them. It’s not as bad as it sounds. I’ve never used inline assembly, so I can’t say much about it, except that I never want to use it.

Good luck,

-Matt

We will vectorize loops with indirect addressing in PGI 7.1. If you can wait a few weeks :) YOu can write it as
a(k,j,i) = c(d(k,j,i))*b(k,j,i)
and it will vectorize. (Actually it will vectorize the way you have it there…)

Thank MattD and Brentl for your kind reply. Looking forward to this feature coming soon. :-)

Another question. Is there a way to vectorize reduction operation? Somehing like this:
[/code]
do i=1, 1024
sum = sum+a(i)
enddo

``````Thanks!
``````

tangoman,

The current version of pgf90 seems to vectorize that operation automatically. If your version doesn’t vectorize that operation yet, you can also try the SUM(array) function for some improvement (and cleaner code).

Please see the following test code and outputs.

``````program test

integer :: ii
real :: a(1024), sum1, sum2
double precision :: start, finish, t1, t2

do ii = 1 , 1024
a(ii) = sin(2*3.1416*real(ii)/1024.0)
enddo

t1 = 0
do jj = 1 , 1000000
sum1 = 0

call cpu_time(start)
do ii = 1 , 1024
sum1 = sum1 + a(ii)
enddo
call cpu_time(finish)

t1 = t1 + (finish-start)
enddo

write(*,*) 'Loop took ',t1,' microseconds'
write(*,*) 'Error = ',sum1

t2 = 0
do jj = 1 , 1000000
sum2 = 0

call cpu_time(start)
sum2 = sum2 + sum(a)
call cpu_time(finish)

t2 = t2 + (finish-start)
enddo

write(*,*) 'SUM function took ',t2,' microseconds'
write(*,*) 'Error = ',sum2

end program
``````

Compilation and output:

pgf90 7.0-7 64-bit target on x86-64 Linux
[delaquil@head TEMP]\$ pgf90 -Mfree -fastsse -Mvect=nosse -Minfo test.f
test:
32, sum reduction inlined
test:
16, Loop unrolled 8 times
32, Loop unrolled 8 times
Loop took 4.492819070816040 microseconds
Error = 2.5787554E-04
SUM function took 3.693236351013184 microseconds
Error = 2.5787554E-04
[delaquil@head TEMP]\$ pgf90 -Mfree -fastsse -Minfo test.f
test:
32, sum reduction inlined
test:
16, Generated vector sse code for inner loop
Generated 1 prefetch instructions for this loop
32, Generated vector sse code for inner loop
Generated 1 prefetch instructions for this loop
Loop took 2.365142345428467 microseconds
Error = 5.6102628E-05
SUM function took 2.222937107086182 microseconds
Error = 5.6102628E-05

-Matt

MattD,
Thanks. I used PGI 6.1 on SuSE 9.2 x86-64 Linux.
% pgf90 -V

pgf90 6.1-2 64-bit target on x86-64 Linux

The follwing test code failed to vectorize.

``````      PROGRAM TEST
IMPLICIT NONE

integer nx,ny,nz
parameter(nx=128)
parameter(ny=128)
parameter(nz=128)

complex a,b,sum
dimension a(nz,ny,nx)
dimension b(nz,ny,nx)
integer i,j,k

do i=1,nx
do j=1,ny
do k=1,nz
sum = sum+a(k,j,i)*b(k,j,i)
enddo
enddo
enddo

return
end
``````

% pgf90 -O2 -Mvect=sse -Minfo test.f
test:
16, Unrolled inner loop 8 times
Generated 1 prefetch instructions for this loop

Any suggestion? Thanks!

tangoman,

I tried the code you posted, and I get the same optimization report that you listed (no vectorization).

However, if I change COMPLEX to REAL (for a, b, and sum), it does vectorize.

So it seems that the compiler does not auto-vectorize complex math in this example. I can’t think offhand why it wouldn’t, though.

You could vectorize it yourself if the performance was really critical. It shouldn’t be that tough for this example (as opposed to the problem you presented in the first post in this thread).

You may also get better performance from using SUM rather than loops. I haven’t timed it for this example, but it’s usually better optomized:

sum1 = sum1 + sum(sum(sum(a*b,1),1),1)

Also, you must rename your `sum’ variable if you want to use the SUM function. Array operations are nice. Use them when possible. (They also tend to have a better chance of auto-vectorizing than explicitly written loops.)

Good luck,

-Matt