Is there a way to vectorize this routine?

Hi all,

I am tring to optimize the fowllowing routine:

      PROGRAM TEST
      IMPLICIT NONE

      integer nx,ny,nz,nu
      parameter(nx=128)
      parameter(ny=128)
      parameter(nz=128)
      parameter(nu=68)

      real*4 a,b,c
      integer d
      dimension a(nz,ny,nx)
      dimension b(nz,ny,nx)
      dimension c(nu)
      dimension d(nz,ny,nx)

      integer i,j,k
      integer index

      do i=1,nx
         do j=1,ny
            do k=1,nz
               index = d(k,j,i)
               a(k,j,i) = a(k,j,i)+c(index)*b(k,j,i)
            enddo
         enddo
      enddo

      return
      end

I tried -Mvect=sse option. The optimization inforamtion says:
% pgf95 -Mvect=sse -Minfo test.f
test:
22, Unrolled inner loop 8 times
Generated 3 prefetch instructions for this loop

Is there a way to vectorize it? Thanks!

tangoman,

Yes and no. I’ll tell you two ways to vectorize it, but I can’t guarantee that either will be practical. I actually spent some time trying to vectorize something similar, and the best I did was break-even. (On an AMD. I haven’t tried it on an Intel. And I’m not convinced they’re equal for SSE stuff.)

That being said, if someone else knows how to do it more effectively, I’d love to find out.

The problem, as I see it, is that c' is out-of-order relative to a’ and `b’. Stuff only vectorizes well if it’s in order (so you can interact the 4x32-bit registers usefully). Two ways to vectorize it (both of which, I suspect, will contribute too much overhead):

1.) Invert d' to create a map from nu’ to nz,ny,nx'. Then re-order a’ and b' so that they are arranged to correspond with increasing nu’. If you have to explicitly re-order a' and d’ after they’re filled, this is probably hoplessly expensive. However, if you can re-order the initial filling of a' and d’ this can actually work.

2.) ``Explode’’ c' into a nz*ny*nx array, replicating the values as required (nu’ << nznynx). You’re actually doing that on-the-fly with c(index). But if you spend the extra resources pre-constructing it, you can vectorize trivially because a' and b’ and `c’ will all be the same size and the same order. That might even vectorize automatically in the compiler if you did the pre-work.

Option (1) will, quite-likely, require you to use intrinsics or inline assembly. Intrinsics are not too bad, but they’re only included in the C/C++ compilers, so you will have to call a C function to implement them. It’s not as bad as it sounds. I’ve never used inline assembly, so I can’t say much about it, except that I never want to use it.

Good luck,

-Matt

We will vectorize loops with indirect addressing in PGI 7.1. If you can wait a few weeks :) YOu can write it as
a(k,j,i) = c(d(k,j,i))*b(k,j,i)
and it will vectorize. (Actually it will vectorize the way you have it there…)

Thank MattD and Brentl for your kind reply. Looking forward to this feature coming soon. :-)

Another question. Is there a way to vectorize reduction operation? Somehing like this:
[/code]
do i=1, 1024
sum = sum+a(i)
enddo

Thanks!

tangoman,

The current version of pgf90 seems to vectorize that operation automatically. If your version doesn’t vectorize that operation yet, you can also try the SUM(array) function for some improvement (and cleaner code).

Please see the following test code and outputs.

program test

 integer :: ii
 real :: a(1024), sum1, sum2
 double precision :: start, finish, t1, t2

 do ii = 1 , 1024
  a(ii) = sin(2*3.1416*real(ii)/1024.0)
 enddo

 t1 = 0
 do jj = 1 , 1000000
  sum1 = 0

  call cpu_time(start)
  do ii = 1 , 1024
   sum1 = sum1 + a(ii)
  enddo
  call cpu_time(finish)

  t1 = t1 + (finish-start)
 enddo

 write(*,*) 'Loop took ',t1,' microseconds'
 write(*,*) 'Error = ',sum1

 t2 = 0
 do jj = 1 , 1000000
  sum2 = 0

  call cpu_time(start)
  sum2 = sum2 + sum(a)
  call cpu_time(finish)

  t2 = t2 + (finish-start)
 enddo

 write(*,*) 'SUM function took ',t2,' microseconds'
 write(*,*) 'Error = ',sum2

end program

Compilation and output:
[delaquil@head TEMP]$ pgf90 -V

pgf90 7.0-7 64-bit target on x86-64 Linux
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2007, STMicroelectronics, Inc. All Rights Reserved.
[delaquil@head TEMP]$ pgf90 -Mfree -fastsse -Mvect=nosse -Minfo test.f
test:
32, sum reduction inlined
test:
16, Loop unrolled 8 times
32, Loop unrolled 8 times
[delaquil@head TEMP]$ ./a.out
Loop took 4.492819070816040 microseconds
Error = 2.5787554E-04
SUM function took 3.693236351013184 microseconds
Error = 2.5787554E-04
[delaquil@head TEMP]$ pgf90 -Mfree -fastsse -Minfo test.f
test:
32, sum reduction inlined
test:
16, Generated vector sse code for inner loop
Generated 1 prefetch instructions for this loop
32, Generated vector sse code for inner loop
Generated 1 prefetch instructions for this loop
[delaquil@head TEMP]$ ./a.out
Loop took 2.365142345428467 microseconds
Error = 5.6102628E-05
SUM function took 2.222937107086182 microseconds
Error = 5.6102628E-05
[delaquil@head TEMP]$

-Matt

MattD,
Thanks. I used PGI 6.1 on SuSE 9.2 x86-64 Linux.
% pgf90 -V

pgf90 6.1-2 64-bit target on x86-64 Linux
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2005, STMicroelectronics, Inc. All Rights Reserved.

The follwing test code failed to vectorize.

      PROGRAM TEST
      IMPLICIT NONE

      integer nx,ny,nz
      parameter(nx=128)
      parameter(ny=128)
      parameter(nz=128)

      complex a,b,sum
      dimension a(nz,ny,nx)
      dimension b(nz,ny,nx)
      integer i,j,k

      do i=1,nx
         do j=1,ny
            do k=1,nz
               sum = sum+a(k,j,i)*b(k,j,i)
            enddo
         enddo
      enddo

      return
      end

% pgf90 -O2 -Mvect=sse -Minfo test.f
test:
16, Unrolled inner loop 8 times
Generated 1 prefetch instructions for this loop

Any suggestion? Thanks!

tangoman,

I tried the code you posted, and I get the same optimization report that you listed (no vectorization).

However, if I change COMPLEX to REAL (for a, b, and sum), it does vectorize.

So it seems that the compiler does not auto-vectorize complex math in this example. I can’t think offhand why it wouldn’t, though.

You could vectorize it yourself if the performance was really critical. It shouldn’t be that tough for this example (as opposed to the problem you presented in the first post in this thread).

You may also get better performance from using SUM rather than loops. I haven’t timed it for this example, but it’s usually better optomized:

sum1 = sum1 + sum(sum(sum(a*b,1),1),1)

Also, you must rename your `sum’ variable if you want to use the SUM function. Array operations are nice. Use them when possible. (They also tend to have a better chance of auto-vectorizing than explicitly written loops.)

Good luck,

-Matt