Offloading vector syntax - offloading using plain standard ISO Fortran

Offloading using the standard language is a major leap forward for writing portable programs, since many codes have lifetimes measured in decades.
While do concurrent works nicely for loops where indexed variables are addressed element by element, it would be beneficial to be able to use array syntax such as A=B+exp(C), where A, B and C are matrices, or masked expressions using where, such as where (Z>0) A=sqrt(Z).


While I don’t have any insight into the future direction of the Fortran standard, nvfortran can auto-parallelize array syntax within a do concurrent loop. So the way to do this now is something like the following, though there’s no guarantee that other compilers will follow suit; they may just offload this and run it sequentially.

% cat test.f90

program foo

  integer i
  real, dimension(:), allocatable :: A, B, C
  allocate(A(1024),B(1024),C(1024))
  B=1
  C=2
  do concurrent(i=1:1)
    A=B+exp(C)
  end do
  print *, A(1:5)

  deallocate(A,B,C)

end program foo
% nvfortran test.f90 -stdpar -Minfo; a.out
foo:
      8, Memory set idiom, loop replaced by call to __c_mset4
      9, Memory set idiom, loop replaced by call to __c_mset4
     11, Generating NVIDIA GPU code
         10, Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(2) ! blockidx%x threadidx%x
         11,   ! blockidx%x threadidx%x auto-collapsed
    8.389056        8.389056        8.389056        8.389056
    8.389056

This is roughly equivalent to using OpenACC’s “kernels” directive with managed memory.

% cat test_acc.f90

program foo

  integer i
  real, dimension(:), allocatable :: A, B, C
  allocate(A(1024),B(1024),C(1024))
  B=1
  C=2
!$acc kernels
  A=B+exp(C)
!$acc end kernels
  print *, A(1:5)

  deallocate(A,B,C)

end program foo
% nvfortran test_acc.f90 -acc -Minfo -gpu=managed; a.out
foo:
      8, Memory set idiom, loop replaced by call to __c_mset4
      9, Memory set idiom, loop replaced by call to __c_mset4
     10, Generating implicit copyout(a(1:1024)) [if not already present]
         Generating implicit copyin(c(1:1024),b(1:1024)) [if not already present]
     11, Loop is parallelizable
         Generating NVIDIA GPU code
         11, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    8.389056        8.389056        8.389056        8.389056
    8.389056
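
For compilers that do not parallelize array syntax inside do concurrent, a portable fallback is to write the element loop explicitly, so that the parallelism is in the do concurrent index itself. The following is only a minimal sketch and has not been compiled here; the program name, the array Z, and its test values are made up for illustration. It covers both the A=B+exp(C) case and the where (Z>0) A=sqrt(Z) case from the question.

```fortran
program elementwise
  implicit none
  integer :: i
  integer, parameter :: n = 1024
  real, dimension(:), allocatable :: A, B, C, Z

  allocate(A(n), B(n), C(n), Z(n))
  B = 1
  C = 2

  ! Hypothetical test data: positive for even i, negative for odd i
  do i = 1, n
    if (mod(i, 2) == 0) then
      Z(i) = real(i)
    else
      Z(i) = -real(i)
    end if
  end do

  ! Element-wise form of A=B+exp(C)
  do concurrent (i = 1:n)
    A(i) = B(i) + exp(C(i))
  end do
  print *, A(1:5)

  ! Element-wise form of where (Z>0) A=sqrt(Z):
  ! elements with Z(i) <= 0 keep their previous value of A(i)
  do concurrent (i = 1:n)
    if (Z(i) > 0) A(i) = sqrt(Z(i))
  end do
  print *, A(1:4)

  deallocate(A, B, C, Z)

end program elementwise
```

With nvfortran this would be built with -stdpar, the same as the first example. Each iteration touches only its own element of A, so the loops meet the do concurrent requirement that iterations be independent.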