Offloading vector syntax - offloading using plain standard ISO Fortran

olews · November 24, 2021, 6:35am

Offloading using only the standard language is a major leap forward for writing portable programs, since many codes have lifetimes measured in decades.
While do concurrent works nicely for loops where indexed variables are addressed element by element, it would be beneficial to be able to use vector (array) syntax such as A=B+exp(C), where A, B and C are matrices, or masked expressions using where, such as where (Z>0) A=sqrt(Z).

MatColgrove · November 24, 2021, 5:31pm

While I don’t have any insight into the future direction of the Fortran standard, nvfortran can auto-parallelize array syntax within a do concurrent loop. So the way to do this now is something like the following, though there is no guarantee that other compilers will follow suit; they may just offload this sequentially.

% cat test.f90

program foo

  integer i
  real, dimension(:), allocatable :: A, B, C
  allocate(A(1024),B(1024),C(1024))
  B=1
  C=2
  do concurrent(i=1:1)
    A=B+exp(C)
  end do
  print *, A(1:5)

  deallocate(A,B,C)

end program foo
% nvfortran test.f90 -stdpar -Minfo; a.out
foo:
      8, Memory set idiom, loop replaced by call to __c_mset4
      9, Memory set idiom, loop replaced by call to __c_mset4
     11, Generating NVIDIA GPU code
         10, Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(2) ! blockidx%x threadidx%x
         11,   ! blockidx%x threadidx%x auto-collapsed
    8.389056        8.389056        8.389056        8.389056
    8.389056

This is roughly equivalent to using OpenACC’s “kernels” directive with managed memory.

% cat test_acc.f90

program foo

  integer i
  real, dimension(:), allocatable :: A, B, C
  allocate(A(1024),B(1024),C(1024))
  B=1
  C=2
!$acc kernels
  A=B+exp(C)
!$acc end kernels
  print *, A(1:5)

  deallocate(A,B,C)

end program foo
% nvfortran test_acc.f90 -acc -Minfo -gpu=managed; a.out
foo:
      8, Memory set idiom, loop replaced by call to __c_mset4
      9, Memory set idiom, loop replaced by call to __c_mset4
     10, Generating implicit copyout(a(1:1024)) [if not already present]
         Generating implicit copyin(c(1:1024),b(1:1024)) [if not already present]
     11, Loop is parallelizable
         Generating NVIDIA GPU code
         11, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    8.389056        8.389056        8.389056        8.389056
    8.389056
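
For the where (Z>0) A=sqrt(Z) case from the original question, which the examples above do not cover, one portable fallback is to write the mask as an if inside an element-by-element do concurrent loop. The following is only a minimal sketch: the program name, array size and test data are made up for illustration, and it assumes the same nvfortran -stdpar build as above. Whether a where construct placed inside a single-iteration do concurrent would also be auto-parallelized is not confirmed here.

```fortran
program where_sketch
  implicit none
  integer :: i
  integer, parameter :: n = 1024        ! hypothetical problem size
  real, dimension(:), allocatable :: A, Z

  allocate(A(n), Z(n))

  ! Hypothetical test data: odd indices negative, even indices positive
  do i = 1, n
    Z(i) = real(i) * merge(1.0, -1.0, mod(i, 2) == 0)
  end do
  A = 0

  ! Element-wise equivalent of: where (Z > 0) A = sqrt(Z)
  ! Each iteration touches only A(i) and Z(i), so iterations are independent.
  do concurrent (i = 1:n)
    if (Z(i) > 0) A(i) = sqrt(Z(i))
  end do

  print *, A(1:4)

  deallocate(A, Z)
end program where_sketch
```

Elements where the mask is false keep their previous value, which matches the behaviour of the where statement. It can be built the same way as test.f90 above (nvfortran with -stdpar -Minfo) to check what the compiler reports for the loop.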
