MAXVAL with PGI Accelerator

Hello

A) Is there a possibility to search for the maximum value in an array using the PGI Accelerator model?
Code looks like:

MAX_VALUE = maxval(ARRAY(1:1000000))

B) And another question: If I have a loop that is not well suited for the GPU, but still I put the “!$acc region” clause around it and the compiler tells me that the loop is executed sequentially: “!$acc do seq”… is the loop executed on the host or device? Why do I still get performance increase compared to execution without the accelerator region? Does a loop that sequentially executes on the device is faster than if it sequentially executes on the host???

C) Can the compiler and the GPU deal “good” with a loop that looks for example like that:

do i=1,1000000
   A1 = Array(kc,1) + Array(kc,6) + ...
   A2 = Array(kc,2) + ...

   B = 1./A1
   C = B*3.25

   K(i)=B*C*.....

end do

I mean is it bad or ok to generate such “private” variables like A1, B,… that are used later in the loop.
If I would add an index to all of my variables I would exceed the memory of the GPU…

D) Is there a good strategy to reduce memory usage? I did some test cases where I dynamically allocated memory using the “real, dimension(:,:), allocatable, device :: x” clause for some arrays that will be only used temporary during calculation and dont have to be copyied into or out of the device. I explored that if I allocate during execution it is much faster than having a static allocated array (only used local on the GPU). But however, the deallocation of the array on the GPU was the bottleneck and slowed down my performance (wose than running code on CPU).
How can I also fastly deallocate a local array… Otherwise I have to use it for several purposes (overwrite it) which would make my code completely unreadable for others… ;(
Any tipps on that… since I am facing memory space problem…


THANK YOU VERY MUCH IN ADVANCE

Hi elephant,

A) Is there a possibility to search for the maximum value in an array using the PGI Accelerator model?

Yes. The compiler is able to recognize and parallelize most reductions including maxval. Here’s an example:

% cat maxval.f90 
program testmaxval

  real, allocatable, dimension(:) :: arr
  real :: maxv
  integer :: i, N

  N = 10000000
  allocate(arr(N))

!$acc region local(arr)
  do i=1,N
     arr(i) = real(i)/N
  end do
  maxv = maxval(arr)
!$acc end region

  print *, maxv
  deallocate(arr)
end program testmaxval  
    
% pgf90 maxval.f90 -Minfo=accel -ta=nvidia
testmaxval:
     10, Generating local(arr(:))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     11, Loop is parallelizable
         Accelerator kernel generated
         11, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 40 shared, 12 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 3 registers; 40 shared, 12 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 8 shared, 52 constant, 0 local memory bytes; 100% occupancy
     14, Loop is parallelizable
         Accelerator kernel generated
         14, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             Max reduction generated for arr$r      <<<<< HERE Compiler is parallelizing maxval
             CC 1.0 : 6 registers; 1080 shared, 20 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 6 registers; 1080 shared, 20 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 1032 shared, 64 constant, 0 local memory bytes; 100% occupancy
% a.out
    1.000000



“!$acc do seq”… is the loop executed on the host or device?

Device, albeit very slowly since it will use only a single thread.

Why do I still get performance increase compared to execution without the accelerator region?

If this is the only accelerator routine, then I don’t know since I would expect it to run slowly. I would need an example to better understand why.

If there are other parallelized routines that use the same data, then sequential routines can still help improve performance by eliminating the need to copy data to/from the host.

Does a loop that sequentially executes on the device is faster than if it sequentially executes on the host?

Comparing just a single threaded, sequential kernel with a host side counter part, I would expect the host to run much faster. Besides the overhead of copying data the the device, the individual thread processors on the device are usually much slower. Again, I would need an example to understand why you are seeing the contrary.

Can the compiler and the GPU deal “good” with a loop that looks for example like that:

It should.

I mean is it bad or ok to generate such “private” variables like A1, B,… that are used later in the loop. If I would add an index to all of my variables I would exceed the memory of the GPU…

Scalars are privatized by default and in general are a good thing since they can be stored in registers which are quick to access. The only caveat is if you use too many registers, you must reduce the number of threads per block.

D) Is there a good strategy to reduce memory usage?

We just added a new feature in 11.6 which allows for shared memory automatic arrays. This will allow you to define at runtime the size in bytes of the total amount of shared memory, as the third argument of the kernel launch, which is then mapped to automatics in the kernel. Here’s an example:

% cat automatic.cuf 
module m
contains
 attributes(global) subroutine ss1( a, b, k )
  implicit none
  integer, value :: k
  integer, dimension(:,:) :: a, b
  integer(8), shared, dimension(blockdim%x, blockdim%y) :: s1
  integer, shared, dimension(k+1, k) :: s2

  integer ti, tj, i, j

  ti = threadidx%x
  tj = threadidx%y
  i = (blockidx%x-1)*blockdim%x + ti
  j = (blockidx%y-1)*blockdim%y + tj

  s1(ti,tj) = a(i,j)
  s2(ti,tj) = b(i,j)
  call syncthreads()
  a(i,j) = s2(tj,ti)
  b(i,j) = s1(tj,ti)
 end subroutine

end module

program p
 use m
 use cudafor
 implicit none

 integer, dimension(:,:), allocatable, device :: da, db
 integer, dimension(:,:), allocatable :: ha0, hb0, ha1, hb1, haa, hbb
 type(dim3) :: grid,block

 integer :: n, i, j, ierr
 n = 128
 allocate(haa(n,n), hbb(n,n))
 allocate(ha0(n,n), hb0(n,n))
 allocate(ha1(n,n), hb1(n,n))
 allocate(da(n,n), db(n,n))

 do j = 1,n
  do i = 1,n
   haa(i,j) = i*1000 + j
   hbb(i,j) = -i*1000 - j
  enddo
 enddo

 da = haa
 db = hbb

 grid = dim3(16,16,1)
 block = dim3(8,8,1)
 print *, 'calling ss1'

! The third argument, '900' is the size in bytes  
! of the shared memory segment which maps to the
! kernel's shared automatics
 call ss1<<<grid,block,900>>>( da, db, 8 )
 print *, 'back from ss1'
 ierr = cudathreadsynchronize()

 if( ierr .ne. 0 )then
  print *,  'ss1 launch'
  print *,  cudageterrorstring( ierr )
 endif
 print *,'last error is ', ierr

 ha1 = da
 hb1 = db
  
end program

% pgf90 -fast automatic.cuf; a.out
 calling ss1
 back from ss1
 last error is             0

Hope this helps,
Mat

Hi Mat,

Still working with this maxval thing! Can you tell me why this example is not working?

      program main
      
      use accel_lib
      implicit none

      integer :: k, i
      integer, parameter :: kend=1000000
      double precision   :: MAXv
      double precision,allocatable,dimension(:,:) :: Arr
!$acc mirror(Arr)
      double precision,allocatable,dimension(:)   :: MX_GPU
!$acc mirror(MX_GPU)
      
      allocate(Arr(kend,7))
      allocate(MX_GPU(kend))
      
      do i=1,7
         do k=1,kend
            Arr(k,i) = sin(k*1.0)+cos(i*1.0)+2.1000000
         end do
      end do
      
!$acc update device(Arr)
 
 
      
!$acc region
       DO K = 1, KEND
          MX_GPU(K) = 1./Arr(K,7)   
       END DO 
       MAXv = maxval(MX_GPU)
!$acc end region


      print*
      print*,'MaxValue correct =    0.5394028083693713'
      print*,'MaxValue         = ',MAXv
      
      end program main

When I run it on the CPU it works fine. But on the GPU not:

 MaxValue correct =    0.5394028083693713
 MaxValue         =   -6.8642848977706330E-145

Thank you

Hi elephant,

Looks like there is a problem with using a ‘mirror’ array in MAXVAL when contained in a region having multiple kernels. I created a report (TPR#18009) and sent it to our engineers for further investigation.

The work around is to put the MAXVAL in it’s own ACC region or declare MX_GPU as a ‘local’ instead of ‘mirror’.

Example using multiple ACC regions:

% cat maxv2.f90

      program main
     
      use accel_lib
      implicit none

      integer :: k, i
      integer, parameter :: kend=1000000
      double precision   :: MAXv
      double precision,allocatable,dimension(:,:) :: Arr
!$acc mirror(Arr)
      double precision,allocatable,dimension(:)   :: MX_GPU
!$acc mirror(MX_GPU)
     
      allocate(Arr(kend,7))
      allocate(MX_GPU(kend))
     
      do i=1,7
         do k=1,kend
            Arr(k,i) = sin(k*1.0)+cos(i*1.0)+2.1000000
         end do
      end do
      MX_GPU=0.
!$acc update device(Arr)
     
!$acc region 
       DO K = 1, KEND
          MX_GPU(K) = 1./Arr(K,7)   
       END DO
!$acc end region 
!$acc region 
       MAXv = maxval(MX_GPU)
!$acc end region

      print*
      print*,'MaxValue correct =    0.5394028083693713'
      print*,'MaxValue         = ',MAXv
     
      end program main

% pgf90 -ta=nvidia maxv2.f90 -V11.7 
% a.out
 
 MaxValue correct =    0.5394028083693713
 MaxValue         =    0.5394028083693713

Example using a local clause:

% cat maxv1.f90

      program main
     
      use accel_lib
      implicit none

      integer :: k, i
      integer, parameter :: kend=1000000
      double precision   :: MAXv
      double precision,allocatable,dimension(:,:) :: Arr
!$acc mirror(Arr)
      double precision,allocatable,dimension(:)   :: MX_GPU
     
      allocate(Arr(kend,7))
      allocate(MX_GPU(kend))
     
      do i=1,7
         do k=1,kend
            Arr(k,i) = sin(k*1.0)+cos(i*1.0)+2.1000000
         end do
      end do
      MX_GPU=0.
!$acc update device(Arr)
 
!$acc region local(MX_GPU)
       DO K = 1, KEND
          MX_GPU(K) = 1./Arr(K,7)   
       END DO
       MAXv = maxval(MX_GPU)
!$acc end region

      print*
      print*,'MaxValue correct =    0.5394028083693713'
      print*,'MaxValue         = ',MAXv
     
      end program main

% pgf90 -ta=nvidia maxv1.f90 -V11.7
% a.out
 
 MaxValue correct =    0.5394028083693713
 MaxValue         =    0.5394028083693713

Hope this helps,
Mat

Hello,

TPR 18009 has been determined fixed in 13.5, though it may have been
fixed in an earlier release.

Thanks for your problem submission.

regards,
dave