Hi elephant,
A) Is there a possibility to search for the maximum value in an array using the PGI Accelerator model?
Yes. The compiler is able to recognize and parallelize most reductions including maxval. Here’s an example:
% cat maxval.f90
program testmaxval
real, allocatable, dimension(:) :: arr
real :: maxv
integer :: i, N
N = 10000000
allocate(arr(N))
!$acc region local(arr)
do i=1,N
arr(i) = real(i)/N
end do
maxv = maxval(arr)
!$acc end region
print *, maxv
deallocate(arr)
end program testmaxval
% pgf90 maxval.f90 -Minfo=accel -ta=nvidia
testmaxval:
10, Generating local(arr(:))
Generating compute capability 1.0 binary
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
11, Loop is parallelizable
Accelerator kernel generated
11, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
CC 1.0 : 3 registers; 40 shared, 12 constant, 0 local memory bytes; 100% occupancy
CC 1.3 : 3 registers; 40 shared, 12 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 8 registers; 8 shared, 52 constant, 0 local memory bytes; 100% occupancy
14, Loop is parallelizable
Accelerator kernel generated
14, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
Max reduction generated for arr$r <<<<< HERE Compiler is parallelizing maxval
CC 1.0 : 6 registers; 1080 shared, 20 constant, 0 local memory bytes; 100% occupancy
CC 1.3 : 6 registers; 1080 shared, 20 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 8 registers; 1032 shared, 64 constant, 0 local memory bytes; 100% occupancy
% a.out
1.000000
“!$acc do seq”… is the loop executed on the host or device?
Device, albeit very slowly since it will use only a single thread.
Why do I still get performance increase compared to execution without the accelerator region?
If this is the only accelerator routine, then I don’t know since I would expect it to run slowly. I would need an example to better understand why.
If there are other parallelized routines that use the same data, then sequential routines can still help improve performance by eliminating the need to copy data to/from the host.
Does a loop that sequentially executes on the device is faster than if it sequentially executes on the host?
Comparing just a single threaded, sequential kernel with a host side counter part, I would expect the host to run much faster. Besides the overhead of copying data the the device, the individual thread processors on the device are usually much slower. Again, I would need an example to understand why you are seeing the contrary.
Can the compiler and the GPU deal “good” with a loop that looks for example like that:
It should.
I mean is it bad or ok to generate such “private” variables like A1, B,… that are used later in the loop. If I would add an index to all of my variables I would exceed the memory of the GPU…
Scalars are privatized by default and in general are a good thing since they can be stored in registers which are quick to access. The only caveat is if you use too many registers, you must reduce the number of threads per block.
D) Is there a good strategy to reduce memory usage?
We just added a new feature in 11.6 which allows for shared memory automatic arrays. This will allow you to define at runtime the size in bytes of the total amount of shared memory, as the third argument of the kernel launch, which is then mapped to automatics in the kernel. Here’s an example:
% cat automatic.cuf
module m
contains
attributes(global) subroutine ss1( a, b, k )
implicit none
integer, value :: k
integer, dimension(:,:) :: a, b
integer(8), shared, dimension(blockdim%x, blockdim%y) :: s1
integer, shared, dimension(k+1, k) :: s2
integer ti, tj, i, j
ti = threadidx%x
tj = threadidx%y
i = (blockidx%x-1)*blockdim%x + ti
j = (blockidx%y-1)*blockdim%y + tj
s1(ti,tj) = a(i,j)
s2(ti,tj) = b(i,j)
call syncthreads()
a(i,j) = s2(tj,ti)
b(i,j) = s1(tj,ti)
end subroutine
end module
program p
use m
use cudafor
implicit none
integer, dimension(:,:), allocatable, device :: da, db
integer, dimension(:,:), allocatable :: ha0, hb0, ha1, hb1, haa, hbb
type(dim3) :: grid,block
integer :: n, i, j, ierr
n = 128
allocate(haa(n,n), hbb(n,n))
allocate(ha0(n,n), hb0(n,n))
allocate(ha1(n,n), hb1(n,n))
allocate(da(n,n), db(n,n))
do j = 1,n
do i = 1,n
haa(i,j) = i*1000 + j
hbb(i,j) = -i*1000 - j
enddo
enddo
da = haa
db = hbb
grid = dim3(16,16,1)
block = dim3(8,8,1)
print *, 'calling ss1'
! The third argument, '900' is the size in bytes
! of the shared memory segment which maps to the
! kernel's shared automatics
call ss1<<<grid,block,900>>>( da, db, 8 )
print *, 'back from ss1'
ierr = cudathreadsynchronize()
if( ierr .ne. 0 )then
print *, 'ss1 launch'
print *, cudageterrorstring( ierr )
endif
print *,'last error is ', ierr
ha1 = da
hb1 = db
end program
% pgf90 -fast automatic.cuf; a.out
calling ss1
back from ss1
last error is 0
Hope this helps,
Mat