Hi Mat,
Thanks for the explanations.
I went through your answer and prepared a 2nd iteration. I didn't get the terminology when you said:
> module device scalars. If they are read-only then put them in constant memory. If you do need to write to them from the device, then you'll need to worry about synchronization.
Can you point out which part of the 1st iteration (or the 2nd below, whichever) relates to module device scalars? By scalars, do you mean single-value variables such as constvarN and derivedvalN (in contrast to arrays and derived data types)?
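For reference, here is how I currently read those two terms; a minimal sketch with made-up names, so please correct me if this is not what you meant:

module scalars_sketch
  integer, constant :: ro_scalar   ! read-only in kernels: lives in constant memory
  integer, device   :: rw_scalar   ! kernels may write to it: needs synchronization care
end module scalars_sketch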
The 2nd iteration is below. Please suggest whether the attributes and assignments are proper.
! This is pseudocode that resembles the Fortran implementation of an algorithm.
! It is meant to help me understand CUDA functionality and develop a CUDA Fortran version of the code.
! val = func(args) is used in the mathematical sense:
! it could be anything from a binary operation among args to a function/subroutine call that sets the value of val.
! N is used in the algebraic sense: N can be any positive integer, so constvarN could be constvar1, constvar34, etc.
module const
  integer, constant :: constvar1, constvar2, constvarN
  ! (could as well be real/complex type in the rest of this code)
  !.
  !.
end module const
module derivednum
  use const
  integer :: derivedval1, derivedval2, derivedvalN
  integer, dimension(constvarN) :: derivedarrN
contains
  ! the assignments are executable statements, so they must live in a
  ! procedure (called once at startup), not in the specification part
  subroutine init_derived
    derivedval1 = func( .., constvarN, ..)
    derivedarrN = func( .., constvarN, ..)
    !.
    !.
  end subroutine init_derived
end module derivednum
module gpuparsection
  use const
  use derivednum
  integer :: furtherderivedvals
  integer, device :: devderivedvalN
  integer, device, dimension(constvarN) :: devderivedarrN
contains
  attributes(global) subroutine dev_kernel( args, devderivedvalN, devderivedarrN )
    integer, value :: args
    integer :: devderivedvalN
    integer, dimension(constvarN) :: devderivedarrN
    ! kernel has 3-6 levels of not-tightly-nested loops;
    ! at different stages, calculations depend on:
    !   constvar1, constvar2, ..., constvarN
    !   devderivedvalN
    !   devderivedarrN
    !   furtherderivedvals
  end subroutine dev_kernel
  subroutine callingroutine
    integer :: args
    !.
    !.
    derivedval2 = func( constvar1, .., constvarN )
    devderivedvalN = derivedval2   ! host-to-device copy by simple assignment
    devderivedarrN = derivedarrN   ! host-to-device copy by simple assignment
    ! grid and block below are placeholders for the launch configuration,
    ! which the chevrons must specify
    call dev_kernel<<< grid, block >>>( args, devderivedvalN, devderivedarrN )
  end subroutine callingroutine
end module gpuparsection
program main
  use const
  use derivednum
  use gpuparsection
  call init_derived   ! set the host-side derived values once
  call callingroutine
end program main
In particular, as a first issue, subroutine dev_kernel needs to make use of a number of scalars (in the sense I described earlier in this post), arrays etc.: constvar1, constvar2, ..., constvarN; devderivedvalN; devderivedarrN; furtherderivedvals.
In fact, our code will have a few tens of constvar, derivedval and derivedarr variables. Copying them to device-attributed variables and/or passing them as arguments to subroutine dev_kernel will be lengthy and, more importantly, untidy. And I suspect that so many discrete copies would make the overall runtime slower than the non-CUDA code!
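One pattern I am considering to avoid the clutter (a sketch of my own, with made-up type and variable names, so please correct me if it is off): gather the scalars into a plain derived type so that a single assignment moves them all at once, and rely on module device data being visible inside kernels of the same module without argument passing:

module gpupar_sketch
  type params                      ! plain components only: no pointers/allocatables
    integer :: derivedval1, derivedval2, derivedvalN
  end type params
  type(params)         :: hostp    ! filled on the host
  type(params), device :: devp     ! devp = hostp should be one transfer
contains
  attributes(global) subroutine dev_kernel2( args )
    integer, value :: args
    ! devp (and module constant data) should be usable here directly,
    ! without appearing in the argument list - if I read the
    ! CUDA Fortran rules correctly
  end subroutine dev_kernel2
end module gpupar_sketch

The host would then do devp = hostp once before the launch. Would that collapse the tens of discrete copies into one transfer, or am I misreading how module device data and derived-type assignment work?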
As a second issue, the prospective subroutine dev_kernel will have up to 6 or 8 loosely nested do loops. That is, many scalars and some arrays need to be calculated before the next level of do loop starts, and that level then needs those values. This being the scenario, I am again worried about the speed-up of the CUDA version of the code against the original Fortran one.
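To make the dependency pattern concrete, here is a minimal sketch (the stage names are made up). The only way I can see to get a device-wide synchronization between stages is to split them into separate kernels, since launches on the same stream should execute in order:

subroutine staged_version
  ! stage 1 fills the intermediate values that stage 2 reads
  call stage1_kernel<<< grid, block >>>( devderivedarrN )
  ! same stream, so this should not start until stage 1 has finished
  call stage2_kernel<<< grid, block >>>( devderivedarrN )
end subroutine staged_version

But launching a kernel per stage, inside the outer loops, sounds like the launch overhead could eat whatever speed-up the inner loops gain. Is that the right mental model, or is there a cheaper way to order the stages?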
So, beyond the sketches above, I am wondering what your/PGI's take on these two issues is.