I see in pgi18cudaforug.pdf that “Beginning in PGI 15.1, the sum, maxval, and minval host intrinsics are overloaded to accept device or managed arrays when the cudafor module is used.”
This is great, but I am puzzled as to why the other reduction intrinsics are not implemented for device arrays.
On the one hand, I can easily implement some of the missing reduction intrinsics myself. For example, to count the array elements that are less than -1:
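cnt = 0   ! reset the counter before the reduction
!$cuf kernel do <<<*,*>>>
do j = 1, N
   ! a is a device array; cnt is a host scalar that the compiler reduces
   if (a(j) < -1.0) cnt = cnt + 1
end do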
On the other hand, are there similarly simple implementations of the “maxloc” or “pack” intrinsics?