I see in pgi18cudaforug.pdf that “Beginning in PGI 15.1, the sum, maxval, and minval host intrinsics are overloaded to accept device or managed arrays when the cudafor module is used.”
This is great, but I am puzzled why not all reduction intrinsics are implemented for device arrays.
On the one hand, I can easily implement some of them myself, for example “count”:
!$cuf kernel do <<< *, * >>>
do j = 1, n
   if (a(j) < -1.0) cnt = cnt + 1
end do
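To make the fragment above concrete, here is a minimal self-contained sketch of the same count done with a CUF kernel. The program name, the array size n, the names a and a_d, and the random test data are all illustrative assumptions, not taken from any real code; the host-side count intrinsic is called only to cross-check the device result.

```fortran
program count_on_device
  use cudafor
  implicit none
  integer, parameter :: n = 1000000      ! illustrative size (assumption)
  real, allocatable         :: a(:)      ! host copy
  real, device, allocatable :: a_d(:)    ! device copy
  integer :: j, cnt

  allocate(a(n), a_d(n))
  call random_number(a)
  a = 4.0*a - 2.0                        ! spread test values over [-2, 2)
  a_d = a                                ! host-to-device copy

  ! cnt is a host scalar; the CUF kernel treats the accumulation
  ! as a sum reduction and returns the final value to the host.
  cnt = 0
  !$cuf kernel do <<< *, * >>>
  do j = 1, n
     if (a_d(j) < -1.0) cnt = cnt + 1
  end do

  ! Cross-check against the host intrinsic on the host copy.
  print *, 'device count:', cnt, ' host count:', count(a < -1.0)
end program count_on_device
```

Saved as a .cuf file, this should build with something like pgfortran -Mcuda count.cuf, and the two printed counts should agree.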
On the other hand, other reduction functions such as maxloc seem to be beyond my experience to implement.
I have a similar problem with the transformational intrinsic function “pack”. I would like to use device arrays because managed arrays seem to be very slow.