"some" Overloaded Reduction Intrinsics

I see in pgi18cudaforug.pdf that “Beginning in PGI 15.1, the sum, maxval, and minval host intrinsics are overloaded to accept device or managed arrays when the cudafor module is used.”

This is great, but I am puzzled as to why the other reduction intrinsics are not also implemented for device arrays.

I can easily implement some of the missing reduction intrinsics myself. For example, to count the array elements that are less than -1:

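cnt = 0   ! integer scalar on the host; the CUF kernel reduces into it
!$cuf kernel do <<< , >>>
do j=1,N
  if (a(j)<-1.0) cnt=cnt+1   ! a is a device array of length N
end do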

Are there similar simple implementations of the “maxloc” or the “pack” function?

No, not yet. It’s quite a bit of work to get these working on the device, so we’ve been prioritizing them as needed. I’ve added an RFE (TPR#26995); the more requests we get for these, the higher we’ll prioritize them.
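
Until such overloads exist, one possible workaround for a 1-D maxloc is a two-pass sketch. It is not a PGI-provided routine, and the module name devmaxloc_m and function name devMaxloc are hypothetical. The first pass uses the overloaded maxval on the device array. The second pass is a CUF-kernel min reduction that finds the smallest index holding that value, which matches maxloc's first-occurrence rule. The sketch assumes the compiler recognizes the conditional min reduction the same way it recognizes the conditional count above.

```fortran
! Hypothetical two-pass maxloc for a 1-D device array (a sketch, not a PGI intrinsic).
module devmaxloc_m
  use cudafor
  implicit none
contains
  integer function devMaxloc(a_d, n)
    integer, intent(in) :: n
    real, device, intent(in) :: a_d(n)
    real :: amax
    integer :: j, idx

    ! Pass 1: maxval is overloaded for device arrays (PGI 15.1 and later).
    amax = maxval(a_d)

    ! Pass 2: smallest index whose value equals the maximum,
    ! computed as a min reduction into the host scalar idx.
    idx = n + 1
    !$cuf kernel do <<< *, * >>>
    do j = 1, n
      if (a_d(j) == amax) idx = min(idx, j)
    end do

    devMaxloc = idx
  end function devMaxloc
end module devmaxloc_m
```

A similar trick does not carry over directly to pack, which needs a prefix scan to compute each selected element's output position.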

-Mat