With 10.5's ability for accelerator regions to access CUDA Fortran device data, I was wondering: is there a listing of which reductions have been implemented in the Accelerator model?
I know of one, sum, from experience and forum posts, but I was wondering what other reductions we should be on the lookout to exploit now.
I thought you would find this interesting. But first, our compiler manager wanted me to point out that mixing CUDA Fortran and the PGI Accelerator model is not officially supported. He let it leak into 10.5 more as a beta feature for advanced users such as yourself to experiment with. I personally really like it, since I can now manage my data using CUDA Fortran but have the compiler write my kernels (via ACC). Though, there are a lot of open questions that need to be fleshed out before we can say it's officially supported.
Any code that can be accelerated could be used. Specifically for reductions, the sum, minval, maxval, and product intrinsics will all work and use optimized versions. What actually happens is that these intrinsics get inlined (as a loop), and then the normal compiler analysis recognizes the reduction just as if you wrote the loop yourself. In other words, the compiler is not specifically recognizing that the sum intrinsic is being used; rather, it just falls out through the analysis. For other intrinsics that get turned into a function call, such as matmul, the compiler will not accelerate them.
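To make this concrete, here's a minimal sketch (program name, array size, and values are my own, not from the post) showing the two equivalent forms: the sum intrinsic inside an accelerator region, and the hand-written loop it gets inlined to. The compiler's analysis should recognize the same reduction in both:

```fortran
program reduction_demo
  implicit none
  integer, parameter :: n = 1024
  real :: a(n), s1, s2
  integer :: i

  a = 1.0

  ! Using the intrinsic: inlined to a loop, then the
  ! reduction is recognized by the normal analysis.
  !$acc region
  s1 = sum(a)
  !$acc end region

  ! Equivalent hand-written loop: same analysis applies,
  ! same optimized sum reduction is generated.
  s2 = 0.0
  !$acc region
  do i = 1, n
     s2 = s2 + a(i)
  end do
  !$acc end region

  print *, s1, s2
end program reduction_demo
```

Checking the -Minfo=accel output for both regions should show the compiler reporting a sum reduction for s1 and s2 alike.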
Hope this helps,