I have learned from a couple of articles that the NVCC compiler is able to perform warp aggregation for atomic operations (e.g., https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/). Does the PGI Fortran compiler have a similar capability?
You should be able to replicate this in CUDA Fortran using Cooperative Groups. See: https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide/index.html#cfref-fort-mods-dev-mod-coopgr
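For reference, here is a minimal sketch of the manual technique in CUDA C++, following the pattern described in the blog post linked in the question. It is shown in CUDA C++ because that is the form I am sure of; a CUDA Fortran port would need each call mapped onto whatever the cooperative groups module in the linked guide provides, and that mapping is not verified here. The kernel name `filterPositive`, the test data, and the launch configuration are illustrative assumptions.

```cuda
#include <cstdio>
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Warp-aggregated increment: the currently active threads of a warp
// issue a single atomicAdd instead of one atomicAdd per thread.
__device__ int atomicAggInc(int *counter) {
    cg::coalesced_group active = cg::coalesced_threads();
    int base = 0;
    // The lowest-ranked active thread reserves a slot for every active thread.
    if (active.thread_rank() == 0) {
        base = atomicAdd(counter, static_cast<int>(active.size()));
    }
    // Broadcast the reserved base index, then offset by each thread's rank.
    return active.shfl(base, 0) + static_cast<int>(active.thread_rank());
}

// Illustrative filter kernel (hypothetical name): keep positive values only.
__global__ void filterPositive(int *dst, int *count, const int *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && src[i] > 0) {
        dst[atomicAggInc(count)] = src[i];
    }
}

int main() {
    const int n = 1024;
    int *src = nullptr, *dst = nullptr, *count = nullptr;
    cudaMallocManaged(&src, n * sizeof(int));
    cudaMallocManaged(&dst, n * sizeof(int));
    cudaMallocManaged(&count, sizeof(int));

    // Made-up input: even indices hold positive values, odd indices negative.
    for (int i = 0; i < n; ++i) {
        src[i] = (i % 2 == 0) ? i + 1 : -i;
    }
    *count = 0;

    filterPositive<<<(n + 255) / 256, 256>>>(dst, count, src, n);
    cudaDeviceSynchronize();

    printf("kept %d of %d elements\n", *count, n);

    cudaFree(src);
    cudaFree(dst);
    cudaFree(count);
    return 0;
}
```

The order of elements in `dst` is not deterministic, since groups from different warps reserve their slots in whatever order their atomics complete.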
I meant to ask whether the PGI Fortran compiler can do it for me automatically. There is a note at the very top of the link I posted which says: “The NVCC compiler now performs warp aggregation for atomics automatically in many cases, …”. My question is whether the PGI Fortran compiler is able to do the same.
I am pretty sure the CUDA Fortran compiler does not perform warp aggregation for atomics automatically.