I have learned from a couple of articles that the NVCC compiler is able to perform warp aggregation for atomic operations (e.g., CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics | NVIDIA Technical Blog). Does the PGI Fortran compiler have a similar capability?
You should be able to replicate this in CUDA Fortran using Cooperative Groups. See: CUDA Fortran Programming Guide Version 22.7 for ARM, OpenPower, x86
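For reference, here is a minimal sketch of the manual pattern described in the linked blog post, written in CUDA C++ rather than CUDA Fortran. The `atomicAggInc` helper follows the blog; the `filterPositive` kernel and the host driver are only illustrative names. I have not checked that every call here has a direct counterpart in the CUDA Fortran `cooperative_groups` module, so treat the Fortran translation as something to verify against the programming guide.

```cuda
#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Warp-aggregated increment: the currently active threads elect a leader,
// the leader issues one atomicAdd for the whole group, and every thread
// derives its own slot from the leader's result.
__device__ int atomicAggInc(int *counter) {
    cg::coalesced_group active = cg::coalesced_threads();
    int base = 0;
    if (active.thread_rank() == 0)
        base = atomicAdd(counter, static_cast<int>(active.size()));
    // Broadcast the leader's base offset, then add this thread's rank.
    return active.shfl(base, 0) + static_cast<int>(active.thread_rank());
}

// Illustrative filter kernel (hypothetical): copy positive elements to dst.
__global__ void filterPositive(int *dst, int *count, const int *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && src[i] > 0)
        dst[atomicAggInc(count)] = src[i];
}

int main() {
    const int n = 1 << 20;
    int *src, *dst, *count;
    cudaMallocManaged(&src, n * sizeof(int));
    cudaMallocManaged(&dst, n * sizeof(int));
    cudaMallocManaged(&count, sizeof(int));

    int expected = 0;
    for (int i = 0; i < n; ++i) {
        src[i] = (i % 3 == 0) ? 1 : -1;   // every third element passes
        if (src[i] > 0) ++expected;
    }
    *count = 0;

    filterPositive<<<(n + 255) / 256, 256>>>(dst, count, src, n);
    cudaDeviceSynchronize();

    printf("kept %d, expected %d\n", *count, expected);
    cudaFree(src); cudaFree(dst); cudaFree(count);
    return 0;
}
```

The point of the pattern is that the number of atomic operations on `count` drops from one per passing thread to one per group of active threads. Note that the output order in `dst` is not deterministic.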
I meant to ask whether the PGI Fortran compiler can do it for me automatically. There is a note at the very top of the link I posted which says: “The NVCC compiler now performs warp aggregation for atomics automatically in many cases, …”. My question is whether the PGI Fortran compiler is able to do the same.
I am pretty sure the CUDA Fortran compiler does not do this automatically.