Thanks Tuan. I’ve submitted a problem report (TPR#18552) and sent it to our compiler engineers.
Note that in this case a temporary host array is created, the device array is copied into that temporary, and the SUM is then performed on the host. As a workaround, and perhaps a permanent change you might consider, you could instead perform the summation on the device. Granted, this would most likely make your results slightly different, since the reduction would be done in parallel and floating-point addition is not associative, so a different summation order can change the rounding.
The simplest way to do this would be to add an Accelerator region around the SUM, and add the compile flag “-ta=nvidia”.
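For illustration, here is a rough sketch of what that might look like. I don't have your source in front of me, so the program name, the array `a`, its size `n`, and the variable `total` are all made up, and `a` is an ordinary host array standing in for your data. Treat it as a starting point to adapt rather than a tested fix.

```fortran
program sum_region
  implicit none
  integer, parameter :: n = 1000000   ! hypothetical size
  real, allocatable  :: a(:)          ! stand-in for your array
  real               :: total

  allocate(a(n))
  a = 1.0

  ! PGI Accelerator region: the compiler is asked to generate device
  ! code for the enclosed SUM instead of running it on the host.
!$acc region
  total = sum(a)
!$acc end region

  print *, 'sum = ', total
  deallocate(a)
end program sum_region
```

Assuming the file is saved as sum_region.f90 (a hypothetical name), it could be built with something like:

```
pgfortran -ta=nvidia -Minfo=accel sum_region.f90
```

The -Minfo=accel flag is optional. It makes the compiler print messages about what it generated for the region, which is a quick way to check whether the reduction was actually offloaded.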