I have a standard OpenACC reduction loop like this:
    s = 0.
    !$ACC PARALLEL LOOP REDUCTION(+:s) PRESENT(x,y)
    do i=1,n
      s = s + x(i)*y(i)
    enddo
The PGI_ACC_TIME output for the function is:
    vlxy_acc NVIDIA devicenum=0
        time(us): 111,062
        2209: compute region reached 13594 times
            2209: kernel launched 13594 times
                grid:  block:
                elapsed time(us): total=223,901 max=50 min=15 avg=16
            2209: reduction kernel launched 13594 times
                grid:  block:
                elapsed time(us): total=165,437 max=44 min=11 avg=12
        2209: data region reached 54376 times
            2209: data copyin transfers: 13594
                device time(us): total=41,611 max=21 min=3 avg=3
            2216: data copyout transfers: 13594
                device time(us): total=69,451 max=36 min=4 avg=5
So there is one copyin and one copyout transfer per call for the scalar “s”, which is not intended since “s” is only used on the device. I just read this in the OpenACC docs:
"… the reduction clause implies copy(s) on the compute construct, "
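If I read that sentence literally, the loop behaves as if copy(s) were written on the construct, which would match the one copyin and one copyout per launch in the profile. A minimal sketch of that reading follows; the function name `dot_acc`, the explicit interface, and double precision are my assumptions for illustration, not the actual code:

```fortran
! Sketch only: the same reduction with the implied copy(s) written out.
! dot_acc is a hypothetical name; x and y are assumed to be present
! on the device already (e.g. from an enclosing data region).
function dot_acc(x, y, n) result(s)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: x(n), y(n)
  real(8) :: s
  integer :: i

  s = 0.0d0
  ! copy(s): s is copied to the device on entry and back to the host on exit,
  ! i.e. one copyin and one copyout transfer for every call.
  !$acc parallel loop reduction(+:s) copy(s) present(x, y)
  do i = 1, n
     s = s + x(i)*y(i)
  end do
end function dot_acc
```

Without OpenACC enabled the directive is just a comment, so the function still computes the same dot product on the host.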
Is there any way to eliminate the implicit “copy”?