Hello,

I would like to ask about applying the **reduction clause** directly on the GPU. I have classes (**TensorND**), where N is the number of dimensions (up to 6 in my case). Some of their member functions compute the sum of the elements along a specific direction. Below are two examples for the 1D case, where I have an array on the GPU and want the sum of its elements using a reduction on the GPU, without copying data back and forth between the GPU and the CPU.

**The first version of the function** `sum` returns a value. I therefore expect that the calculation is performed on the GPU and the result is copied back to the CPU. To make this sum available on the GPU again, I would have to copy it back to the GPU in the main program, which logically should not be needed at all. Here is the function implementation:

```
T sum()
{
    T zero = 0.0;
    T sum = zero;
    /* PIP */
#ifdef _OPENACC
    #pragma acc parallel loop reduction(+ : sum)
#else
    #pragma omp parallel for reduction(+ : sum)
#endif
    for (U i = 0; i < globalDims_; i++)
        sum += values_[i];
    return sum;
}
```
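To make the round trip concrete, here is a minimal standalone sketch of the calling side. The free functions `sum_values` and `normalize`, the raw-pointer interface, and the normalization step are all hypothetical and not taken from the TensorND classes; they only illustrate a caller that needs the sum on the GPU again right after it has been returned to the CPU.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the value-returning sum(): the loop is reduced
// on the device, but the scalar result is returned to the host.
double sum_values(const double *values, std::size_t n)
{
    assert(values != nullptr);
    double sum = 0.0;
    #pragma acc parallel loop present(values[0:n]) reduction(+ : sum)
    for (std::size_t i = 0; i < n; i++)
        sum += values[i];
    return sum;
}

// Hypothetical caller that needs the sum on the device again: it divides
// every element by the total inside a second device loop.
void normalize(double *values, std::size_t n)
{
    assert(values != nullptr && n > 0);
    #pragma acc data copy(values[0:n])
    {
        double total = sum_values(values, n); // result arrives on the host
        assert(total != 0.0);
        // total is handed back to the device as a firstprivate scalar.
        #pragma acc parallel loop present(values[0:n]) firstprivate(total)
        for (std::size_t i = 0; i < n; i++)
            values[i] /= total;
    }
}
```

Without OpenACC enabled the pragmas are ignored and the same code runs serially on the host, so the sketch can be checked on any C++ compiler. The array stays on the device for the whole data region; only the scalar `total` makes the host round trip.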

**The second version** takes a reference, which I expected would let me solve the issue. However, as shown in the implementation below, the sum is calculated on the GPU and copied to the CPU (the `res` variable). Then I copy it back to the `sum` variable on the GPU. Hence, computationally, this is equivalent to the previous method.

```
void sum(T &sum)
{
    T zero = 0.0;
    sum = zero;
    T res = zero;
    /* PIP */
#ifdef _OPENACC
    #pragma acc parallel loop present(this, sum) reduction(+ : res)
#else
    #pragma omp parallel for reduction(+ : res)
#endif
    for (U i = 0; i < globalDims_; i++)
        res += values_[i];
    /* Trick */
    sum = res;
    #pragma acc update device(sum)
}
```

My question is: how can I implement such a function so that the result stays on the GPU? I mean something like:

```
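void sum(T &sumOnGPU)
{
    // Desired: reduce straight into a variable that already lives on the GPU
    #pragma acc parallel loop present(this, sumOnGPU) reduction(+ : sumOnGPU)
    for (U i = 0; i < globalDims_; i++)
        sumOnGPU += values_[i];
}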
```

Note that when I tried this in the second version above, I got the following error:

```
NVC++-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): No reduction scalar symbol (test_distFun1D_Tensor1D.cpp: 301)
Tensor1D<unsigned int, double>::sum(double &):
```

Thanks in advance for your help,