Hello,
I would like to ask about applying the reduction clause directly on the GPU. I have classes (TensorND), where N is the number of dimensions (up to 6 in my case). Some of their member functions compute the sum of the elements along a specific direction. Below are two examples for the 1D case, where I have an array on the GPU and want the sum of its elements computed by a reduction on the GPU, without copying data back and forth between GPU and CPU.
The first version of the function sum returns a value. I therefore expect the calculation to be performed on the GPU and the result to be copied back to the CPU. To make this sum available on the GPU again, I have to copy it back to the GPU in the main program, which logically should not be needed at all. Here is the function implementation:
T sum()
{
    T zero = 0.0;
    T sum = zero;
    /* PIP */
#ifdef _OPENACC
    #pragma acc parallel loop reduction(+ : sum)
#else
    #pragma omp parallel for reduction(+ : sum)
#endif
    for (U i = 0; i < globalDims_; i++)
        sum += values_[i];
    return sum;
}
The second version takes a reference, which I expected would let me solve the issue. However, as shown in the implementation below, the sum is still calculated on the GPU and copied to the CPU (the res variable). I then copy it back to the sum variable on the GPU. Computationally, this is therefore equivalent to the previous method.
void sum(T &sum)
{
    T zero = 0.0;
    sum = zero;
    T res = zero;
    /* PIP */
#ifdef _OPENACC
    #pragma acc parallel loop present(this, sum) reduction(+ : res)
#else
    #pragma omp parallel for reduction(+ : res)
#endif
    for (U i = 0; i < globalDims_; i++)
        res += values_[i];
    /* Trick */
    sum = res;
    #pragma acc update device(sum)
}
My question is: how can I implement such a function so that the result stays on the GPU? I mean something like
void sum(T &sumOnGPU)
{
    #pragma acc parallel loop present(, ) reduction(+ : sumOnGPU)
    for(i)
        sumOnGPU += values_[i];
}
When I tried this in the second version above, I got the following error:
NVC++-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): No reduction scalar symbol (test_distFun1D_Tensor1D.cpp: 301)
Tensor1D<unsigned int, double>::sum(double &):
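For reference, here is a rough sketch of the kind of function being asked for. It has not been verified with nvc++. It uses a single gang with a gang-private accumulator and stores the result into memory that is already on the device. The names sumToDevice and demoSum are hypothetical, it takes a pointer rather than a reference, and it is a free function rather than a TensorND member. Using one gang limits parallelism, so this only illustrates the pattern and is not a tuned implementation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical free-function variant of the member function in the post.
// Assumption: `values` and `sumOnGPU` are already present on the device
// (e.g. through an enclosing `acc data` region or `acc enter data`).
template <typename T>
void sumToDevice(const T* values, std::size_t n, T* sumOnGPU)
{
    assert(values != nullptr && sumOnGPU != nullptr);  // host-side check only

    // One gang: `res` is private to that gang, the vector loop reduces into
    // it, and the result is stored straight into the device copy of the
    // target. No host scalar appears in a reduction clause.
    #pragma acc parallel num_gangs(1) present(values[0:n], sumOnGPU[0:1])
    {
        T res = T(0);
        #pragma acc loop vector reduction(+ : res)
        for (std::size_t i = 0; i < n; ++i)
            res += values[i];
        sumOnGPU[0] = res;
    }
}

// Small driver used only to check the result on the host. The `copy(s)`
// clause brings the value back when the data region ends; in real code
// the value would simply stay on the device for later kernels.
inline double demoSum()
{
    double v[4] = {1.0, 2.0, 3.0, 4.0};
    double s = -1.0;
    #pragma acc data copyin(v[0:4]) copy(s)
    {
        sumToDevice(v, 4, &s);
    }
    return s;
}
```

Without OpenACC enabled, the pragmas are ignored and the same code runs serially on the host, which makes the logic easy to check before trying it on the GPU.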
Thanks in advance for your help,