How to correctly apply a reduction over array elements on the GPU without copying between host and device?

Hello,
I would like to ask about applying the reduction clause directly on the GPU. I have classes (TensorND), where N refers to the number of dimensions (in my case it can reach 6). Some of their member functions calculate the sum of the elements along a specific direction. Below are two examples for the 1D case, where I have an array on the GPU and I want the sum of its elements computed with a reduction on the GPU, without copying back and forth between GPU and CPU.

The first version of the function sum returns a value, so I expect the calculation to be performed on the GPU and the result copied back to the CPU. To make this sum available on the GPU again, I would then have to copy it back to the GPU in the main program, which, logically, should not be needed at all. Here is the function implementation:

T sum()
    {   
        T zero = 0.0;
        T sum = zero;

        /* PIP */
        #ifdef _OPENACC 
            #pragma acc parallel loop reduction(+ : sum)
        #else 
            #pragma omp parallel for reduction(+ : sum)
        #endif
        for (U i = 0; i < globalDims_; i++)
            sum += values_[i];
        
        return sum;
    }

The second version works with a reference, where I expected I could do something to solve the issue. However, as shown in the implementation below, the sum is still calculated on the GPU and copied to the CPU (the res variable), and I then copy it back to the sum variable on the GPU. Computationally, this is therefore equivalent to the previous method.

void sum(T &sum)
    {   
        T zero = 0.0;
        sum = zero;
        T res = zero;

        /* PIP */
        #ifdef _OPENACC 
            #pragma acc parallel loop present(this, sum) reduction(+ : res)
        #else 
            #pragma omp parallel for reduction(+ : res)
        #endif
        for (U i = 0; i < globalDims_; i++)
            res += values_[i];
        
        /* Trick */
        sum = res; 
        #pragma acc update device(sum)
    }

My question is: how can I implement such a function so that the result stays on the GPU? I mean something like

void sum(T &sumOnGPU)
{
    #pragma acc parallel loop present(, ) reduction(+ : sumOnGPU)
    for (U i = 0; i < globalDims_; i++)
        sumOnGPU += values_[i];
}

Note that when I tried that in the second version above, I got the following error:

NVC++-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): No reduction scalar symbol (test_distFun1D_Tensor1D.cpp: 301)
Tensor1D<unsigned int, double>::sum(double &):

Thanks in advance for your help,

Sorry, but I’m not clear on the question. Are you asking how to perform a reduction so the variable does not need to be copied back to the host?

If the scalar variable used in the reduction is present on the device, then no implicit copy back will be performed.

For example, in your second case, if “sum” is used instead of “res”, the device copy of sum will be used to store the final reduction value and you can remove the “update device” after the reduction loop. However, sum wouldn’t be initialized on the device, so you’d need to add the “update device” before the loop in order to initialize it to zero.

As for the error, “No reduction scalar symbol”, do you have a minimal reproducing example?

The error indicates that a non-scalar variable is being used. I presume “T” is a double, so it’s unclear why you’d be getting an error. I’d need to see it in context.

-Mat


When I use sum instead of res, I get the error mentioned above. However, I will put together a simple example for you to see by tomorrow.
Concerning T, you are right, it can be a double.

Thanks a lot for all your valuable explanations for this question and the others I asked before.

Hello Mat,
Thanks a lot again. Here is a simple code that reproduces the above-mentioned error:

#include <vector>
#include <iostream>
#include <cmath>

template <typename U, typename T> 
class Tensor1D 
{
    public: 
    U globalDims_; 
    U dims_[1];
    T *values_; 

    Tensor1D() {};
    Tensor1D(std::vector<U> dims) 
    {   
        const T zero = 0.0;
        if(dims.size() != 1)
        {
            std::cout << "---> [Tensor1D - allocateTensor]: size of 1D tensor "
                        << "must be 1. Please check (^_*)!\n";
            exit(0);
        }

        dims_[0] = dims[0];
        globalDims_ = dims_[0];

        /* Allocate memory and initialize with 0 */
        values_ = new T [ dims_[0] ];
        
        /* PIP */
        for (U i = 0; i < dims_[0]; i++) 
            values_[i] = zero;  

        std::cout << "---> [Tensor1D - Constructor]: [Tensor1D] object created "
                    << "(^_^)!\n";
    }

    Tensor1D(std::vector<U> dims, std::string name) 
    {
        std::cout << "---> [Tensor1D - Constructor]: (Tensor1D) object "
                        << "successfully created (^_^)!\n";
    }

    /* operator overloading */
    /* Accessing operators */
    #ifdef _OPENACC 
        #pragma acc routine seq
    #endif
    inline T& operator () (U i)
    {   
        return values_[i];
    }

    #ifdef _OPENACC 
        #pragma acc routine seq
    #endif
    inline T operator () (U i) const
    {   
        return values_[i];
    }

    /* Update device and host */
    inline void accUpdateDevice()
    {
        #ifdef _OPENACC
            #pragma acc update device(values_[0 : globalDims_])
        #endif
    }
    inline void accUpdateHost()
    {   
        #ifdef _OPENACC
            #pragma acc update self(values_[0 : globalDims_])
        #endif
    }

    /* Values Printer */
    void printValues()
    {
        std::cout << "---> [Tensor1D - Printer]: Tensor elements:\n";
        for(U i = 0; i < dims_[0]; i++)
            std::cout << values_[i] << "\t";
        std::cout << "\n";
    }

    /* Destructor */
    ~Tensor1D() 
    {          
        if(values_ != NULL)
            delete [] values_;
        std::cout << "---> [Tensor1D - Destructor]: (Tensor1D) object "
                        << "successfully destroyed (^_*)!\n";
    }
    
    /* Assign to a value */
    Tensor1D& operator = (T s) 
    {   
        /* PIP */
        #ifdef _OPENACC 
            #pragma acc parallel loop present(this)
        #else 
            #pragma omp parallel for
        #endif
        for(U i = 0; i < globalDims_; i++)
            values_[i] = s;
        return *this;
    }

    /* Useful functions : calculations */
    protected: 
    T sum()
    {   
        T zero = 0.0;
        T sum = zero;

        /* PIP */
        #ifdef _OPENACC 
            #pragma acc parallel loop reduction(+ : sum)
        #else 
            #pragma omp parallel for reduction(+ : sum)
        #endif
        for (U i = 0; i < globalDims_; i++)
            sum += values_[i];
        
        return sum;
    }

    void sum(T &sum)
    {   
        T zero = 0.0;
        sum = zero;
        T res = zero;

        /* PIP */
        #ifdef _OPENACC 
            #pragma acc parallel loop present(this, sum) reduction(+ : sum)
        #else 
            #pragma omp parallel for reduction(+ : sum)
        #endif
        for (U i = 0; i < globalDims_; i++)
            sum += values_[i];
    }
};

template <typename U, typename T>
class distFun1P : public Tensor1D<U, T>
{
    public:
    distFun1P(std::vector<U> dims): 
                Tensor1D<U, T>(dims)
    {
        #ifdef _OPENACC 
            #pragma acc enter data copyin(this)
            #pragma acc update device(dims_[0 : 1])
            #pragma acc enter data create(values_[0 : globalDims_])
            #pragma acc update device(values_[0 : globalDims_])
        #endif
        std::cout << "---> [distFun1P - Constructor]: (distFun1P) object "
                    <<"successfully created (^_^)!\n";
    }
    /* Destructor */
    ~distFun1P()
    {
        #ifdef _OPENACC 
            #pragma acc exit data delete (values_[0 : globalDims_])
            #pragma acc exit data delete(this)
        #endif
        std::cout << "---> [distFun1P - Destructor]: (distFun1P) object "
                    <<"successfully destroyed (^_*)!\n";
    }

    /* Operator overloading : Assign to a value*/
    distFun1P& operator = (const T s)
    {
        Tensor1D<U, T>::operator=(s);
        return *this;
    }

    /* Uploaders */
    inline void updateDevice()
    {
        this->accUpdateDevice();
    }
    inline void updateHost()
    {
        this->accUpdateHost();
    }

    T findTotalSum()
    {
        return Tensor1D<U, T>::sum();
    }

    void findTotalSum(T &sum)
    {
        Tensor1D<U, T>::sum(sum);
    }   
};

int main()
{
    using us = unsigned int;
    using real = double;
    using intVec = std::vector<us>;
    using ten1D = Tensor1D<us, real>;
    using df1P = distFun1P<us, real>; 

    us Nx = 8; 
    intVec dims{Nx};
    
    /* Define the distribution function */
    df1P df(dims);
    df = 3.0; 

    real sumOnGPU = 0.0;
    #pragma acc enter data create(sumOnGPU)
    df.findTotalSum(sumOnGPU);
    
    std::cout << "sumOnGPU before updating from GPU : " 
                << sumOnGPU << "\n";

    #pragma acc update self(sumOnGPU)
    std::cout << "sumOnGPU after updating from GPU : " 
                << sumOnGPU << "\n";

}

The error message is:

NVC++-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): No reduction scalar symbol (SumOnGPU_ReturnScalar.cpp: 131)
Tensor1D<unsigned int, double>::sum(double &):

And the contents of the compile script are:

compileNV=/opt/nvidia/hpc_sdk/Linux_x86_64/2022/compilers/bin/nvc++
FLAGS="-O2 -g -gopt -Kieee -Minfo=accel -acc=noautopar,sync,gpu -gpu=cc75,cuda10.2,lineinfo,ptxinfo -cudalib=curand"
export NVCOMPILER_ACC_NOTIFY=3
$compileNV -o testGPu SumOnGPU_ReturnScalar.cpp $FLAGS

By the way, I noticed something when using classes with OpenACC. In order for a class to be properly created on the GPU, I must arrange its members starting with the scalars, then fixed-size arrays, then dynamically allocated array members. Otherwise, I always obtain errors. To reproduce this issue, you can simply change the order of the public members of the Tensor1D class, for example by putting T *values_ first.

Thanks for your help,

I should have asked earlier, what compiler version are you using?

In 22.3, I believe we added support for scalar references in reductions and I can only reproduce your error with 22.1 and earlier:

% nvc++ -acc test.cpp -gpu=cc80 -w -V22.1
NVC++-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): No reduction scalar symbol (test.cpp: 131)
NVC++-F-0704-Compilation aborted due to previous errors. (test.cpp)
NVC++/x86-64 Linux 22.1-0: compilation aborted
% nvc++ -acc test.cpp -gpu=cc80 -w -V22.3
%

Try updating your compiler to the latest version: https://developer.nvidia.com/nvidia-hpc-sdk-downloads

In order for a class to be properly created on the GPU, I must arrange its members starting with the scalars, then fixed-size arrays, then dynamically allocated array members. Otherwise, I always obtain errors. To reproduce this issue, you can simply change the order of the public members of the Tensor1D class, for example by putting T *values_ first.

Hmm, this shouldn’t matter, and when I tried it with this example by putting the values array first, I saw no difference. Something else must be going on.

% diff test.cpp test1.cpp
8a9
>     T *values_;
11d11
<     T *values_;
% nvc++ test.cpp -w -acc -V23.1; a.out
---> [Tensor1D - Constructor]: [Tensor1D] object created (^_^)!
---> [distFun1P - Constructor]: (distFun1P) object successfully created (^_^)!
sumOnGPU before updating from GPU : 0
sumOnGPU after updating from GPU : 24
---> [distFun1P - Destructor]: (distFun1P) object successfully destroyed (^_*)!
---> [Tensor1D - Destructor]: (Tensor1D) object successfully destroyed (^_*)!
% nvc++ test1.cpp -w -acc -V23.1 ; a.out
---> [Tensor1D - Constructor]: [Tensor1D] object created (^_^)!
---> [distFun1P - Constructor]: (distFun1P) object successfully created (^_^)!
sumOnGPU before updating from GPU : 0
sumOnGPU after updating from GPU : 24
---> [distFun1P - Destructor]: (distFun1P) object successfully destroyed (^_*)!
---> [Tensor1D - Destructor]: (Tensor1D) object successfully destroyed (^_*)!

Thanks, @MatColgrove. Sorry for the late reply. I will follow your suggestions tomorrow and post my feedback here.

Concerning the compiler version, it is the following

nvc++ 22.1-0 64-bit target on x86-64 Linux -tp haswell 
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.