C++ class in OpenACC

Hi,

I am trying to learn how to use OpenACC in C++ classes that have pointers.
While I believe I have the vector implementation (“ScalarField”) correct, something goes wrong when another class uses ScalarField objects as members:

#include <iostream>

#define Scalar double

class ScalarField {
    private:
        const int _size = 256;
        Scalar* _data;
    public:
        ScalarField() {
            _data = new Scalar[_size];
#pragma acc enter data copyin(this) create(_data[0:_size])
        }
        ~ScalarField() {
            delete[] _data;
#pragma acc exit data delete(_data,this)
            _data = 0;
        }

#pragma acc routine seq
        int size() const {
            return _size;
        }
#pragma acc routine seq
        Scalar& operator [] (int i) const {
            return _data[i];
        }
        void update_self() {
#pragma acc update self(_data[0:_size])
        }
        void update_device() {
#pragma acc update device(_data[0:_size])
        }
};

class Matrix {
    public:
        ScalarField Su;
        ScalarField cF;

        void update_self() {
            Su.update_self();
            cF.update_self();
        }
        void update_device() {
            Su.update_device();
            cF.update_device();
        }
};

int main() {
    Matrix m;
    ScalarField T;

#pragma acc parallel loop
    for(int i = 0; i < T.size(); i++) {
        T[i] = Scalar(i);
        m.Su[i] = Scalar(1.0);
        m.cF[i] = T[i] * 3 + m.Su[i];
    }

    T.update_self();
    m.update_self();

    for(int i = 0; i < T.size(); i++) {
        std::cout << i << " " << T[i] << " " << m.Su[i] << " " << m.cF[i] << std::endl;
    }

    return 0;
}

I am using the following to compile and run the code

nvc++ -acc -Minfo=accel -ta=tesla deep.cpp
./a.out

I have tried using -ta=tesla:managed and -ta=tesla:deepcopy but both failed.
On the other hand, if I remove the “enter data” and “exit data” statements and use -ta=tesla:managed, it works! It is my understanding that data directives are not used when using managed memory and that they just serve as hints for prefetching, so I am not sure why having the enter/exit data statements changes the behaviour.

Hi danxpy,

I have tried using -ta=tesla:managed and -ta=tesla:deepcopy but both failed.

Here’s the error:

% nvc++ -acc test.org.cpp -Minfo=accel -gpu=managed ; a.out
main:
     53, Generating NVIDIA GPU code
         56, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     53, Generating implicit copy(T,m) [if not already present]
ScalarField::ScalarField():
     13, Generating enter data copyin(this[:1])
         Generating enter data create(_data[:_size])
ScalarField::~ScalarField():
     17, Generating exit data delete(this[:1],_data[:1])
ScalarField::size() const:
     21, Generating acc routine seq
         Generating NVIDIA GPU code
ScalarField::operator[](int) const:
     25, Generating acc routine seq
         Generating NVIDIA GPU code
ScalarField::update_self():
     30, Generating update self(_data[:_size])
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 8.0, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
host:0x7ffdca894980 device:0x1497212fa400 size:16 presentcount:1+1 line:13 name:this[:1]
host:0x7ffdca8949e8 device:0x1497212fa000 size:16 presentcount:0+1 line:13 name:this[:1]
host:0x7ffdca8949f8 device:0x1497212fa200 size:16 presentcount:0+1 line:13 name:this[:1]
allocated block device:0x1497212fa000 size:512 thread:1
allocated block device:0x1497212fa200 size:512 thread:1
allocated block device:0x1497212fa400 size:512 thread:1

Present table errors:
m lives at 0x7ffdca8949e8 size 32 partially present in
host:0x7ffdca8949e8 device:0x1497212fa000 size:16 presentcount:0+1 line:13 name:this[:1] file:test.org.cpp
host:0x7ffdca8949f8 device:0x1497212fa200 size:16 presentcount:0+1 line:13 name:this[:1] file:test.org.cpp
FATAL ERROR: variable in data clause is partially present on the device: name=m
 file: test.org.cpp main line:53

The Matrix “m” is not in a data region, so the compiler needs to implicitly copy it when entering the compute region (as indicated in the compiler feedback messages). However, its two static members are already present on the device since ScalarField’s constructor copies in the “this” pointer. Part, but not all, of “m” is already present, hence the error.

CUDA Unified Memory (aka “managed”) is only available for dynamically allocated memory, so only “_data” is being managed. Since the error is caused by the static members of Matrix, it still occurs when using managed.

When you remove the data regions, the runtime can then do a shallow copy of “m” and its members, with “_data” being a unified memory pointer that’s accessible on both the host and the device.

If you changed the code so “m”, “T”, and m’s two data members are allocated, then they all would be put in unified memory.

If you wish to manually manage the data, you need to make a few changes. The way you have “ScalarField” would work if it were not a member of Matrix. The problem is that Matrix also needs to be on the device, and the copy needs to come from the top down.

To fix this, instead of putting the data region in the constructor, move it to a separate routine. Next, add a constructor to Matrix which copies itself to the device and then calls this initialization routine to get ScalarField’s _data onto the device.

Something like the following:

% cat test.cpp

#include <cstdlib>
#include <iostream>

#define Scalar double

class ScalarField {
    private:
        const int _size = 256;
        Scalar* _data;
    public:
        ScalarField() {
            _data = new Scalar[_size];
        }
        ~ScalarField() {
            delete[] _data;
#pragma acc exit data delete(_data,this)
            _data = 0;
        }

        void init() {
#pragma acc enter data copyin(this) create(_data[0:_size])
        }

#pragma acc routine seq
        int size() const {
            return _size;
        }
#pragma acc routine seq
        Scalar& operator [] (int i) const {
            return _data[i];
        }
        void update_self() {
#pragma acc update self(_data[0:_size])
        }
        void update_device() {
#pragma acc update device(_data[0:_size])
        }
};

class Matrix {
    public:
        ScalarField Su;
        ScalarField cF;

        Matrix () {
#pragma acc enter data copyin(this)
            Su.init();
            cF.init();
        }

        void update_self() {
            Su.update_self();
            cF.update_self();
        }
        void update_device() {
            Su.update_device();
            cF.update_device();
        }
};

int main() {
    Matrix m;
    ScalarField T;

    T.init();
#pragma acc parallel loop present(m,T)
    for(int i = 0; i < T.size(); i++) {
        T[i] = Scalar(i);
        m.Su[i] = Scalar(1.0);
        m.cF[i] = T[i] * 3 + m.Su[i];
    }

    T.update_self();
    m.update_self();

    for(int i = 0; i < T.size(); i++) {
        std::cout << i << " " << T[i] << " " << m.Su[i] << " " << m.cF[i] << std::endl;
    }

    return 0;
}
% nvc++ -acc test.cpp -Minfo=accel ; a.out
main:
     66, Generating present(T,m)
         Generating NVIDIA GPU code
         68, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
ScalarField::~ScalarField():
     18, Generating exit data delete(this[:1],_data[:1])
ScalarField::init():
     23, Generating enter data copyin(this[:1])
         Generating enter data create(_data[:_size])
ScalarField::size() const:
     26, Generating acc routine seq
         Generating NVIDIA GPU code
ScalarField::operator[](int) const:
     30, Generating acc routine seq
         Generating NVIDIA GPU code
ScalarField::update_self():
     35, Generating update self(_data[:_size])
Matrix::Matrix():
     48, Generating enter data copyin(this[:1])
0 0 1 1
1 1 1 4
2 2 1 7
3 3 1 10
... cut due to length ...
253 253 1 760
254 254 1 763
255 255 1 766

Hope this helps,
Mat


Hi Mat,

Thank you for the detailed response! I understand the problem with the Matrix class containing ScalarField members now. I expected that the data directives would be ignored when using managed memory, but that is not the case. I think I will go with managed memory, remove the enter/exit data statements, and let the compiler manage things for me, since the other option is too complex for me.

I am also encountering another issue with managed memory that necessitates the use of copyin. When passing a C++ class object to a function by constant reference, I have to do a copyin on it for it to be usable in an acc loop. The same applies to other references declared at global scope. I guess references are viewed as pointers, and if they are declared static or on the local stack, they have to be copied manually since managed memory doesn’t take care of them.

I have based the above example on the presentation you gave in 2014 on managed memory. I am curious about what new things have been added regarding managed memory since then that may be of help when dealing with C++ code.

Thank you!

Correct, the data regions are not ignored; it’s that the data is already present when using a unified memory address. Again, currently only allocated memory can be placed in UM, so static memory still needs to be manually managed.

I think I will go with managed memory + remove the enter/exit data statements and let the compiler manage things for me, since the other option is too complex for me.

The problem here is that since your objects are static, they will still need to be copied to/from the device. Removing the data regions will cause the compiler to implicitly copy them, and this will be done every time the code enters a compute region. Performance will suffer due to this extra data movement.

If you do want to exclusively use UM, then be sure to change “m” and “T” to be allocated so they will be placed in unified memory. The data members of Matrix do not need to be allocated, since they are part of “m”.

I am also encountering another issue with managed memory that necessitates use of copyin.

Are the objects being passed in static? If so, then this is likely the problem.

Just passing an object by reference won’t put it in UM; that occurs when the object is allocated.

I have based the above example on the presentation you gave in 2014 on managed memory. I am curious about what new things have been added regarding managed memory since then that may be of help when dealing with C++ code.

I don’t think much has changed since then. If I remember correctly, I was using a single class. Your example uses a class within another class, so it needs to be handled differently.

We are working on adding the ability for UM to cover static objects and even stack memory, though it will initially only be available on Grace Hopper based systems. I can’t give a timeline on when it will be available, but it will mean that you won’t need to manage memory at all, which will greatly reduce the complexity of using GPUs.