OpenACC "FATAL ERROR: data in PRESENT clause was not found on device" in basic matrix class

I’m trying to write a simple matrix class that works with OpenACC, but I get a runtime error as soon as I uncomment the code that defines the multiplication operator overload, even though the multiplication operator is never actually called when running the code. The code is as follows:

#include <stdlib.h>
#include <stddef.h>
#include <stdio.h>

class Matrix {
    public:
        Matrix();
        Matrix(const size_t num_rows, const size_t num_cols);
        Matrix(const Matrix &m); // Copy constructor
        ~Matrix();

        size_t num_rows() const;
        size_t num_cols() const;

        float operator()(const size_t row, const size_t col) const;
        float& operator()(const size_t row, const size_t col);
        Matrix& operator=(const Matrix &m); // Copy assignment operator
        Matrix operator*(const Matrix &m2);

    private:
        float *data_;
        size_t num_rows_;
        size_t num_cols_;
};

Matrix::Matrix(const size_t num_rows, const size_t num_cols) :
    num_rows_(num_rows),
    num_cols_(num_cols)
{
    // Use malloc instead of new since new can't be used if Matrix is created inside an acc routine seq function?
    // For more info, see https://stackoverflow.com/questions/43497285/openacc-the-c-new-operator-issue
    data_ = (float*)malloc(num_rows_*num_cols_*sizeof(float));

    #pragma acc enter data copyin(this)
    #pragma acc enter data create(data_[0:num_rows_*num_cols_])
}

// Copy constructor
Matrix::Matrix(const Matrix &m) :
    num_rows_(m.num_rows_),
    num_cols_(m.num_cols_)
{    
    // Use malloc instead of new since new can't be used if Matrix is created inside an acc routine seq function?
    // For more info, see https://stackoverflow.com/questions/43497285/openacc-the-c-new-operator-issue
    data_ = (float*)malloc(num_rows_*num_cols_*sizeof(float));
    
    #pragma acc enter data copyin(this)
    #pragma acc enter data create(data_[0:num_rows_*num_cols_])
    
    #pragma acc parallel loop present(data_[0:num_rows_*num_cols_], m)
    for(size_t i = 0; i < num_rows_; ++i) {
        for(size_t j = 0; j < num_cols_; ++j) {
            (*this)(i,j) = m(i,j);
        }
    }
}

Matrix::~Matrix() {
    #pragma acc exit data delete(data_)
    #pragma acc exit data delete(this)

    free(data_);
}

// Copy assignment operator
#pragma acc routine seq
Matrix& Matrix::operator=(const Matrix &m) {
    if(this == &m){
        return *this;
    }

    for(size_t i = 0; i < this->num_rows_; ++i) {
        for(size_t j = 0; j < this->num_cols_; ++j) {
            (*this)(i,j) = m(i,j);
        }
    }

    return *this;
}

#pragma acc routine seq
size_t Matrix::num_rows() const {
    return num_rows_;
}

#pragma acc routine seq
size_t Matrix::num_cols() const {
    return num_cols_;
}

#pragma acc routine seq
float Matrix::operator()(const size_t row, const size_t col) const {
    return data_[row*num_cols_ + col];
}

#pragma acc routine seq
float& Matrix::operator()(const size_t row, const size_t col) {
    return data_[row*num_cols_ + col];
}

// ERROR SEEMS TO BE OCCURRING HERE!
#pragma acc routine seq
Matrix Matrix::operator*(const Matrix &m2) {

    Matrix m3(this->num_rows_, m2.num_cols_);
    for(size_t i = 0; i < this->num_rows_; ++i) {
        for(size_t j = 0; j < m2.num_cols_; ++j) {
            m3(i,j) = 0.0f;
            for(size_t k = 0; k < this->num_cols_; ++k) {
                m3(i,j) += (*this)(i,k)*m2(k,j);
            }
        }
    }

    return m3;
}

int main() {
    const size_t num_rows = 4;
    const size_t num_cols = 4;

    Matrix m1(num_rows,num_cols);
    Matrix m2(num_rows,num_cols);
    Matrix m3(num_rows,num_cols);

    #pragma acc parallel loop present(m1, m2, m3)
    for(size_t i = 0; i < num_rows; ++i) {
        for(size_t j = 0; j < num_cols; ++j) {
            m1(i,j) = 2.5678f;
            m2(i,j) = -1.432f;
            m3(i,j) = 0.0f;
        }
    }
    
    #pragma acc kernels present(m1, m2, m3)
    { 
        m3 = m1;
    }

    return 0;
}

If I run the code with the multiplication operator overload commented out, the program works as expected. However, if I uncomment the multiplication operator overload and run the program again, it crashes with the following error:

hostptr=0x7ffc99c6cd98,eltsize=24,name=m1,flags=0x200=present,async=-1,threadid=1
FATAL ERROR: data in PRESENT clause was not found on device 1: name=m1 host:0x7ffc99c6cd98
 file:/home/jeff/dev/tests/matrix_tests.cpp main line:124

However, the error points to line 124, which is the following:
Matrix m3(num_rows,num_cols);

So I’m at a bit of a loss as to why the program is crashing, especially since multiplication is never used. The compiler I’m using is nvc++ 21.3 and the OS is Ubuntu 18.04. Any help would be appreciated.

Hi jeffr1992,
The 21.7 release shows the issue a bit better with the following syntax error:

% nvc++ -acc -Minfo=accel test.org.cpp -V21.7
main:

Matrix::Matrix(unsigned long, unsigned long):
29, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
NVC++-S-1065-Unsupported nested compute construct in compute construct or acc routine (test.org.cpp: 45)

The problem is that you’re calling the constructor from within device code, which requires the compiler to implicitly generate a device constructor. However, since data and compute directives can’t be used on the device, they are silently ignored in 21.5 (and produce a syntax error in 21.7). The “m” objects therefore aren’t getting created on the device, hence the present error.

You can work around this by creating two constructors, one for the device and one for the host, with only the host version containing the directives, but I highly advise not allocating data on the device. Besides device-side mallocs being slow, the default heap size is small, often leading to a heap overflow. While you can work around this by setting the environment variable “NV_ACC_CUDA_HEAPSIZE” to increase the heap size, it’s not ideal.
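
As a rough, untested sketch of the two-constructor idea (the boolean tag parameter here is just one way to select the device-only overload; it’s not anything required by OpenACC):

class Matrix {
    ...
    // Host-side constructor: allocates host memory and mirrors it on the device.
    Matrix(const size_t num_rows, const size_t num_cols);

    // Device-side constructor: plain malloc only, no data directives,
    // since those aren't allowed in device code.
    #pragma acc routine seq
    Matrix(const size_t num_rows, const size_t num_cols, bool on_device);
    ...
};

Matrix::Matrix(const size_t num_rows, const size_t num_cols, bool on_device) :
    num_rows_(num_rows),
    num_cols_(num_cols)
{
    // Device-side heap allocation: slow and limited by the device heap size.
    data_ = (float*)malloc(num_rows_*num_cols_*sizeof(float));
}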

Also, since you’re returning the newly allocated “m3” object from the multiply operator, this address will only be accessible on the device during the execution of this one kernel. If you want to use this memory on the host or across kernel calls, it will fail.

I highly suggest you rethink your methods so no device side allocation is necessary.

Hope this helps,
Mat

Hi Mat, thanks for the response. You’ve stirred up a bunch of new questions on my side, which I hope you’ll be able to help me with :). The reason I’m trying to make this class is because it seems like the Eigen and OpenCV libraries aren’t well supported in OpenACC (e.g. I can’t use Eigen::Matrix4f or cv::Mat data structures and their corresponding member functions), so I wanted to write my own basic matrix data structure that could be instantiated and used within a device function.

  1. Just so I understand you correctly, when you say “data and compute directives can’t be used on the device”, are you referring, in the case of data directives, to the following lines in the constructors:
#pragma acc enter data copyin(this)
#pragma acc enter data create(data_[0:num_rows_*num_cols_])

and, in the case of compute directives that aren’t allowed on the device, would this be something along the lines of the following:

#pragma acc routine seq
void do_something() {
    ...
    
    #pragma acc parallel loop
    for(size_t i = 0; i < num_elements; ++i){
        ...
    }
    
    ...
}

where a parallel loop is called within a device function?

If the above is true, I wanted to ask a question about the following code that creates a stack-allocated 4x4 matrix:

struct Matrix4 {
    static const size_t num_rows = 4;
    static const size_t num_cols = 4;
    float data[num_rows*num_cols];
    
    Matrix4() {
        #pragma acc enter data copyin(this)
    }
    ~Matrix4() {
        #pragma acc exit data delete(this)
    }
    
    #pragma acc routine seq
    float operator()(const size_t row, const size_t col) const {
        return data[row*num_cols + col];
    }

    #pragma acc routine seq
    float& operator()(const size_t row, const size_t col) {
        return data[row*num_cols + col];
    }

    #pragma acc routine seq
    Matrix4 operator*(const Matrix4 m2) const {
        
        Matrix4 m3;
        for(size_t i = 0; i < this->num_rows; ++i) {
            for(size_t j = 0; j < m2.num_cols; ++j) {
                m3(i,j) = 0.0f;
                for(size_t k = 0; k < this->num_cols; ++k) {
                    m3(i,j) += (*this)(i,k)*m2(k,j);
                }
            }
        }

        return m3;
    }
};

As an example of using the above struct within a device function, I use Matrix4 objects to compute some frame transformations as follows:

#pragma acc routine seq
float observation_likelihood(...) {
    ...
    
    const Matrix4 odom_to_robot_transform = utils::planar_pose_to_homogeneous_transformation_matrix(pose);
    const Matrix4 odom_to_laser_transform = odom_to_robot_transform*robot_to_laser_transform_matrix_;
    
    ...
}

The above instantiations of Matrix4 objects work well inside the observation_likelihood() device function when I run the code. But since the Matrix4 class uses data directives in its constructor, am I doing something that is “dangerous” even though it compiles and runs without any overt issues?

  2. When you say “I highly advise not allocating data on the device”, are you talking only about heap allocation, and not stack allocation? This is probably a stupid question, but do GPUs even distinguish between the stack and heap like host computers do?

  3. When you say “device-side mallocs being slow”, I noticed this when testing: using malloc within the above observation_likelihood() device function to create a 16-element heap-allocated float array (i.e. a 4x4 matrix) seemed to halve the update rate I was getting inside the device function. If I do need to malloc within a device function, is there a faster alternative way to do it?

  4. When you say “the default heap size is small, often leading to a heap overflow”, this surprised me quite a bit, as I thought that most of my GPU’s memory would be heap memory and only a few MB would be stack memory. Since I have a GPU with 2 GB of memory, what type of memory is this if it isn’t stack or heap memory?

  5. When you say “If you want to use this memory … across kernel calls, it will fail”, do you mean that if I have, for example, the following code:

void do_something() {
    ...
    
    #pragma acc parallel loop
    for(size_t i = 0; i < num_elements; ++i) {
        Matrix4 m1 = process_data();
    }
    
    ...
    
    #pragma acc parallel loop
    for(size_t i = 0; i < num_elements; ++i) {
        Matrix4 m2 = process_some_more_data(m1);
    }
    
    ...
}

it will fail because m1 will go out of scope after the end of the first parallel for loop?

Apologies for so many questions!

For #1, yes, I mean that data directives (enter data in this case) and compute constructs (parallel or kernels) can’t be used within device code, i.e. within other compute constructs or device routines. Technically the OpenACC spec allows for nested compute regions, but we don’t support this feature since it’s unlikely to be performant and we’ve not seen a compelling use case.

I haven’t tried returning a struct with a fixed-size array from a device routine, so I’m not 100% sure, but I think it should work. Though I’d be worried about the performance, since a copy of the struct would need to be made upon return, and if there were lots of threads, it could result in a stack overflow. The default device stack is fairly small (something like 32 MB). This can be increased via the environment variable NV_ACC_CUDA_STACKSIZE, but unlike the heap, it has a maximum.

Personally, I’d make this a method and pass “m3” in as an argument. Something like:

    #pragma acc routine seq
    void multiply_matrix4(const Matrix4 m2, Matrix4 &m3) {
        
        for(size_t i = 0; i < this->num_rows; ++i) {
            for(size_t j = 0; j < m2.num_cols; ++j) {
                m3(i,j) = 0.0f;
                for(size_t k = 0; k < this->num_cols; ++k) {
                    m3(i,j) += (*this)(i,k)*m2(k,j);
                }
            }
        }
    }
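
With the result passed by reference, the caller keeps everything on its own stack, e.g. (hypothetical usage):

    Matrix4 a, b, c;
    // ... fill a and b via operator() ...
    a.multiply_matrix4(b, c);  // result written into caller-provided c, no return copy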
  2. When you say “I highly advise not allocating data on the device”, are you talking only about heap allocation, and not stack allocation? This is probably a stupid question, but do GPUs even distinguish between the stack and heap like host computers do?

Yes, I mean heap allocation, but as noted above, the stack can be a concern as well. And yes, GPUs have both a stack and a heap.

  3. When you say “device-side mallocs being slow”, I noticed this when testing: using malloc within the above observation_likelihood() device function to create a 16-element heap-allocated float array (i.e. a 4x4 matrix) seemed to halve the update rate I was getting inside the device function. If I do need to malloc within a device function, is there a faster alternative way to do it?

My understanding is that device-side mallocs get serialized, hence the slowdown.

While we haven’t gotten it to work well with OpenACC yet, Alpaka (https://github.com/alpaka-group/alpaka) uses a device-side pool allocator for its CUDA version. It’s a bit tricky, since you need to pre-compute how much memory to size the pool with and use mutex locking to serve pages, but you might be able to simplify the idea for what you need.
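
To give a very rough flavor of the idea (the names and structure here are made up, and a real pool allocator would also need alignment handling and a way to reset or free pages), a bump-style pool might look like:

struct Pool {
    char  *buf;      // backing storage, allocated once from the host
    size_t capacity; // total bytes in the pool
    size_t offset;   // next free byte, bumped atomically by each thread
};

#pragma acc routine seq
inline void* pool_alloc(Pool *p, size_t bytes) {
    size_t old;
    // Atomically claim 'bytes' bytes by bumping the shared offset.
    #pragma acc atomic capture
    { old = p->offset; p->offset += bytes; }
    // Out of pool space: a real allocator would need a fallback strategy.
    return (old + bytes <= p->capacity) ? (void*)(p->buf + old) : NULL;
}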

  4. When you say “the default heap size is small, often leading to a heap overflow”, this surprised me quite a bit, as I thought that most of my GPU’s memory would be heap memory and only a few MB would be stack memory. Since I have a GPU with 2 GB of memory, what type of memory is this if it isn’t stack or heap memory?

The heap size is fixed and needs to be set before the CUDA context is created. I don’t have insight into the CUDA implementation details to say why it’s this way, so I can only refer you to the CUDA docs:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#heap-memory-allocation
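
For reference, the CUDA-runtime equivalent of the NV_ACC_CUDA_HEAPSIZE environment variable is cudaDeviceSetLimit; the wrapper function name and the 256 MB value below are just illustrative:

#include <cuda_runtime.h>

// Request a 256 MB device malloc heap. This must run before the first
// kernel launch, since that's when the CUDA context gets created.
void set_device_heap_size() {
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256u * 1024u * 1024u);
}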

  5. When you say “If you want to use this memory … across kernel calls, it will fail”, do you mean that if I have, for example, the following code

Yes, that’s what I meant, though I’m now second-guessing myself. I can’t find anything in the CUDA docs that says whether or not device-allocated memory is persistent across kernel invocations, so I could be wrong.

Now, what you wrote definitely wouldn’t work simply due to scoping, but I’m wondering if something like the following would. Granted, I haven’t tried it, since it would be simpler to allocate the device data from the host, but just in case it’s helpful:

void do_something() {
    ...
    Matrix4 * m1;
    #pragma acc parallel loop
    for(size_t i = 0; i < num_elements; ++i) {
        m1 = process_data();  // returns a device allocated pointer
    }
    
    ...
    
    #pragma acc parallel loop deviceptr(m1)  // need to use deviceptr since m1 is not managed by the OpenACC runtime
    for(size_t i = 0; i < num_elements; ++i) {
        Matrix4 m2 = process_some_more_data(m1);
    }
    
    ...
}

Hmm, I still feel like I’m not quite getting a few things.

  1. Apologies for belaboring the point, but coming back to your statement where you said “I highly advise not allocating data on the device”: surely if I allocate data on the host and then copy it over to the device using the copy or copyin clauses (or the update device directive), this would be less efficient than allocating the data directly on the device to begin with (especially if the data is only ever used on the device)?

  2. Regarding the first paragraph of your above response (the paragraph that starts with “For #1, yes…”), since the observation_likelihood() function is a device function (i.e. it’s been decorated with #pragma acc routine seq), and this function instantiates 2 Matrix4 objects whose constructors look like this:

Matrix4() {
    #pragma acc enter data copyin(this)
}

then wouldn’t I get an error during compilation or runtime since you mentioned that data directives can’t be used on device code (and observation_likelihood() is device code)?

  3. Regarding the second paragraph of your above response (the paragraph starting with “I haven’t tried returning…”), I’m assuming that I’d only need to worry about a stack overflow issue if I have a huge number of parallel calls? If I consider the code snippet from the observation_likelihood() device function, I need to instantiate 2 Matrix4 objects for each call of the observation_likelihood() device function. Thus, for example, if I need to make 10000 parallel function calls to observation_likelihood(), I’d need to instantiate 20000 Matrix4 objects. Since each Matrix4 object is about 72 bytes in size (2 longs and 4*4 = 16 floats), 20000 Matrix4 objects would be 1440000 bytes, which is 1.44 MB (I could have easily botched this calculation with my lack of understanding of how data structures are stored in memory, so correct me if I’m wrong).

  4. Is there a tool that allows me to see GPU stack and heap usage during, or after, program execution (I’m assuming nvprof could do something like this)? It would be really interesting to see this information.

Consider hundreds of thousands of threads all blocked calling a serialized malloc. It’s not going to perform well, nor would it be more efficient than allocating the memory once from the host and then reusing it. You don’t necessarily need to copy the data; instead you can use the “create” clause and then initialize it on the device.
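
A sketch of that create-then-initialize pattern (the function name and element count N here are just for illustration):

void init_on_device(const size_t N) {
    float *data = (float*)malloc(N*sizeof(float));  // host allocation
    #pragma acc enter data create(data[0:N])        // device allocation only, no copy
    #pragma acc parallel loop present(data[0:N])
    for(size_t i = 0; i < N; ++i) {
        data[i] = 0.0f;                             // initialize directly on the device
    }
}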

Granted, most of my experience here is with Fortran automatics, which perform implicit allocation. They absolutely kill performance due to the device-side allocation.

You are more than welcome to try device side allocation and maybe for your application it will be fine, but I do stand by my advice.

then wouldn’t I get an error during compilation or runtime since you mentioned that data directives can’t be used on device code

Not necessarily, since it’s contextual. Since these are subroutines, the same code can have both a device and a host version, so the compiler should just ignore these directives in the device version in this case. The compiler issue here is that it’s not retaining the data directives for the host version.

I’m assuming that I’d only need to worry about a stack overflow issue if I have a huge number of parallel calls?

Yes, though 10,000 threads is relatively small. A V100 has 80 SMs, each capable of running 2048 concurrent threads, for a total of 163,840. So you may be limiting your program’s ability to maximize GPU utilization. Granted, there are many factors that influence overall performance, so having lower occupancy due to too few threads is not necessarily bad.

Is there a tool that allows me to see GPU stack and heap usage during, or after, program execution (I’m assuming nvprof could do something like this)? It would be really interesting to see this information.

I’ve not seen anything like this in Nsight Systems or Nsight Compute (the successors to nvprof, which has been deprecated). cuda-memcheck may be the closest to what you want, but I don’t think it has any tracing capabilities or reports about stack/heap usage; it can just check for memory leaks as well as heap and stack overflows. https://developer.nvidia.com/cuda-memcheck
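
For example, assuming your binary is named matrix_tests, the leak-check mode would be invoked as:

% cuda-memcheck --leak-check full ./matrix_tests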