Calling object function in another object's function causes OpenACC code to crash

I wrote a small program to test out a few issues I was having with OpenACC and C++ classes. The program creates a C-array called data_array on the heap, fills each element of the array with the same value, copies the class member variables into GPU memory, and then prints out the first value from the C-array using printf() statements inside a parallel loop in the process_data() function. The code is as follows:

#include <stdio.h>

//
// Class that is used to fill a heap allocated array with values, and then print out the first element multiple times in a parallel loop
//
class DataProcessor {
    public:
        DataProcessor();
        DataProcessor(unsigned int num_rows, unsigned int num_cols, double initial_value);
        ~DataProcessor();

        void process_data();

    private:
        unsigned int num_rows_;
        unsigned int num_cols_;
        double *data_array;
};

DataProcessor::DataProcessor() {}

DataProcessor::DataProcessor(unsigned int num_rows, unsigned int num_cols, double initial_value) {
    num_rows_ = num_rows;
    num_cols_ = num_cols;
    data_array = new double[num_rows_*num_cols_];

    for(size_t i = 0; i < num_rows_; ++i) {
        for(size_t j = 0; j < num_cols_; ++j) {
            data_array[i*num_rows_ + j] = initial_value;
        }
    }

    #pragma acc enter data copyin(this)
    #pragma acc enter data copyin(data_array[0:num_rows_*num_cols_])
}

DataProcessor::~DataProcessor() {
    #pragma acc exit data delete(data_array[0:num_rows_*num_cols_])
    #pragma acc exit data delete(this)
    delete[] data_array;
}

void DataProcessor::process_data() {

    int num_beams = 5;
    #pragma acc data copyin(num_beams)
    {
        printf("data_array[0] in structured data region: %f \n", data_array[0]);

        #pragma acc parallel loop
        for(size_t i = 0; i < num_beams; ++i) {
            printf("data_array[0] in parallel loop: %f \n", data_array[0]);
        }
    }
}

int main() {
    // 1) This test works as expected
    printf("Test 1:\n");
    unsigned int num_rows = 10;
    unsigned int num_cols = 10;
    double initial_value = 888.0;

    DataProcessor data_processor(num_rows, num_cols, initial_value);
    data_processor.process_data();
}

And can be compiled with the following command:
pgc++ -g -acc -ta=tesla -Minfo=accel data_processor.cpp

Now, if I create another class named DataProcessorCaller that instantiates a DataProcessor object in its constructor, and then calls the DataProcessor object’s process_data() function in its call_process_data() function (apologies for such a contrived example) as follows:

//
// Class that is used to call a DataProcessor object's process_data() function
//
class DataProcessorCaller {
    public:
        DataProcessorCaller();
        ~DataProcessorCaller();

        void call_process_data();

    private:
        DataProcessor data_processor_;
};

DataProcessorCaller::DataProcessorCaller() {
    unsigned int num_rows = 10;
    unsigned int num_cols = 10;
    double initial_value = 888.0;
    
    data_processor_ = DataProcessor(num_rows, num_cols, initial_value);
}

DataProcessorCaller::~DataProcessorCaller() {}

void DataProcessorCaller::call_process_data() {
    data_processor_.process_data();
}

int main() {
    // 2) This test breaks with the following error: "call to cuStreamSynchronize returned error 700: Illegal address during kernel execution"
    printf("Test 2:\n");
    DataProcessorCaller data_processor_caller;
    data_processor_caller.call_process_data();

    return 0;
}

the program crashes with the following error:
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Running cuda-gdb on the program, I see that it crashes at line 54:
printf("data_array[0] in parallel loop: %f \n", data_array[0]);
which is the for loop inside the process_data() function.

As such, I was wondering why calling data_processor.process_data() in the first program works as expected, whereas calling data_processor_caller.call_process_data() in the second program, which essentially calls data_processor.process_data() inside of it, causes the program to crash. From the crash output it seems like the DataProcessorCaller object does not have access to the DataProcessor object’s members that were copied to GPU memory in the DataProcessor object’s constructor, but a better explanation than my naive understanding would be appreciated.

It seems to work for me after I combined the two files and cleaned it up a bit:

% cat test3.cpp
#include <stdio.h>

//
// Class that is used to fill a heap allocated array with values, and then print out the first element multiple times in a parallel loop
//
class DataProcessor {
    public:
        DataProcessor();
        DataProcessor(unsigned int num_rows, unsigned int num_cols, double initial_value);
        ~DataProcessor();

        void process_data();

    private:
        unsigned int num_rows_;
        unsigned int num_cols_;
        double *data_array;
};

DataProcessor::DataProcessor() {}

DataProcessor::DataProcessor(unsigned int num_rows, unsigned int num_cols, double initial_value) {
    num_rows_ = num_rows;
    num_cols_ = num_cols;
    data_array = new double[num_rows_*num_cols_];

    for(size_t i = 0; i < num_rows_; ++i) {
        for(size_t j = 0; j < num_cols_; ++j) {
            data_array[i*num_rows_ + j] = initial_value;
        }
    }

    #pragma acc enter data copyin(this)
    #pragma acc enter data copyin(data_array[0:num_rows_*num_cols_])
}

DataProcessor::~DataProcessor() {
    #pragma acc exit data delete(data_array)
    #pragma acc exit data delete(this)
    delete[] data_array;
}

void DataProcessor::process_data() {

    int num_beams = 5;
    printf("data_array[0] in structured data region: %f \n", data_array[0]);

    #pragma acc parallel loop default(present)
    for(size_t i = 0; i < num_beams; ++i) {
          printf("data_array[0] in parallel loop: %f \n", data_array[0]);
    }
}

//
// Class that is used to call a DataProcessor object's process_data() function
//
class DataProcessorCaller {
    public:
        DataProcessorCaller();
        ~DataProcessorCaller();

        void call_process_data();

    private:
        DataProcessor data_processor_;
};

DataProcessorCaller::DataProcessorCaller() {
    unsigned int num_rows = 10;
    unsigned int num_cols = 10;
    double initial_value = 888.0;

    data_processor_ = DataProcessor(num_rows, num_cols, initial_value);
}

DataProcessorCaller::~DataProcessorCaller() {}

void DataProcessorCaller::call_process_data() {
    data_processor_.process_data();
}


int main() {
    // 1) This test works as expected
    printf("Test 1:\n");
    unsigned int num_rows = 10;
    unsigned int num_cols = 10;
    double initial_value = 888.0;

    DataProcessor data_processor(num_rows, num_cols, initial_value);
    data_processor.process_data();
}
% nvc++ -Minfo=accel test3.cpp -o test -V20.11 -acc -g; ./test
DataProcessor::DataProcessor(unsigned int, unsigned int, double):
     35, Generating enter data copyin(this[:1],data_array[:num_cols_*num_rows_])
DataProcessor::~DataProcessor():
     40, Generating exit data delete(this[:1],data_array[:1])
DataProcessor::process_data():
     46, Generating Tesla code
         49, #pragma acc loop gang, vector(5) /* blockIdx.x threadIdx.x */
     46, Generating default present(this[:])
Test 1:
data_array[0] in structured data region: 888.000000
data_array[0] in parallel loop: 888.000000
data_array[0] in parallel loop: 888.000000
data_array[0] in parallel loop: 888.000000
data_array[0] in parallel loop: 888.000000
data_array[0] in parallel loop: 888.000000

Hi Mat, apologies I probably should have explained this a bit better. What I was trying to explain in the first test was the following: If I create a DataProcessor object called data_processor and then call data_processor.process_data(), then values of the data_array are printed out as expected.

However, in the second test, when creating a DataProcessorCaller object called data_processor_caller and then calling data_processor_caller.call_process_data(), the program crashes. This crash is confusing since the call_process_data() function essentially just calls the process_data() function of the DataProcessor object that is instantiated inside the DataProcessorCaller object.

Gosh, reading the paragraph above sounds confusing. I’ve copied the full program below, which contains the above-mentioned two tests in the main() function:

#include <stdio.h>

//
// Class that is used to fill a heap allocated array with values, and then print out the first element multiple times in a parallel loop
//
class DataProcessor {
    public:
        DataProcessor();
        DataProcessor(unsigned int num_rows, unsigned int num_cols, double initial_value);
        ~DataProcessor();

        void process_data();

    private:
        unsigned int num_rows_;
        unsigned int num_cols_;
        double *data_array;
};

DataProcessor::DataProcessor() {}

DataProcessor::DataProcessor(unsigned int num_rows, unsigned int num_cols, double initial_value) {
    num_rows_ = num_rows;
    num_cols_ = num_cols;
    data_array = new double[num_rows_*num_cols_];

    for(size_t i = 0; i < num_rows_; ++i) {
        for(size_t j = 0; j < num_cols_; ++j) {
            data_array[i*num_rows_ + j] = initial_value;
        }
    }

    #pragma acc enter data copyin(this)
    #pragma acc enter data copyin(data_array[0:num_rows_*num_cols_])
}

DataProcessor::~DataProcessor() {
    #pragma acc exit data delete(data_array[0:num_rows_*num_cols_])
    #pragma acc exit data delete(this)
    delete[] data_array;
}

void DataProcessor::process_data() {

    int num_beams = 5;
    #pragma acc data copyin(num_beams)
    {
        printf("data_array[0] in structured data region: %f \n", data_array[0]);

        #pragma acc parallel loop
        for(size_t i = 0; i < num_beams; ++i) {
            printf("data_array[0] in parallel loop: %f \n", data_array[0]);
        }
    }
}

//
// Class that is used to call the process_data() function of a DataProcessor object, from its call_process_data() function
//
class DataProcessorCaller {
    public:
        DataProcessorCaller();
        ~DataProcessorCaller();

        void call_process_data();

    private:
        DataProcessor data_processor_;
};

DataProcessorCaller::DataProcessorCaller() {
    unsigned int num_rows = 10;
    unsigned int num_cols = 10;
    double initial_value = 888.0;
    
    data_processor_ = DataProcessor(num_rows, num_cols, initial_value);
}

DataProcessorCaller::~DataProcessorCaller() {}

void DataProcessorCaller::call_process_data() {
    data_processor_.process_data();
}

int main() {
    // 1) This test works as expected
    printf("Test 1:\n");
    unsigned int num_rows = 10;
    unsigned int num_cols = 10;
    double initial_value = 888.0;

    DataProcessor data_processor(num_rows, num_cols, initial_value);
    data_processor.process_data();

    // 2) This test breaks with the following error: "call to cuStreamSynchronize returned error 700: Illegal address during kernel execution"
    printf("Test 2:\n");
    DataProcessorCaller data_processor_caller;
    data_processor_caller.call_process_data();

    return 0;
}

Test 1 should run correctly, whereas test 2 should crash the program. This is the output I get when compiling and running the above program:

(base) alex@alex-MACH-WX9:~/Desktop/OpenACC Tests$ pgc++ -g -acc -ta=tesla -Minfo=accel data_processor.cpp
DataProcessor::DataProcessor(unsigned int, unsigned int, double):
     37, Generating enter data copyin(this[:1],data_array[:num_cols_*num_rows_])
DataProcessor::~DataProcessor():
     42, Generating exit data delete(this[:1],data_array[:num_cols_*num_rows_])
DataProcessor::process_data():
     49, Generating copyin(num_beams) [if not already present]
     50, Generating Tesla code
         53, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     50, Generating implicit copyin(this[:]) [if not already present]
(base) alex@alex-MACH-WX9:~/Desktop/OpenACC Tests$ ./a.out 
Test 1:
data_array[0] in structured data region: 888.000000 
data_array[0] in parallel loop: 888.000000 
data_array[0] in parallel loop: 888.000000 
data_array[0] in parallel loop: 888.000000 
data_array[0] in parallel loop: 888.000000 
data_array[0] in parallel loop: 888.000000 
Test 2:
data_array[0] in structured data region: 0.000000 
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

The problem is the line:

data_processor_ = DataProcessor(num_rows, num_cols, initial_value);

The creates a new object then does a shallow copy of new object to data_processor_. Though the constructor called is for the new object, not data_processor_, hence data_processor_ isn’t actually copied to the device.

I would suggest adding an “init” routine that gets called instead which will initialize data_process_. Something like:

% cat test.cpp
#include <stdio.h>
#include <iostream>
#include <openacc.h>
#include <accel.h>

//
// Class that is used to fill a heap allocated array with values, and then print out the first element multiple times in a parallel loop
//
class DataProcessor {
    public:
        DataProcessor();
        DataProcessor(unsigned int num_rows, unsigned int num_cols, double initial_value);
        ~DataProcessor();
        void init(unsigned int num_rows, unsigned int num_cols, double initial_value);
        void process_data();

    private:
        unsigned int num_rows_;
        unsigned int num_cols_;
        double *data_array;
};

DataProcessor::DataProcessor() {}

DataProcessor::DataProcessor(unsigned int num_rows, unsigned int num_cols, double initial_value) {
    init(num_rows,num_cols,initial_value);
}

void DataProcessor::init(unsigned int num_rows, unsigned int num_cols, double initial_value) {
    num_rows_ = num_rows;
    num_cols_ = num_cols;
    data_array = new double[num_rows_*num_cols_];

std::cout << "---- In DataProcessor Init ----" << std::endl;
    for(size_t i = 0; i < num_rows_; ++i) {
        for(size_t j = 0; j < num_cols_; ++j) {
            data_array[i*num_rows_ + j] = initial_value;
        }
    }
    #pragma acc enter data copyin(this)
    #pragma acc enter data copyin(data_array[0:num_rows_*num_cols_])
}

DataProcessor::~DataProcessor() {
    #pragma acc exit data delete(data_array)
    #pragma acc exit data delete(this)
    delete[] data_array;
}

void DataProcessor::process_data() {

    int num_beams = 5;
    printf("data_array[0] in structured data region: %f \n", data_array[0]);

    #pragma acc parallel loop default(present)
    for(size_t i = 0; i < num_beams; ++i) {
        printf("data_array[0] in parallel loop: %f \n", data_array[0]);
    }
}

//
// Class that is used to call the process_data() function of a DataProcessor object, from its call_process_data() function
//
class DataProcessorCaller {
    public:
        DataProcessorCaller();
        ~DataProcessorCaller();

        void call_process_data();

    private:
        DataProcessor data_processor_;
};

DataProcessorCaller::DataProcessorCaller() {
    unsigned int num_rows = 10;
    unsigned int num_cols = 10;
    double initial_value = 888.0;
    data_processor_.init(num_rows, num_cols, initial_value);
}

DataProcessorCaller::~DataProcessorCaller() {}

void DataProcessorCaller::call_process_data() {
    data_processor_.process_data();
}

int main() {
    // 1) This test works as expected
    printf("Test 1:\n");
    unsigned int num_rows = 10;
    unsigned int num_cols = 10;
    double initial_value = 888.0;

    DataProcessor data_processor(num_rows, num_cols, initial_value);
    data_processor.process_data();

    // 2) This test breaks with the following error: "call to cuStreamSynchronize returned error 700: Illegal address during kernel execution"
    printf("Test 2:\n");
    DataProcessorCaller data_processor_caller;
    data_processor_caller.call_process_data();

    return 0;
}
% nvc++ -acc -Minfo=accel -Minfo=accel test.cpp; a.out
DataProcessor::~DataProcessor():
     47, Generating exit data delete(this[:1],data_array[:1])
DataProcessor::init(unsigned int, unsigned int, double):
     42, Generating enter data copyin(this[:1],data_array[:num_cols_*num_rows_])
DataProcessor::process_data():
     53, Generating Tesla code
         56, #pragma acc loop gang /* blockIdx.x */
     53, Generating default present(this[:])
Test 1:
---- In DataProcessor Init ----
data_array[0] in structured data region: 888.000000
data_array[0] in parallel loop: 888.000000
data_array[0] in parallel loop: 888.000000
data_array[0] in parallel loop: 888.000000
data_array[0] in parallel loop: 888.000000
data_array[0] in parallel loop: 888.000000
Test 2:
---- In DataProcessor Init ----
data_array[0] in structured data region: 888.000000
data_array[0] in parallel loop: 888.000000
data_array[0] in parallel loop: 888.000000
data_array[0] in parallel loop: 888.000000
data_array[0] in parallel loop: 888.000000
data_array[0] in parallel loop: 888.000000

Hope this helps,
Mat

Thanks Mat, this worked like a charm, although I’m still a little confused by your explanation of the solution since it feels like I’m essentially replacing the DataProcessor constructor with the init() function that does exactly the same thing. Do you have any links that explain the above concept in a little more detail, I’d love to be able to understand it better.

My understanding is that the original version when calling the constructor, you’re actually creating an anonymous object which is then copied to “data_processor_”. Hence you have two separate this pointers with anonymous object’s this pointer getting created on the device and then subsequently deleted in the implied destructor call.
data_processor’s this pointer doesn’t get created on the device in this scenario. By using the “init” routine, you’re ensuring data_processor_ is actually getting created on the device.

1 Like