I wrote a small program to investigate a few issues I was having with OpenACC and C++ classes. The program creates a C array called data_array on the heap, fills every element with the same value, copies the class's member variables into GPU memory, and then prints the first value of the array using printf() statements inside a parallel loop in the process_data() function. The code is as follows:
#include <stdio.h>
//
// Class that is used to fill a heap allocated array with values, and then print out the first element multiple times in a parallel loop
//
class DataProcessor {
public:
    DataProcessor();
    DataProcessor(unsigned int num_rows, unsigned int num_cols, double initial_value);
    ~DataProcessor();
    void process_data();
private:
    unsigned int num_rows_;
    unsigned int num_cols_;
    double *data_array;
};
DataProcessor::DataProcessor() {}
DataProcessor::DataProcessor(unsigned int num_rows, unsigned int num_cols, double initial_value) {
    num_rows_ = num_rows;
    num_cols_ = num_cols;
    data_array = new double[num_rows_*num_cols_];
    for(size_t i = 0; i < num_rows_; ++i) {
        for(size_t j = 0; j < num_cols_; ++j) {
            // Row-major index: the stride between rows is num_cols_
            data_array[i*num_cols_ + j] = initial_value;
        }
    }
    // Copy the object itself, then the array it points to, to the device
    #pragma acc enter data copyin(this)
    #pragma acc enter data copyin(data_array[0:num_rows_*num_cols_])
}
DataProcessor::~DataProcessor() {
    // Release the device copies in reverse order, then the host array
    #pragma acc exit data delete(data_array[0:num_rows_*num_cols_])
    #pragma acc exit data delete(this)
    delete[] data_array;
}
void DataProcessor::process_data() {
    int num_beams = 5;
    #pragma acc data copyin(num_beams)
    {
        printf("data_array[0] in structured data region: %f \n", data_array[0]);
        #pragma acc parallel loop
        for(size_t i = 0; i < num_beams; ++i) {
            printf("data_array[0] in parallel loop: %f \n", data_array[0]);
        }
    }
}
int main() {
    // 1) This test works as expected
    printf("Test 1:\n");
    unsigned int num_rows = 10;
    unsigned int num_cols = 10;
    double initial_value = 888.0;
    DataProcessor data_processor(num_rows, num_cols, initial_value);
    data_processor.process_data();
}
It can be compiled with the following command:
pgc++ -g -acc -ta=tesla -Minfo=accel data_processor.cpp
Now, if I create another class named DataProcessorCaller that instantiates a DataProcessor object in its constructor and then calls that object's process_data() function from its own call_process_data() function (apologies for such a contrived example), as follows:
//
// Class that is used to call a DataProcessor object's process_data() function
//
class DataProcessorCaller {
public:
    DataProcessorCaller();
    ~DataProcessorCaller();
    void call_process_data();
private:
    DataProcessor data_processor_;
};
DataProcessorCaller::DataProcessorCaller() {
    unsigned int num_rows = 10;
    unsigned int num_cols = 10;
    double initial_value = 888.0;
    data_processor_ = DataProcessor(num_rows, num_cols, initial_value);
}
DataProcessorCaller::~DataProcessorCaller() {}
void DataProcessorCaller::call_process_data() {
    data_processor_.process_data();
}
int main() {
    // 2) This test breaks with the following error: "call to cuStreamSynchronize returned error 700: Illegal address during kernel execution"
    printf("Test 2:\n");
    DataProcessorCaller data_processor_caller;
    data_processor_caller.call_process_data();
    return 0;
}
the program crashes with the following error:
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
Running cuda-gdb on the program shows that it crashes at line 54:
printf("data_array[0] in parallel loop: %f \n", data_array[0]);
which is the printf() call in the body of the parallel for loop inside the process_data() function.
As such, I was wondering why calling data_processor.process_data() in the first program works as expected, whereas calling data_processor_caller.call_process_data() in the second program, which essentially just calls data_processor_.process_data(), causes the program to crash. From the crash output, it seems that the kernel launched through the DataProcessorCaller object cannot access the DataProcessor members that were copied to GPU memory in the DataProcessor constructor, but a better explanation than my naive understanding would be appreciated.
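One thing I have not ruled out is object lifetime. In the second program the three-argument constructor runs on a temporary, which is then assigned to data_processor_ and destroyed, so the `this` seen by the constructor (and by its `enter data` directives) is not the address of the member. The sketch below is plain C++ with no OpenACC, and the names Tracker, Holder and temporary_was_constructed_and_destroyed are hypothetical stand-ins I made up for illustration. It only shows the C++ behavior; whether this is what actually causes the illegal address on the GPU is my assumption.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for DataProcessor: records the address of every
// object whose value constructor or destructor runs.
struct Tracker {
    static std::vector<const void*> constructed;  // `this` seen in Tracker(int)
    static std::vector<const void*> destroyed;    // `this` seen in ~Tracker()
    int value = 0;

    Tracker() {}
    explicit Tracker(int v) : value(v) { constructed.push_back(this); }
    ~Tracker() { destroyed.push_back(this); }
};
std::vector<const void*> Tracker::constructed;
std::vector<const void*> Tracker::destroyed;

// Hypothetical stand-in for DataProcessorCaller: the member is default
// constructed first, then a temporary is assigned to it in the body.
struct Holder {
    Tracker member;
    Holder() { member = Tracker(888); }
};

// Returns true if Tracker(int) ran on an object other than the member and
// that object was already destroyed when the Holder constructor returned.
bool temporary_was_constructed_and_destroyed() {
    Tracker::constructed.clear();
    Tracker::destroyed.clear();
    Holder holder;
    const void* member_address = &holder.member;
    bool ctor_ran_on_other_object =
        Tracker::constructed.size() == 1 &&
        Tracker::constructed[0] != member_address;
    bool temporary_already_destroyed =
        Tracker::destroyed.size() == 1 &&
        Tracker::destroyed[0] == Tracker::constructed[0];
    return ctor_ran_on_other_object && temporary_already_destroyed;
}
```

Calling temporary_was_constructed_and_destroyed() from main() returns true. If the same thing happens with DataProcessor, the temporary's destructor would run `exit data delete` and `delete[] data_array` before process_data() is ever called, leaving data_processor_ with a copied pointer to freed memory. Constructing the member in the constructor's initializer list instead of assigning a temporary would avoid the extra object, though I have not verified that this fixes the OpenACC crash.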