Strange issues with OpenACC data and loop directives in C++ classes

I wrote a small program to test out a few issues I was having with OpenACC and C++ classes. The program creates a C-array called data_ on the heap, fills each element of the array with the same value equal to 1.0, copies the class member variables into GPU memory, and then tries to print out some values from the C-array using printf() statements in the process_data() function. The code is as follows:

#include <cstddef>
#include <stdio.h>

class DataProcessor {
    public:
        DataProcessor();
        ~DataProcessor();

        void process_data();

    private:
        unsigned int num_rows_;
        unsigned int num_cols_;
        double *data_;
};

DataProcessor::DataProcessor() {
    unsigned int num_rows_ = 10;
    unsigned int num_cols_ = 10;
    data_ = new double[num_rows_*num_cols_];

    for(std::size_t i = 0; i < num_rows_; ++i) {
        for(std::size_t j = 0; j < num_cols_; ++j) {
            data_[i*num_rows_ + j] = 1.0;
        }
    }

    #pragma acc enter data copyin(this)
    #pragma acc enter data copyin(num_rows_)
    #pragma acc enter data copyin(num_cols_)
    #pragma acc enter data copyin(data_[0:num_rows_*num_cols_])
}

DataProcessor::~DataProcessor() {
    #pragma acc exit data delete(num_rows_)
    #pragma acc exit data delete(num_cols_)
    #pragma acc exit data delete(data_[0:num_rows_*num_cols_])
    #pragma acc exit data delete(this)
    delete[] data_;
}

void DataProcessor::process_data() {

    int num_beams = 1;
    int count = 0;
    #pragma acc data copyin(num_beams) copy(count)
    {
        #pragma acc parallel loop
        for(std::size_t i = 0; i < num_beams; ++i) {
            count++;
            printf("data_[0]: %f \n", data_[0]);
        }
    }

    printf("count = %i \n", count);
}

int main() {
    DataProcessor data_processor;

    data_processor.process_data();

    return 0;
}

It can be compiled with the following command:

pgc++ -g -acc -ta=tesla -Minfo=accel data_processor.cpp

Looking at the output from the process_data() function when running the program, I’ve noticed that a few strange issues occur:

  1. Even though the count variable is copied into and out of the OpenACC data region using copy(count), its value is still 0 when it is printed at the end of the function using printf("count = %i \n", count). However, I had expected it to be 1 since num_beams = 1 and thus the for loop is only iterated once, incrementing count by 1.
  2. The num_beams variable is set to 1, copied into the OpenACC data region, and then used in the subsequent for loop. However, when looking at what is printed to the console by printf("data_[0]: %f \n", data_[0]), I see data_[0]: 1.000000 printed 10 times instead of only once. When I instead replace for(std::size_t i = 0; i < num_beams; ++i) with for(std::size_t i = 0; i < 1; ++i), I see data_[0]: 1.000000 printed only once, as expected. Since num_beams = 1, I would have expected the same output in both cases.
  3. When the program ends, I sometimes get an error similar to the following:

(null) lives at 0x19c5e70 size 10296506880 partially present
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 6.1, threadid=1
host:0x19c5e70 device:0x7f991ecfa600 size:800 presentcount:0+1 line:34 name:(null)
host:0x7ffdd7cda708 device:0x7f991ecfa400 size:4 presentcount:0+1 line:34 name:num_cols_
host:0x7ffdd7cda70c device:0x7f991ecfa200 size:4 presentcount:0+1 line:34 name:num_rows_
allocated block device:0x7f991ecfa200 size:512 thread:1
allocated block device:0x7f991ecfa400 size:512 thread:1
allocated block device:0x7f991ecfa600 size:1024 thread:1
deleted block device:0x7f991ecfa000 size:512 threadid=1
FATAL ERROR: variable in data clause is partially present on the device: name=(unknown)
file:/home/alex/Desktop/OpenACC Tests/data_processor.cpp _ZN13DataProcessorD1Ev line:39

and judging by the last line in the error, it occurs due to the last line in the destructor, delete[] data_;. However, since data_ is deleted from the GPU with #pragma acc exit data delete(data_[0:num_rows_*num_cols_]) before delete[] data_ is called, I’m surprised an error occurs at all. Furthermore, I’m surprised that the error only occurs sometimes, and not deterministically each time the program is run.

So far I’ve been left scratching my head at the above-mentioned behavior, and was wondering if there’s something that I’m doing incorrectly that is causing such issues. In case it may be of use, I’m running the code on Ubuntu 18.04, and pgc++ --version gives me the following compiler information:

pgc++ (aka nvc++) 20.11-0 LLVM 64-bit target on x86-64 Linux -tp haswell 
PGI Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.

Any help would be appreciated, thanks.

Hi jeffr1992,

I could only reproduce the partially present error, which is caused by the copyin of the scalar data members. They are already copied in when you copyin “this”, so copying them in again can trigger the error. A data directive performs a shallow copy, so when you copy an aggregate type, all of its members are copied in. For pointer members, however, only the pointer itself is copied, so you still need the copyin of “data_” so that the memory it points to gets copied in as well.

The data region that copies in num_beams is unnecessary: it is a scalar (scalars are firstprivate by default), and it is only used as the loop bound, not in the loop body. You don’t need to copy “count” either, since reduction variables are implicitly copied for you.

I’d suggest adding “default(present)” to your compute construct so the compiler doesn’t add an implicit copy. Finally, I added a reduction clause for completeness; the NV HPC compiler can auto-detect reductions, but other compilers can’t always.

% cat test.cpp
#include <cstddef>
#include <stdio.h>

class DataProcessor {
    public:
        DataProcessor();
        ~DataProcessor();

        void process_data();

    private:
        unsigned int num_rows_;
        unsigned int num_cols_;
        double *data_;
};

DataProcessor::DataProcessor() {
    unsigned int num_rows_ = 10;
    unsigned int num_cols_ = 10;
    data_ = new double[num_rows_*num_cols_];

    for(std::size_t i = 0; i < num_rows_; ++i) {
        for(std::size_t j = 0; j < num_cols_; ++j) {
            data_[i*num_rows_ + j] = 1.0;
        }
    }

    #pragma acc enter data copyin(this)
    #pragma acc enter data copyin(data_[0:num_rows_*num_cols_])
}

DataProcessor::~DataProcessor() {
    #pragma acc exit data delete(data_[0:num_rows_*num_cols_])
    #pragma acc exit data delete(this)
    delete[] data_;
}

void DataProcessor::process_data() {

    int num_beams = 1;
    int count = 0;
    #pragma acc parallel loop default(present) reduction(+:count)
    for(std::size_t i = 0; i < num_beams; ++i) {
         count++;
         printf("data_[0]: %f \n", data_[0]);
    }

    printf("count = %i \n", count);
}

int main() {
    DataProcessor data_processor;

    data_processor.process_data();

    return 0;
}
% nvc++ -acc -Minfo=accel test.cpp -o test
DataProcessor::DataProcessor():
     30, Generating enter data copyin(this[:1],data_[:num_cols_*num_rows_])
DataProcessor::~DataProcessor():
     35, Generating exit data delete(this[:1],data_[:num_cols_*num_rows_])
DataProcessor::process_data():
     41, Generating Tesla code
         43, #pragma acc loop gang /* blockIdx.x */
             Generating reduction(+:count)
     41, Generating default present(this[:])
         Generating implicit copy(count) [if not already present]
% ./test
data_[0]: 1.000000
count = 1

Hope this helps,
Mat

Thanks very much Mat, removing

#pragma acc enter data copyin(num_rows_)
#pragma acc enter data copyin(num_cols_)

in the constructor and removing

#pragma acc exit data delete(num_rows_)
#pragma acc exit data delete(num_cols_)

from the destructor solved issues (1) and (2), but surprisingly issue (3) is still present. Thanks also for the suggestion of adding the reduction for count to the parallel loop construct, I’d completely missed that.

Regarding issue (3) still occurring, I copied the code you’d written and compiled it as you had. Here’s the output from my terminal of compiling and running the program a few times. As can be seen, the error occurs on the seventh run of the program.

(base) alex@alex-MACH-WX9:~/Desktop/OpenACC Tests$ nvc++ -acc -Minfo=accel test.cpp -o test
DataProcessor::DataProcessor():
     30, Generating enter data copyin(this[:1],data_[:num_cols_*num_rows_])
DataProcessor::~DataProcessor():
     35, Generating exit data delete(this[:1],data_[:num_cols_*num_rows_])
DataProcessor::process_data():
     41, Generating Tesla code
         43, #pragma acc loop gang /* blockIdx.x threadIdx.x */
             Generating reduction(+:count)
     41, Generating default present(this[:])
         Generating implicit copy(count) [if not already present]
(base) alex@alex-MACH-WX9:~/Desktop/OpenACC Tests$ ./test
data_[0]: 1.000000 
count = 1 
(base) alex@alex-MACH-WX9:~/Desktop/OpenACC Tests$ ./test
data_[0]: 1.000000 
count = 1 
(base) alex@alex-MACH-WX9:~/Desktop/OpenACC Tests$ ./test
data_[0]: 1.000000 
count = 1 
(base) alex@alex-MACH-WX9:~/Desktop/OpenACC Tests$ ./test
data_[0]: 1.000000 
count = 1 
(base) alex@alex-MACH-WX9:~/Desktop/OpenACC Tests$ ./test
data_[0]: 1.000000 
count = 1 
(base) alex@alex-MACH-WX9:~/Desktop/OpenACC Tests$ ./test
data_[0]: 1.000000 
count = 1 
(base) alex@alex-MACH-WX9:~/Desktop/OpenACC Tests$ ./test
data_[0]: 1.000000 
count = 1 
(null) lives at 0x1c64e70 size 8853531392 partially present
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 6.1, threadid=1
host:0x1c64e70 device:0x7f2816cfa200 size:800 presentcount:0+1 line:30 name:(null)
host:0x7ffc59574d40 device:0x7f2816cfa000 size:16 presentcount:0+1 line:30 name:_T17231720_1142
allocated block device:0x7f2816cfa000 size:512 thread:1
allocated block device:0x7f2816cfa200 size:1024 thread:1
deleted block   device:0x7f2816cfa600 size:512 threadid=1 
FATAL ERROR: variable in data clause is partially present on the device: name=(unknown)
 file:/home/alex/Desktop/OpenACC Tests/test.cpp _ZN13DataProcessorD1Ev line:35

Hi Mat, in case you’re available to answer a follow-on to this question, I’ve posted an additional question at Calling object function in another object's function causes OpenACC code to crash

For good or bad, I’m not able to reproduce this issue even after running the binary over 20 times.

It is odd since the partially present error is coming after the print and is occurring for “data_”. The very large size seems to indicate that something got corrupted, but I’m not sure. Technically the array shape shouldn’t be added when deleting data, so you can try removing it, but the compiler usually ignores it, so I doubt it’s the problem. Worth a try though.

#pragma acc exit data delete(data_)

If it still occurs, then let’s try setting the environment variable “NV_ACC_NOTIFY=16” to see the device allocations and deallocations. If that doesn’t show anything unusual, run a failing case with full debug enabled by setting “NV_ACC_DEBUG=1”. Debug can print quite a bit of output but gives some clues.

I’ll look at your second post in a few.

Following up on this, I tried re-compiling the same test.cpp program today with nvc++ -acc -Minfo=accel test.cpp -o test (i.e. I didn’t even open the code in a text editor, I simply re-compiled the exact same .cpp file), and it now seems to work without the error, even after running ./test quite a few times. I’m not too sure if I should be glad or concerned, but at least it’s gone I suppose.

That is odd. Maybe it’s an intermittent hardware issue?