Fatal error with `acc enter data`

Hi
I am a bit struggling with using a c++ class in an openacc data region. While trying to make an example to reproduce an error, I stumbled on another error that seems off to me:

#include <stdio.h>

struct A {
    A() {
        // printf("?\n");
        #pragma acc enter data copyin(this) create(x[0:100])
    }
    void updateDevice() {
        #pragma acc update device(x[0:100])
    }
    ~A() {
        #pragma acc exit data delete(x)
        #pragma acc exit data delete(this)
    }
    
    int x[100];
};

int main() {
    A a;
    for( auto &x: a.x ) x = 1;
    a.updateDevice();
    
    #pragma acc parallel loop gang present(a)
    for( int i=0; i<100; i++ ){
        printf("%d ", a.x[i]);
    }
    printf("\n");
    return 0;
}

This results in

FATAL ERROR: data in update device clause was not found on device 1: name=x[:]

But after uncommenting the printf in the constructor of A, it works! Somehow, the constructor optimization seems to have discarded the acc enter directives. Is this the expected behavior, or an error in compiler optimization?

Hi fredpz,

Likely since the struct member is static and there’s no executable code in the constructor, the constructor doesn’t get invoked. The compiler should probably be detecting that there’s an OpenACC pragma in there and keep the constructor.

However, for awhile now I’ve changed to moving the enter data directive into it’s own initialization routine. This makes it easier when there are class members and you need to do a deep copy given the device initialization can, if needed, be invoked separately from the constructor. Something like:

% cat test.cpp

#include <stdio.h>

struct A {
    A() {
        initDevice();
    }
    void initDevice() {
        #pragma acc enter data copyin(this) create(x[0:100])
    }
    void updateDevice() {
        #pragma acc update device(x[0:100])
    }
    ~A() {
        #pragma acc exit data delete(x)
        #pragma acc exit data delete(this)
    }

    int x[100];
};

int main() {
    A a;
    for( auto &x: a.x ) x = 1;
    a.updateDevice();

    #pragma acc parallel loop gang present(a)
    for( int i=0; i<100; i++ ){
        printf("%d ", a.x[i]);
    }
    printf("\n");
    return 0;
}

% nvc++ -acc  test.cpp ; a.out
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

-Mat

Ok thank you for the explanation. Now I am unable to reproduce another error in my program that has essentially the same structure as the snippet you corrected, but with more data, different types, separated header file, and many different operations in between. The error is

Present table errors:
this[:] lives at 0x2ea6630 size 576 partially present in
host:0x2ea6708 device:0x7f5475b01e00 size:256 presentcount:1+1 line:66 name:this[:1] file:.........
FATAL ERROR: variable in data clause is partially present on the device: name=this[:]
 file:.........

I looks like the object was transferred several times to the device even though the enter data copyin(this) is done only once. I verified by printing something inside initDevice.

Note that the error does not occur anymore if the object is not called within the loop (probably because the present clause is ignored).

Do you know what could be causing this error?

I finally managed to write a minimal reproducer:

#include <stdio.h>
#include <cstdlib>


struct A {
    A( size_t N ) : n( N ), x( new int[N] ) { initDevice(); }
    
    void initDevice() {
        #pragma acc enter data copyin(this)
        #pragma acc enter data create(x[0:n])
    }
    ~A() {
        #pragma acc exit data delete(x[0:n])
        #pragma acc exit data delete(this)
        delete x;
    }
    
    size_t n;
    int * x;
};

struct A_operator {
    void apply( A &a, int i ) {
        printf("%d ", a.x[i]);
    };
};

struct A_runner {
    A_runner( size_t N ) : a( N ) {};
    
    void applyOperator(){
        #pragma acc parallel loop gang present(a)
        for( int i=0; i<a.n; i++ ){
            aop.apply( a, i );
        }
        printf("\n");
    };
    
    A a;
    A_operator aop;
};

int main() {
    A_runner arun( 10 );
    arun.applyOperator();
    
    return 0;
}

The error is FATAL ERROR: variable in data clause is partially present on the device: name=this[:]. Apparently the compiler, when encountering the parallel section, produces an implicit create(this[:]) which probably allocates memory for the A_runner thus for its member a. But this variable was already allocated by its constructor.

I don’t understand why this implicit create is produced. Is there a way to prevent it?

Yes, this one is a bit tricky to understand,

To back-up a bit, the compiler runtime uses a “present” table that includes the host address, the corresponding device address, and the size in bytes of the device memory. A “partially present” error means that the size being looked up is larger than the size in found in the present table. Side note: we have a non-standard API call “acc_present_dump”, found in the “accel.h” header, which will show the contents of the present table.

The this pointer that’s triggering the error is A_runner’s this, not “a”. A_runner and it’s member “a” both have the same base host address. However “A_runner” is 24 bytes (because “aos” is included) while “a” is only 16 bytes. Hence the mismatch sizes and the error.

Parent classes/structs need to create the device data for their members, i.e. data creation is from the the top down. So here, you need to create “this” in A_runner’s constructor and the call a’s initDevice to create “x”. The init routine is optional in this case since “x” is public and you could instead add a “create(a.x[:n])” in A_runner’s constructor, but for encapsulation I still prefer using init.

For example:

#include <stdio.h>
#include <cstdlib>
#include <openacc.h>
#include <accel.h>

struct A {
    A( size_t N ) : n( N ), x( new int[N] ) {}

    void initDevice() {
        // no-op if the parent 'this' is already created, but keep in case A is used directly elsewhere
        #pragma acc enter data create(this)
        #pragma acc enter data create(x[0:n])
    }
    void deleteDevice() {
        #pragma acc exit data delete(x,this)
    }
    ~A() {
        delete x;
    }

    size_t n;
    int * x;
};

struct A_operator {
    void apply( A &a, int i ) {
        printf("%d ", a.x[i]);
    };
};

struct A_runner {
    A_runner( size_t N ) : a( N )  {
        #pragma acc enter data copyin(this)
        a.initDevice();
    };
    ~A_runner() {
        a.deleteDevice();
        #pragma acc exit data delete(this)
    }

    void applyOperator(){
        #pragma acc parallel loop gang present(a)
        for( int i=0; i<a.n; i++ ){
            aop.apply( a, i );
        }
        printf("\n");
    };

    A a;
    A_operator aop;
};

int main() {
    A_runner arun( 10 );
    arun.applyOperator();

    return 0;
}

And the compile and run:

% nvc++ -acc -Minfo=accel test.cpp; a.out
A::initDevice():
     13, Generating enter data create(x[:n],this[:1])
A::deleteDevice():
     16, Generating exit data delete(x[:1],this[:1])
A_operator::apply(A&, int):
     26, Generating implicit acc routine seq
         Generating acc routine seq
         Generating NVIDIA GPU code
A_runner::A_runner(unsigned long):
     34, Generating enter data copyin(this[:1])
A_runner::~A_runner():
     39, Generating exit data delete(this[:1])
A_runner::applyOperator():
     41, Generating present(a)
         Generating NVIDIA GPU code
         43, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     41, Generating implicit create(this[:]) [if not already present]
0 0 0 0 0 0 0 0 0 0

Hope this helps,
Mat

Thank you for the detailed explanation. So essentially the A_runner has to be copied entirely to the device? What if there are many members that are not required in the parallel section? I would like to avoid unnecessary copies. Is that possible?

Structs/classes are aggregate types so can’t be decomposed. The layout of the struct object needs to match between the host and device since member access are just offsets from the base address.

The reason why A_runner’s this pointer is needed is due to the hidden “this” pointers. For example, you use “a.n”, but this is really “this.a.n”, and “aop.apply( a, i )” is really “this.aop.apply( this.a, i )”.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.