OpenACC copy clause for pointer member of struct does not get attached if parent is copied implicitly

This is loosely related to "OpenACC" deepcopy support in current nvc releases? It works, but is it supported? - similar example, different question. I noticed some (to me) weird behavior with struct pointer member attach / detach. It seems like nvc only creates attach / detach statements when the struct itself is explicitly copied, not when that happens implicitly.

I am using NVHPC SDK 21.9 on Ubuntu 20.04 and the following code:

//deepcopy_test.c
#include <stdlib.h>
#include <stdio.h>

typedef struct points {
    float* x; float* y;
    int n;
} points;

int main() {
    points p;
    p.n = 1000;
    p.x = ( float*) malloc ( sizeof ( float )*p.n );
    p.y = ( float*) malloc ( sizeof ( float )*p.n );
    // working pragma:
    //#pragma acc data copy(p) copy(p.x[:p.n]) copyin(p.y[:p.n])
    // generates implicit copy(p) but no attach for p.x, p.y:
    #pragma acc data copy(p.x[:p.n]) copyin(p.y[:p.n])
    {
        #pragma acc parallel loop
        for (int i=0; i<p.n; ++i ) p.x[i] += p.y[i];
    }
    printf("all done, exiting\n");
}

Compiling this yields:

$ nvc -Wall -O3 -g -Minfo=accel -acc=strict deepcopy_test.c -o deepcopy_test && NVCOMPILER_ACC_NOTIFY=3 ./deepcopy_test
main:
     19, Generating copy(p.x[:p.n]) [if not already present]
         Generating Tesla code
         21, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     19, Generating implicit copyin(p) [if not already present]
         Generating copyin(p.y[:p.n]) [if not already present]
upload CUDA data  file=[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 bytes=4000
upload CUDA data  file=[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 bytes=4000
upload CUDA data  file=[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 variable=p bytes=24
launch CUDA kernel  file=[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 num_gangs=1024 num_workers=1 vector_length=128 grid=1024 block=128
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Notice how the two pointer members are created and copied, but not attached. While with the other pragma line which contains an explicit copy statement for p as well everything works:

$ nvc -Wall -O3 -g -Minfo=accel -acc=strict deepcopy_test.c -o deepcopy_test && NVCOMPILER_ACC_NOTIFY=3 ./deepcopy_test
main:
     19, Generating copy(p) [if not already present]
         Generating copyin(p.y[:p.n]) [if not already present]
         Generating copy(p.x[:p.n]) [if not already present]
         Generating Tesla code
         21, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
upload CUDA data  file=[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 variable=p bytes=24
upload CUDA data  file=[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 variable=.attach. bytes=8
upload CUDA data  file=[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 bytes=4000
upload CUDA data  file=[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 variable=.attach. bytes=8
upload CUDA data  file=[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 bytes=4000
launch CUDA kernel  file=[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 num_gangs=1024 num_workers=1 vector_length=128 grid=1024 block=128
upload CUDA data  file=[...]/deepcopy_test.c function=main line=22 device=0 threadid=1 variable=.detach. bytes=8
download CUDA data  file=[...]/deepcopy_test.c function=main line=22 device=0 threadid=1 variable=p bytes=24
download CUDA data  file=[...]/deepcopy_test.c function=main line=22 device=0 threadid=1 bytes=4000
all done, exiting

I’m not sure if this is a bug or not? Mat Colgrove already mentioned that the spec forbids putting the same variable in multiple copy statements to avoid ambiguity and that gcc in particular does not like it. Maybe I should stop trying it anyway? ;) Reading the OpenACC 2.7 spec I haven’t found that yet (have not searched specifically for it though), if anyone knows specifics, please share. For copy clauses though, copy(p.x[:p.n]) is a pointer reference? Hence it should trigger an attach? Otherwise, how would I reference the memory that got created?

Any insights are highly appreciated.

Order matters. Here, the implicitly copy of “p” occurs after the explicit copy of it’s data member. In order to implicitly do the attach, “p” must be on the device prior to it’s data members. So here, you do need to add “p” to a copy clause before the appearance of it’s data members, though I’d advise using “copyin” instead of “copy”, else the device pointers will be copied back the host and overwrite the host pointers.

I’m not sure if this is a bug or not?

Not a bug in the compiler, just the implicitly copy occurring after explicit copies and the parent needs to be present in order for the runtime to attach it’s data members.

Maybe I should stop trying it anyway?

Well, don’t stop trying, but I still suggest you switch to using unstructured data region as I show in my response: "OpenACC" deepcopy support in current nvc releases? It works, but is it supported? - #8 by MatColgrove

While more verbose, unstructured data regions are easier to work with for manual deep copy, especially with larger and more complex data structures as well as larger codes. Plus they can be encapsulated into subroutines and more closely follow how a deep copy would be performed when copying one structure to another in host code.

-Mat

Hm… shouldn’t this cause an error? (Or enforce the order?) The spec for the copy clause, both 2.7 and 3.2 version states: “If var is a pointer reference, an attach action is performed.”. Without p on the device, can there be an attach action? If there is no attach action, how do I access the device copy that got created? There seem to be no valid variable on the device to access the data region that got created and copied.

Regarding copying back device pointers: there should be a detach action if there was an attach action, and the detach restores the host pointer. With the usual caveats of ensuring that there is a matching “free” at least. So that should be fine?

Correct, but this is for the pointer reference itself, not the for the parent child relationship. It’ s just saying that the device copy of pointer address will be set (i.e. attached) to point at the allocated block of device memory.

(Or enforce the order?)

It does enforce order, in that implicit copies are performed after explicit copies. Doing it the other way could present issues, especially with aliased pointers.

shouldn’t this cause an error?

Well, it’s legal to just copy in data member. Accessing it on the device via the parent would cause a runtime error, but it’s not uncommon to use an aliased pointer to the data member for access on the device.

I wouldn’t advocate for an error since there’s nothing illegal about the code nor would want to have the compiler presume what the user intends, but I’ll check to see it there’s a way we can issue a warning to at least highlight a potential problem.

Regarding copying back device pointers: there should be a detach action if there was an attach action, and the detach restores the host pointer. With the usual caveats of ensuring that there is a matching “free” at least. So that should be fine?

The problem isn’t with the detach, but rather the copy back. A “copy(p)” says to perform a shallow copy of the structure to and from the device. This is for all data members, including pointers. So copying back a structure will overwrite the host structure.

Keep in mind a pointer is just a scalar with an address in it. That address could be a host address, it could be a device address (allocated via acc_malloc or cudamalloc), or even a unified memory address. The compiler can’t tell which and why it’s up to the user to make these decisions.

Not sure it’s relevant, but I’m trying to think of where it gets attached to… Is a struct p created on the device, the pointer attached to it and then it’s overwritten with the host data due to the implicit copy of p that follows?

I get your point about not creating an error. Still, this constellation silently creates inaccessible memory? A warning about that would be very helpful. I think even without trying to divine user intention we can assume that a copy statement to the device means that the data is intended to be used on the device.

I wish I had a whiteboard since this would be much easier to explain by drawing it out. Though I’ll give it try without.

First, “p” isn’t a pointer. It’s a fixed size aggregate data type containing three data members, two scalar pointers and an int, for a total of ~20 bytes. (I approximate the size since to could be padded for alignment). When “p” is created on the device, 20 bytes is allocated and copied to the device via a memcpy type operation. This is what’s meant by “shallow copy”, i.e. it’s a direct copy of the host “p” to the device copy of “p”. The pointers members are treated no different than any other scalar data members. Fortran arrays include a descriptor containing information about the array (such as bounds), but in C/C++ it’s just a scalar with no additional information. This is why we can do an implicit deep copy in Fortran, but not in C/C++.

The reverse is true as well. When copying “p” from the device to the host, a simple memcpy is performed on the object and any values in the struct, including any device address, overwriting the values in the host copy.

Note for your second example using “p[PLEN]”, p still isn’t a pointer but a fixed sized object of 20*PLEN in size. So the same shallow copy operation would apply, just on the larger object.

If “PLEN” was replaced with a variable, then “p” would be a VLA (variable length array) which is a pointer that gets implicitly allocated / deallocated within the scoping unit it’s declared. So for a VLA or if p was changed to be a pointer, i.e. “struct *p”, and then allocated, the compiler runtime would first create the scalar pointer on the device, allocate the memory block associated with the array, then “attach” p to this memory block, i.e. update p’s value to the starting address of this block.

For the pointer data members a similar operation is done. When “copy(p.x[:p.n])” is encountered, a device memory block is allocated and the value of “x” is set to the starting address of this block. If “p” is on the device, then “p.x” pointer itself is already present on the device and hence “p.x” is attached. However if “p” is not on the device, then “x” gets created independently given it’s not already present. If “p” is created after “x”, this “p” will have a distinct “p.x” unassociated with previously created “x”. It would be up to the programmer to then explicitly create this association via the “attach” data clause or a call to the “acc_attach” API. It’s similar to this C code:

    float *dev_x;
    // allocate a block of memory, and attach the address to the pointer
    dev_x = ( float*) malloc ( sizeof ( float )*size );
    // create the struct
    points p;
    // attach the previously allocated memory to p.x
    p.x = dev_x;

There are programs that use this technique to create a device memory pool and then attach pointers to sections of this pool.

Still, this constellation silently creates inaccessible memory?

It’s not really any different than if the program deallocated a null pointer or other bad address. The compiler can help find syntax error and some semantic errors, but it’s up to the programmer to ensure the code is correct.

A warning about that would be very helpful.

Understood. I haven’t had a chance to talk with engineering to see if it’s even possible for the compiler to detect this situation, but will today. If they think it could be done, I’ll then add an RFE.

-Mat