"OpenACC" deepcopy support in current nvc releases? It works, but is it supported?

I am curious about the state of support for the proposed OpenACC deepcopy extension in the HPC SDK compilers. I’m currently using version 21.9 on Ubuntu 20.04 (I’ve been lazy about updating). A code example:

//deepcopy_test.c
#include <stdlib.h>
#include <stdio.h>

typedef struct points {
    float* x; float* y;
    int n;
    // no error with nvc, but is ignored unless either:
    // -ta=tesla:deepcopy or -gpu=deepcopy is used
    #pragma acc shape(x[0:n],y[0:n])
} points;

int main() {
    points p;
    p.n = 1000;
    p.x = ( float*) malloc ( sizeof ( float )*p.n );
    p.y = ( float*) malloc ( sizeof ( float )*p.n );
    #pragma acc data copy(p) //copy(p.x[:p.n]) copyin(p.y[:p.n])
    {
        #pragma acc parallel loop
        for (int i=0; i<p.n; ++i ) p.x[i] += p.y[i];
    }
    printf("all done, exiting\n");
}

The #pragma acc shape directive is obviously convenient and there is a lot of documentation about it out there. It actually took me quite a while to figure out that it is just a proposed extension and not yet part of the standard, so for quite a while I was confused by this:

$ nvc -Wall -O3 -g -Minfo=accel -acc=strict deepcopy_test.c -o deepcopy_test && NVCOMPILER_ACC_NOTIFY=3 ./deepcopy_test
main:
     19, Generating copy(p) [if not already present]
         Generating Tesla code
         21, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     21, Accelerator restriction: size of the GPU copy of p.x,p.y is unknown
upload CUDA data  file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 variable=p bytes=24
launch CUDA kernel  file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 num_gangs=1024 num_workers=1 vector_length=128 grid=1024 block=128
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

I’m using the -acc=strict option, which according to the nvc man page should cause the compiler to “Issue warnings when accelerator directives are encountered which do not adhere to the OpenACC standard.” There is no warning, yet the pragma has no effect: the arrays are not copied to the GPU and the kernel therefore fails. The outcome is the same with -acc=verystrict. Shouldn’t this non-standard pragma in combination with the strict option cause a warning?

Okay. So let’s look at the nvc -gpu documentation: according to this, -gpu has a deepcopy option (yay!) which does “Enable full deep copy of aggregate data structures in OpenACC…” (sounds promising!) …“Fortran only” (sadface). Okay… but it works with -ta=tesla:deepcopy… Let’s try it anyway:

$ nvc -Wall -O3 -g -Minfo=accel -acc=strict -gpu=deepcopy deepcopy_test.c -o deepcopy_test && NVCOMPILER_ACC_NOTIFY=3 ./deepcopy_test
main:
     19, Generating copy(p) [if not already present]
         Generating Tesla code
         21, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     21, Accelerator restriction: size of the GPU copy of p.x,p.y is unknown
upload CUDA data  file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 variable=p bytes=24
upload CUDA data  file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 variable=.attach. bytes=8
upload CUDA data  file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 bytes=4000
upload CUDA data  file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 variable=.attach. bytes=8
upload CUDA data  file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 bytes=4000
launch CUDA kernel  file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 num_gangs=1024 num_workers=1 vector_length=128 grid=1024 block=128
download CUDA data  file=/home/[...]/deepcopy_test.c function=main line=22 device=0 threadid=1 bytes=4000
upload CUDA data  file=/home/[...]/deepcopy_test.c function=main line=22 device=0 threadid=1 variable=.detach. bytes=8
download CUDA data  file=/home/[...]/deepcopy_test.c function=main line=22 device=0 threadid=1 bytes=4000
upload CUDA data  file=/home/[...]/deepcopy_test.c function=main line=22 device=0 threadid=1 variable=.detach. bytes=8
download CUDA data  file=/home/[...]/deepcopy_test.c function=main line=22 device=0 threadid=1 variable=p bytes=24
all done, exiting

Huh. No error. The compiler still complains that it does not know the size of the transfers in the parallel region, yet the data gets transferred in both directions for the data clause anyway… So is the deepcopy extension supported for C (and C++?) after all? Are there any caveats I should be aware of?

The “shape” and “policy” directives are considered experimental and a beta feature, but yes, they should work in C/C++. Though we haven’t worked on this feature in some time and don’t plan on updating it until it’s adopted into the OpenACC standard, and I have no idea when or if that will occur. So I wouldn’t call it a “supported extension”, given that beta features may change or go away, in this case depending on what gets adopted by the standard. It was just meant to get feedback from users and establish a proof-of-concept implementation.

Below is a standard-compliant version that you can use. I’m using unstructured data regions since gcc doesn’t like it when the parent and its data members are on the same directive. Not as clean a solution as using “shape”, but it is more portable.

#include <stdlib.h>
#include <stdio.h>

typedef struct points {
    float* x; float* y;
    int n;
} points;

int main() {
    points p;
    p.n = 1000;
    p.x = ( float*) malloc ( sizeof ( float )*p.n );
    p.y = ( float*) malloc ( sizeof ( float )*p.n );
    #pragma acc enter data copyin(p)
    #pragma acc enter data copyin(p.x[:p.n])
    #pragma acc enter data copyin(p.y[:p.n])

    #pragma acc parallel loop present(p)
    for (int i=0; i<p.n; ++i ) p.x[i] += p.y[i];
    #pragma acc exit data copyout(p.x[:p.n])
    #pragma acc exit data delete(p.y[:p.n])
    #pragma acc exit data delete(p)
    printf("all done, exiting\n");
}

As for “-acc=strict|verystrict” not flagging shape as an extension: yes, it probably should. We added this option mostly to help users find the old PGI Accelerator model directives in their codes (OpenACC is based on the PGI Accelerator model, a Cray model, and another from CAPS). If needed, I can add a problem report and get it fixed.

For “-gpu=deepcopy”, what’s officially supported is the compiler implicitly performing the deep copy. But this is only available in Fortran, since Fortran arrays carry bounds information that the compiler runtime can use. In C/C++, arrays are really just unbounded pointers, which is why something like “shape” is needed.

Now, I’m not trying to dissuade you from using “shape”, it’s a very useful feature; I just want to be open about it being experimental and subject to change.

-Mat

Thank you for the quick, exhaustive and very helpful answer, as always.

Regarding your example:
Could I achieve the same thing with two nested structured data regions, or with one data region for the parent and data clauses for the members on the parallel loop directive? Just checking, since I like the scoping idea for not missing an exit directive.
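
Something like this is what I mean; a rough, untested sketch:

#pragma acc data copyin(p)
{
    // the members go in an inner region so the runtime can attach their
    // device copies to the device copy of p
    #pragma acc data copy(p.x[:p.n]) copyin(p.y[:p.n])
    {
        #pragma acc parallel loop present(p)
        for (int i=0; i<p.n; ++i ) p.x[i] += p.y[i];
    }
}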

Regarding “-gpu=deepcopy”:
I get that the compiler needs the shape clause to have the relevant information in C/C++. Is the combination of “-gpu=deepcopy” and shape clauses on a similar level of “being supported” as the Fortran version?

Since I’m in a corporate setting, I guess that means no shape clauses for me, especially since I still couldn’t find it in the 3.2 standard. I’ll try to keep it in mind though, and provide feedback if I stumble upon something strange. So far it looks very useful and pretty straightforward to me, to the point where I’m not sure why it wasn’t included long ago. I may sometimes want to be more specific than is possible with shape clauses (copyin vs. copy, for example), but that does not take much away from the shape clause and is about the only feedback I have right now.

Regarding “-acc=strict|verystrict”:
I would vote for flagging this as an extension, yes. As I said, there is a lot of information on it out there, and it took me quite a while to figure out that it is NOT part of the standard (yet). If I make other mistakes in the pragmas, I get errors; here it just gets ignored without comment. Even when I finally suspected this might be an extension and used “-acc=strict”, I did not get any diagnostics.

Actually, I’m thinking it may even be beneficial to give a “pragma ignored” warning even without “-acc=strict” when “-gpu=deepcopy” is not used. I probably spent a day investigating this, and that investigation would have been a lot more directed had I known that the shape clause was correct but being ignored deliberately.

The “deepcopy” flag is overloaded here. The supported part is the true deepcopy, i.e. the compiler implicitly traversing the data structure, without the need for the “shape” directive.

Since I’m in a corporate setting, I guess that means no shape clauses for me, especially since I still couldn’t find it in the 3.2 standard.

As I mentioned, “shape” is experimental so not something I’d advise adding to production code.

I would vote for flagging this as an extension, yes

Sounds good. I added an issue report, TPR #31369, and sent it to engineering.

-Mat

Thank you for the clarifications.

Should this also work with gcc?

#include <stdlib.h>
#include <stdio.h>

typedef struct points {
    float* x; float* y;
    int n;
} points;

int main() {
    points p;
    p.n = 1000;
    p.x = ( float*) malloc ( sizeof ( float )*p.n );
    p.y = ( float*) malloc ( sizeof ( float )*p.n );
    #pragma acc data copyin(p)
    {
        #pragma acc parallel loop copy(p.x[:p.n]) copyin(p.y[:p.n])
        for (int i=0; i<p.n; ++i ) p.x[i] += p.y[i];
    }
    printf("all done, exiting\n");
}

I’m not positive, but I doubt it.

In OpenACC, it’s illegal to put the same variable in different data clauses, to avoid ambiguous behavior. The error I’ve seen from gcc is due to it treating “p.x” and “p.y” as the same variable, given their common parent, so having them on the same directive is likely to trigger that error. Though I haven’t actually tried this particular example, so I could be wrong.

In general, I’ve moved away from using structured data regions in favor of unstructured regions, mostly because I can then add an “enter data” directive directly after the arrays are allocated and an “exit data” directly before the “free”. That way the lifetime and scope of the host and device copies of the data are the same, which makes it much easier to port large applications.

I then take a bottom-up, incremental approach to offloading the compute: put an “update” directive before and after the offloaded loop to synchronize the data, then, as more loops are offloaded one at a time, widen where the “update” occurs. Eventually, once all compute is offloaded, only the minimal amount of data movement remains, or it can be removed altogether.
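
For a single offloaded loop that looks roughly like this (just a sketch, assuming plain arrays x and y of length n that were already created on the device with “enter data” right after allocation):

// synchronize the current host values before the kernel
#pragma acc update device(x[0:n], y[0:n])
#pragma acc parallel loop present(x[0:n], y[0:n])
for (int i=0; i<n; ++i ) x[i] += y[i];
// synchronize the result back to the host; as surrounding loops get
// offloaded, these updates are hoisted outward or removed entirely
#pragma acc update self(x[0:n])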

Not to say that this method can be applied in every case, nor that structured data regions don’t have a use; over the years I’ve just found this method to be less error-prone and more productive than trying to widen the scope of structured regions.

-Mat

Isn’t giving up on structured regions something of a step back, since you give up the scope-based resource management convenience? I can see how your approach might scale better with larger applications, though. I will certainly keep this in mind. Any similar wisdom regarding simply using unified memory? ;) On Volta/Turing and later, at least?

As a sort of follow-up, if I may: I’m trying to extend the example so that there is an array of points:

//deepcopy_test.c
#include <stdlib.h>
#include <stdio.h>

typedef struct points {
    float* x; float* y;
    int n;
    // no error with nvc, but is ignored unless either:
    // -ta=tesla:deepcopy or -gpu=deepcopy is used
    #pragma acc shape(x[0:n],y[0:n])
} points;

#define PLEN 3
int main() {
    points p[PLEN];
    for (int j=0; j<PLEN; j++) {
        p[j].n = 1000;
        p[j].x = ( float*) malloc ( sizeof ( float )*p[j].n );
        p[j].y = ( float*) malloc ( sizeof ( float )*p[j].n );
    }
        #pragma acc parallel loop copy(p[:PLEN])
        for (int j=0; j<PLEN; j++) {
            {
                #pragma acc loop
                for (int i=0; i<p[j].n; ++i ) p[j].x[i] += p[j].y[i];
            }
        }
    printf("all done, exiting\n");
}

This works with -gpu=deepcopy, but I’m at a loss as to how to do this in a standard-compliant way.

Slightly, but only in that you need to use the “present” clause more. Not to say that structured data regions don’t have their place, only that in general it’s best to hoist the data regions as high up in the program as possible so device data isn’t being allocated and deallocated multiple times. It also gives more fine-grained control over when data is synchronized (via the update directive) and limits unnecessary data movement.
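
As a sketch of what “using present more” looks like, assuming the struct and its members were put on the device earlier with “enter data” (the helper routine is just made up for illustration):

// the data region lives high up in the program; compute routines only
// assert presence instead of creating device copies of their own
void add_points(points *p) {
    #pragma acc parallel loop present(p[0:1])
    for (int i=0; i<p->n; ++i ) p->x[i] += p->y[i];
}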

Any similar wisdom regarding simply using unified memory?

The caveat with unified memory is that it currently can only be used for heap (allocated) memory and doesn’t work with CUDA-aware MPI. Other than that, it works great.
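
If you want to try it, the flag to use should be “-gpu=managed” (the older spelling was “-ta=tesla:managed”), for example:

% nvc deepcopy_test.c -acc -gpu=managed -Minfo=accel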

While you’re porting code, especially if you’re incrementally adding compute regions, performance may not be good since there’s extra data movement, but that’s not much different from the top-down data approach I noted earlier. The main difference is that the data movement occurs during the execution of the kernel, which makes it difficult to do kernel performance tuning until all compute has been offloaded and data movement is minimized.

Here’s your code updated to use manual deep copy. I also added initialization, printing, and deallocation. Some of these steps could be combined, but in larger codes they’re often separated (such as if you’re using a C++ class), so I kept them separate.

% cat deepcopy_test.c
//deepcopy_test.c
#include <stdlib.h>
#include <stdio.h>

typedef struct points {
    float* x; float* y;
    int n;
} points;

#define PLEN 3
int main() {
    points p[PLEN];
#pragma acc enter data create(p[0:PLEN])
    for (int j=0; j<PLEN; j++) {
        p[j].n = 1000;
        p[j].x = ( float*) malloc ( sizeof ( float )*p[j].n );
        p[j].y = ( float*) malloc ( sizeof ( float )*p[j].n );
#pragma acc update device(p[j].n)
#pragma acc enter data create(p[j].x[:p[j].n])
#pragma acc enter data create(p[j].y[:p[j].n])
    }
    for (int j=0; j<PLEN; j++) {
        for (int i=0; i<p[j].n; ++i) {
            p[j].x[i] = 1.0;
            p[j].y[i] = 2.0;
        }
#pragma acc update device( p[j].x[:p[j].n],  p[j].y[:p[j].n])
    }
        #pragma acc parallel loop present(p)
        for (int j=0; j<PLEN; j++) {
            {
                #pragma acc loop
                for (int i=0; i<p[j].n; ++i ) {
                     p[j].x[i] += p[j].y[i];
                }
            }
        }

    for (int j=0; j<PLEN; j++) {
#pragma acc update self(p[j].x[:p[j].n])
        printf("%d: ",j);
        for (int i=0; i<4; ++i) {
           printf("%f ",p[j].x[i]);
        }
        printf("\n");
    }

    for (int j=0; j<PLEN; j++) {
#pragma acc exit data delete(p[j].x, p[j].y)
        free(p[j].x);
        free(p[j].y);
    }
#pragma acc exit data delete(p)
    printf("all done, exiting\n");
}
% nvc deepcopy_test.c -acc -Minfo=accel ; a.out
main:
     12, Generating enter data create(p[:])
     21, Generating update device(p.n)
         Generating enter data create(p.x[:p.n],p.y[:p.n])
     28, Generating update device(p.x[:p.n],p.y[:p.n])
         Generating present(p[:])
         Generating NVIDIA GPU code
         30, #pragma acc loop gang /* blockIdx.x */
         33, #pragma acc loop vector(128) /* threadIdx.x */
     33, Loop is parallelizable
     39, Generating update self(p.x[:p.n])
     50, Generating exit data delete(p.x[:1],p.y[:1])
     54, Generating exit data delete(p[:])
0: 3.000000 3.000000 3.000000 3.000000
1: 3.000000 3.000000 3.000000 3.000000
2: 3.000000 3.000000 3.000000 3.000000
all done, exiting

-Mat

What if there is no handy loop to put the statements in? Conditional compilation?

#ifdef _OPENACC
for (int j=0; j<PLEN; j++) {
#pragma acc update device(p[j].n)
}
#endif

Is it common practice to not indent pragmas, btw?

Sure, though if OpenACC isn’t used, the compiler should optimize away the empty loop. So it probably doesn’t matter in this case.

Is it common practice to not indent pragmas, btw?

No, it’s just personal preference and I’m not very consistent. I tend to put them in the first column just to make them stand out, but will also indent them when on a loop.

-Mat