I am curious about the state of support for the proposed OpenACC deep copy extension in the HPC SDK compilers. I’m currently using version 21.9 on Ubuntu 20.04 (I have been lazy about updating). A code example:
//deepcopy_test.c
#include <stdlib.h>
#include <stdio.h>
typedef struct points {
    float* x; float* y;
    int n;
    // no error with nvc, but is ignored unless either:
    // -ta=tesla:deepcopy or -gpu=deepcopy is used
    #pragma acc shape(x[0:n],y[0:n])
} points;
int main() {
    points p;
    p.n = 1000;
    p.x = ( float*) malloc ( sizeof ( float )*p.n );
    p.y = ( float*) malloc ( sizeof ( float )*p.n );
    #pragma acc data copy(p) //copy(p.x[:p.n]) copyin(p.y[:p.n])
    {
        #pragma acc parallel loop
        for (int i=0; i<p.n; ++i ) p.x[i] += p.y[i];
    }
    printf("all done, exiting\n");
}
The #pragma acc shape directives are obviously convenient, and there is a lot of documentation about them out there. It actually took me quite a while to figure out that shape is just a proposed extension and not yet part of the standard. So for quite a while I was confused by this:
$ nvc -Wall -O3 -g -Minfo=accel -acc=strict deepcopy_test.c -o deepcopy_test && NVCOMPILER_ACC_NOTIFY=3 ./deepcopy_test
main:
19, Generating copy(p) [if not already present]
Generating Tesla code
21, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
21, Accelerator restriction: size of the GPU copy of p.x,p.y is unknown
upload CUDA data file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 variable=p bytes=24
launch CUDA kernel file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 num_gangs=1024 num_workers=1 vector_length=128 grid=1024 block=128
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
I’m using the -acc=strict option, which according to the nvc man page should cause the compiler to “Issue warnings when accelerator directives are encountered which do not adhere to the OpenACC standard.” There is no warning, yet the pragma has no effect: the arrays are not copied to the GPU, and hence the kernel fails. The outcome is the same with -acc=verystrict. Shouldn’t this non-standard pragma in combination with the strict option cause a warning?
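For comparison, here is a minimal sketch of the manual deep copy that standard OpenACC (2.6 and later) allows without the shape directive: the struct is copied first, and the member arrays are copied afterwards so that the device pointers get attached. The function name run_manual_deepcopy, the file name, and the test data are hypothetical and only for illustration. This has not been verified against nvc 21.9, so treat it as a sketch under those assumptions rather than a confirmed fix.

```c
// manual_deepcopy_sketch.c (hypothetical file name)
#include <stdlib.h>

typedef struct points {
    float *x;
    float *y;
    int n;
} points;

// Hypothetical helper: fills x and y, computes x += y inside an OpenACC
// region, and returns the number of elements that differ from the expected
// result (0 means success, -1 means a host allocation failed).
int run_manual_deepcopy(int n)
{
    points p;
    p.n = n;
    p.x = (float *)malloc(sizeof(float) * (size_t)n);
    p.y = (float *)malloc(sizeof(float) * (size_t)n);
    if (p.x == NULL || p.y == NULL) {
        free(p.x);
        free(p.y);
        return -1;
    }
    for (int i = 0; i < n; ++i) {
        p.x[i] = (float)i;
        p.y[i] = 2.0f * (float)i;
    }

    // Step 1: shallow copy of the struct; its pointers still hold host addresses.
    #pragma acc enter data copyin(p)
    // Step 2: copy the arrays. Because p is already present, the device
    // copies of p.x and p.y are attached to the new device arrays.
    #pragma acc enter data copyin(p.x[0:p.n], p.y[0:p.n])

    #pragma acc parallel loop present(p)
    for (int i = 0; i < p.n; ++i)
        p.x[i] += p.y[i];

    // Members first, then the parent struct. p itself is deleted rather than
    // copied out, so the host pointers are never overwritten.
    #pragma acc exit data copyout(p.x[0:p.n]) delete(p.y[0:p.n])
    #pragma acc exit data delete(p)

    // Host-side check: every element should now equal 3*i.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (p.x[i] != 3.0f * (float)i)
            ++mismatches;

    free(p.x);
    free(p.y);
    return mismatches;
}
```

Calling run_manual_deepcopy(1000) from a small main and building with the same nvc -acc command line as above should exercise it. With OpenACC disabled, the pragmas are ignored and the loop simply runs on the host.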
Okay. So let’s look at the nvc -gpu documentation: according to it, -gpu has an option deepcopy (yay!) which does “Enable full deep copy of aggregate data structures in OpenACC…” (sounds promising!) …“Fortran only” (sadface). Okay… but it works with -ta=tesla:deepcopy… Let’s try this anyway:
$ nvc -Wall -O3 -g -Minfo=accel -acc=strict -gpu=deepcopy deepcopy_test.c -o deepcopy_test && NVCOMPILER_ACC_NOTIFY=3 ./deepcopy_test
main:
19, Generating copy(p) [if not already present]
Generating Tesla code
21, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
21, Accelerator restriction: size of the GPU copy of p.x,p.y is unknown
upload CUDA data file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 variable=p bytes=24
upload CUDA data file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 variable=.attach. bytes=8
upload CUDA data file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 bytes=4000
upload CUDA data file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 variable=.attach. bytes=8
upload CUDA data file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 bytes=4000
launch CUDA kernel file=/home/[...]/deepcopy_test.c function=main line=19 device=0 threadid=1 num_gangs=1024 num_workers=1 vector_length=128 grid=1024 block=128
download CUDA data file=/home/[...]/deepcopy_test.c function=main line=22 device=0 threadid=1 bytes=4000
upload CUDA data file=/home/[...]/deepcopy_test.c function=main line=22 device=0 threadid=1 variable=.detach. bytes=8
download CUDA data file=/home/[...]/deepcopy_test.c function=main line=22 device=0 threadid=1 bytes=4000
upload CUDA data file=/home/[...]/deepcopy_test.c function=main line=22 device=0 threadid=1 variable=.detach. bytes=8
download CUDA data file=/home/[...]/deepcopy_test.c function=main line=22 device=0 threadid=1 variable=p bytes=24
all done, exiting
Huh. No error. The compiler still complains that it does not know the size of p.x and p.y in the parallel region, yet the arrays are transferred in both directions for the data clause anyway… So is the deepcopy extension supported for C (and C++?) after all? Are there any caveats that I should be aware of?