Fake data movement triggered by implicit copies and present clauses traced with nsys

Dear staff,

I would like to ask for clarification regarding the behavior of the present clause, the declare directive for data movement, and implicit copies in data regions spanning different routines/modules.
I am working on a Fortran application offloaded to GPUs with OpenACC, compiled with nvfortran -acc -Minfo=accel from the hpc-sdk/2022 suite and traced with nsys profile --trace=openacc.

I noticed that the following actions reported at compile time,

“X, Generating implicit copy* [if not already present]”

and

“Y, Generating present*”

map, in the Nsight Systems timeline view, to OpenACC events of the data-movement kind. These events are labelled as enter/exit data but embed only a Wait event with no Enqueue Upload/Download; moreover, there is no corresponding memory operation in the CUDA event panel.
My guess is that such events are triggered by checks at runtime of the presence of the variable on the device.

If this is correct, there are two points that I do not understand:
(1) Is it possible to use the present clause to declare that the variable is on the device and avoid a presence check?
(2) Is it possible to avoid this presence check, revealed by an implicit copyin [if not already present] in a subroutine (B, child) of a subroutine (A, parent), when the variable is copied to the device in the parent subroutine A (with the enter data directive) and used on the device in the child subroutine B?

To clarify my question, I attach a minimal example reproducing the behaviour mentioned above. Here, I copied c to the device with declare create in the module, b with enter data in the parent subroutine, and a with declare copyin in the parent subroutine. The presence of b is checked in the second loop with the present clause.

variables.f90 (165 Bytes)
inplacesum.f90 (605 Bytes)
main.f90 (194 Bytes)

By compiling with nvfortran -acc -Minfo=accel variables.f90 inplacesum.f90 main.f90, I get

variables.f90:
inplacesum.f90:
inplacesum:
      8, Generating copyin(a(:,:)) [if not already present]
      9, Generating enter data copyin(b(:,:))
     11, Generating exit data delete(b(:,:))
implicit_copies:
     19, Generating NVIDIA GPU code
         20, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         21,   ! blockidx%x threadidx%x collapsed
     19, Generating implicit copyin(b(:,:),a(:,:)) [if not already present]
     25, Generating present(b(:,:))
     26, Generating NVIDIA GPU code
         27, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         28,   ! blockidx%x threadidx%x collapsed
     26, Generating implicit copyin(a(:,:)) [if not already present]
main.f90:

By selecting the OpenACC events in the Events view, I see the following:

Name	
Device Init : inplacesum.f90:8	
Enter Data : inplacesum.f90:8	
Enter Data : inplacesum.f90:9	
*Enter Data : inplacesum.f90:19	
     Wait : inplacesum.f90:19	
Compute Construct : inplacesum.f90:19	
*Exit Data : inplacesum.f90:19	
*Enter Data : inplacesum.f90:25	
     Wait : inplacesum.f90:25	
*Enter Data : inplacesum.f90:26	
     Wait : inplacesum.f90:26	
Compute Construct : inplacesum.f90:26	
*Exit Data : inplacesum.f90:26
*Exit Data : inplacesum.f90:25	
Exit Data : inplacesum.f90:11	
Exit Data : inplacesum.f90:8	

In the list above, I resolved and marked the “fake data movement” (presence checks?) triggered by implicit copyin and present.

I also noticed the presence of an implicit copyin [if not present] on a at line 26, involving a copied with declare, while no implicit copies are reported for c. Is the declare directive within a subroutine equivalent to an enter data directive, with an (implicit) exit data at the end of the subroutine?

Thank you for your help,

Laura

Hi Laura,

It looks like you have a good understanding of the OpenACC directives but just need some clarification on what’s going on under the hood.

Compute directives (parallel or kernels) also have a data region. When the user does not explicitly include shared variables in a data clause (copy, copyin, copyout, present, deviceptr), or when the compute region is not within a visible data region, the compiler must add an implicit copy[in,out] for these variables. To disable implicit copy generation, add “default(none)” to the compute directive. You will then need to explicitly add these variables to a data clause, either on the compute region or on a surrounding structured data region; otherwise you will get a compilation error.

Here’s the relevant section of the OpenACC standard:

667 If there is no default(none) clause on the construct, the compiler will implicitly determine data
668 attributes for variables that are referenced in the compute construct that do not have predetermined
669 data attributes and do not appear in a data clause on the compute construct, a lexically containing
670 data construct, or a visible declare directive

Copy clauses, explicit or implicit, have “present or” semantics: the runtime checks whether the variable is already present and performs the copy only if it is not. “present” is an assertion that will fail at runtime if there is no device copy of the variable.
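As a minimal sketch of the above (not taken from your attached files; the program name, array size, and values are made up for illustration), here is a compute construct with “default(none)” inside a structured data region. The arrays are listed explicitly in a “present” clause, so the compiler has nothing to add implicitly:

```fortran
program default_none_demo
  implicit none
  integer, parameter :: n = 64      ! hypothetical size, chosen for illustration
  real :: a(n,n), b(n,n)
  integer :: i, j

  a = 1.0
  b = 2.0

  ! Structured data region: a is only read on the device, b is copied both ways.
  !$acc data copyin(a) copy(b)

  ! default(none) disables implicit data attributes for the arrays, so a and b
  ! must appear in a data clause. present() asserts they are already on the device.
  !$acc parallel loop collapse(2) default(none) present(a, b)
  do j = 1, n
    do i = 1, n
      b(i,j) = b(i,j) + a(i,j)
    end do
  end do

  !$acc end data

  ! Host-side check of the result after b has been copied back.
  if (any(b /= 3.0)) error stop "unexpected result"
  print *, "ok"
end program default_none_demo
```

If the “present(a, b)” clause is removed while “default(none)” stays, I would expect a compile-time error for the two arrays rather than an implicit copy.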

The runtime uses a hash table called the “present table” to do the present check but also does the mapping between the host and device pointer addresses so it can pass the device pointer to the compute kernel. This mapping is required but has very low overhead. There’s no “fake data movement” here.

Next, let’s walk through the nsys profile:

Time(%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)    Max (ns)   StdDev (ns)                 Name
 -------  ---------------  ---------  ------------  ------------  ----------  ----------  -----------  -----------------------------------
    84.3       19,769,876          1  19,769,876.0  19,769,876.0  19,769,876  19,769,876          0.0  Enter Data@inplacesum.f90:8
     7.4        1,726,814          1   1,726,814.0   1,726,814.0   1,726,814   1,726,814          0.0  Enter Data@inplacesum.f90:9
     2.8          651,596          1     651,596.0     651,596.0     651,596     651,596          0.0  Wait@inplacesum.f90:8
     2.8          645,684          1     645,684.0     645,684.0     645,684     645,684          0.0  Wait@inplacesum.f90:9
     1.6          369,059          1     369,059.0     369,059.0     369,059     369,059          0.0  Device Init@inplacesum.f90:8
     0.3           72,878          1      72,878.0      72,878.0      72,878      72,878          0.0  Enqueue Upload@inplacesum.f90:8

First, you’ll notice that the copy of “a” seems to take significantly longer than the copy of “b”. This is not the case and the actual data movement time is about the same. If I add an unnecessary “!$acc update device(c)” in main.f90 the larger time moves:

 Time(%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)    Max (ns)   StdDev (ns)                 Name
 -------  ---------------  ---------  ------------  ------------  ----------  ----------  -----------  -----------------------------------
    76.7       18,920,001          1  18,920,001.0  18,920,001.0  18,920,001  18,920,001          0.0  Update@main.f90:9
     7.1        1,745,478          1   1,745,478.0   1,745,478.0   1,745,478   1,745,478          0.0  Enter Data@inplacesum.f90:8
     6.6        1,619,449          1   1,619,449.0   1,619,449.0   1,619,449   1,619,449          0.0  Enter Data@inplacesum.f90:9

Data transfers between the host and device can only be performed using host pinned memory (i.e. memory allocated in physical memory that is not swappable). We use a double buffering system where virtual memory is copied to a buffer and transferred asynchronously to the device. As the first buffer is transferring, the second buffer is filled thus hiding much of the virtual to pinned memory copy time.

The creation of the buffers and of the CUDA stream used for async transfers is delayed until the first time they are needed. Hence the extra time seen here is this one-time overhead, not the data transfer. The “Wait” entries are the time the host spends blocked waiting for the asynchronous transfer to complete.

Adding “cuda” to your nsys trace, i.e. “-t openacc,cuda”, we can see the actual data transfer time:

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)      Operation
 --------  ---------------  -----  ---------  ---------  --------  --------  -----------  ------------------
    100.0        1,279,931      2  639,965.5  639,965.5   634,781   645,150      7,332.0  [CUDA memcpy HtoD]

You’ll notice that the time spent in the OpenACC data routines is roughly double that of the actual data movement time. This is because the buffer size is bigger than your arrays, hence the runtime isn’t able to hide the virtual memory copy.

If I set the buffer size to half of your array size with the environment variable, i.e. “setenv NV_ACC_BUFFERSIZE 4000000”, then we see some overlap:

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)      Operation
 --------  ---------------  -----  ---------  ---------  --------  --------  -----------  ------------------
    100.0          888,188      4  222,047.0  221,727.0   212,159   232,575     11,061.9  [CUDA memcpy HtoD]
...
    16.4        1,349,307          1  1,349,307.0  1,349,307.0  1,349,307  1,349,307          0.0  Enter Data@inplacesum.f90:9

In general, the default buffer size is fine. It’s only because your arrays are relatively small that it’s too large here.

To answer your questions directly:

(1) Is it possible to use the present clause to declare that the variable is on the device and avoid a presence check?

Adding a “present” clause (or a copy clause) keeps the compiler from adding an implicit copy clause. Because the host to device address mapping is still needed, the present check will always be performed. The only way to avoid the present check is to use CUDA Fortran “device” arrays with the “deviceptr” clause. However, the overhead of the present check is very small, so it is unlikely to be worth the effort of managing the device data yourself.

(2) Is it possible to avoid this presence check, revealed by an implicit copyin [if not already present] in a subroutine (B, child) of a subroutine (A, parent), when the variable is copied to the device in the parent subroutine A (with the enter data directive) and used on the device in the child subroutine B?

The compiler has limited scope for knowing whether a compute region falls within a data region. Since the compute region is within a subroutine, it can’t tell whether the subroutine is called from within a data region. Hence, unless the user explicitly adds the variable to a data clause in the same scope, the compiler adds the implicit copy clause. In most cases, though, there’s no real difference between explicitly adding the variable to a data clause and having the compiler add it implicitly.

There are a few cases, mainly in C/C++ when using aggregate types (classes, structs) with dynamic data members (pointers), where it does matter, since the compiler can’t do implicit deep copies. But it’s less of a concern in Fortran since deep copies can be done when using the “-gpu=deepcopy” flag.
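To illustrate the parent/child case, here is a small sketch under my own assumptions (the module, routine names, and sizes are hypothetical and are not your inplacesum.f90). The parent creates the device copies with “enter data”, and the child names the arrays in a “present” clause so that nothing is added implicitly. The present table lookup still happens at runtime in both variants:

```fortran
module parent_child_demo
  implicit none
contains

  subroutine parent(a, b, n)
    integer, intent(in)    :: n
    real,    intent(in)    :: a(n,n)
    real,    intent(inout) :: b(n,n)
    ! Unstructured data region: device copies live until the matching exit data.
    !$acc enter data copyin(a, b)
    call child(a, b, n)
    ! Bring the result back and release both device copies.
    !$acc exit data copyout(b) delete(a)
  end subroutine parent

  subroutine child(a, b, n)
    integer, intent(in)    :: n
    real,    intent(in)    :: a(n,n)
    real,    intent(inout) :: b(n,n)
    integer :: i, j
    ! The compiler cannot see the caller's data region from here, so the
    ! arrays are named explicitly instead of relying on an implicit copyin.
    !$acc parallel loop collapse(2) present(a, b)
    do j = 1, n
      do i = 1, n
        b(i,j) = b(i,j) + a(i,j)
      end do
    end do
  end subroutine child

end module parent_child_demo

program main
  use parent_child_demo
  implicit none
  integer, parameter :: n = 32      ! hypothetical size
  real :: a(n,n), b(n,n)
  a = 1.0
  b = 2.0
  call parent(a, b, n)
  if (any(b /= 3.0)) error stop "unexpected result"
  print *, "ok"
end program main
```

With “nvfortran -acc -Minfo=accel” I would expect the feedback for the child to report “Generating present(...)” instead of an implicit copyin, though the exact messages may vary between compiler versions.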

I also noticed the presence of an implicit copyin [if not present] on a at line 26, involving a copied with declare, while no implicit copies are reported for c. Is the declare directive within a subroutine equivalent to an enter data directive, with an (implicit) exit data at the end of the subroutine?

It’s scoping. “declare create(c)” is in a module (global scope) so the compiler can detect that “c” is within a data region.

“declare copyin(a)” is in the parent so the compiler can’t detect this given the compute region is in a different scoping unit.
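A minimal sketch of the module-scope case, assuming a fixed-size module array (the module and routine names, the size, and the values are hypothetical). Because the “declare create” is visible wherever the module is used, the compute region in the subroutine needs no data clause for “c”:

```fortran
module variables_demo
  implicit none
  integer, parameter :: n = 64      ! hypothetical size
  real :: c(n,n)
  ! Module-scope declare: the device copy of c exists for the whole program
  ! and is visible to the compiler in every scope that uses this module.
  !$acc declare create(c)
end module variables_demo

module work_demo
  use variables_demo
  implicit none
contains
  subroutine scale_c()
    integer :: i, j
    ! No data clause for c here: the declare directive above is visible.
    !$acc parallel loop collapse(2)
    do j = 1, n
      do i = 1, n
        c(i,j) = 2.0 * c(i,j)
      end do
    end do
  end subroutine scale_c
end module work_demo

program main
  use variables_demo
  use work_demo
  implicit none
  c = 1.0
  ! declare create only allocates on the device, so push the host values first.
  !$acc update device(c)
  call scale_c()
  ! Pull the result back to the host before checking it.
  !$acc update self(c)
  if (any(c /= 2.0)) error stop "unexpected result"
  print *, "ok"
end program main
```

Moving the declare directive into a calling subroutine instead would put it out of sight of “scale_c”, which is the situation you see with “a”.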

Hope this helps,
Mat