Implicit data copy to device for allocated arrays using compilation option -stdpar=gpu

Dear NVIDIA-HPC team,

the following test code (provided in C and Fortran) shows unexpected behaviour when compiled with the option -stdpar=gpu.
The OpenACC directive specifies array x as present, but the array is never copied to the device beforehand, so the code should abort.

#include <stdio.h>
#include <stdlib.h>        /* malloc, free */
#include <cuda_runtime.h>  /* cudaGetLastError, cudaGetErrorString */

int main(void)
{

    const int n = 1000000;
#ifndef ALLOC
    float x[n];
#else
    float* x;
#endif
    float avg_i;
    int i;

#ifdef ALLOC
    x = (float*) malloc(n * sizeof(float));
#endif

    for (i = 0; i < n; i++)
       x[i] = 1.0;

    avg_i = 0.0;
    for (i = 0; i < n; i++)
       avg_i += x[i];
    avg_i /= n;
    printf("avg(x): %f\n", avg_i);

#pragma acc parallel loop present(x)
    for (i = 0; i < n; i++)
       x[i] = x[i] + 1.0;
    printf("cudaError: %s\n", cudaGetErrorString(cudaGetLastError()));

    avg_i = 0.0;
    for (i = 0; i < n; i++)
       avg_i += x[i];
    avg_i /= n;
    printf("avg(x): %f\n", avg_i);

#ifdef ALLOC
    free(x);
#endif

}

The code behaves correctly when compiled with the following options:
nvc++ -cpp -acc=gpu -Minfo=all -fast test_simple.cu -o test_c_simple

It aborts (as expected, since a present clause is used without a previous data copy) with the following error message:
FATAL ERROR: data in PRESENT clause was not found on device 1: name=x[:] host:0x2afdf9389010 file:/home/rputtin/test/OpenACC/saxpy/test_simple.cu main line:29

In order to compile the code with manually allocated arrays the preprocessor macro ALLOC can be used:
nvc++ -DALLOC -cpp -acc=gpu -Minfo=all -fast test_simple.cu -o test_c_simple
With these options the code still behaves correctly (and aborts).

But when the compiler option -stdpar=gpu is added, the code finishes without aborting:
nvc++ -DALLOC -cpp -stdpar=gpu -acc=gpu -Minfo=all -fast test_simple.cu -o test_c_simple
The code runs to completion and produces correct results. This seems to imply that the array x is copied to and from the device even though the present clause is set.

The same behaviour can also be observed with the corresponding Fortran code:

program main

    use cudafor

    implicit none

    integer, parameter :: n = 1000000
#ifndef ALLOC
    real :: x(n)
#else
    real,allocatable :: x(:)
#endif
    integer :: i

#ifdef ALLOC
    allocate(x(n))
#endif

    do i = 1, n
       x(i) = 1.0
    enddo

    print *, "avg(x):", sum(x)/n
!$acc parallel loop present(x)
    do i = 1, n
       x(i) = x(i) + 1.0
    end do
!$acc end parallel loop
    print *, "cudaError:", cudaGetErrorString(cudaGetLastError());
    print *, "avg(x):", sum(x)/n

#ifdef ALLOC
    deallocate(x)
#endif

end program

This simplified test code is an excerpt from a more complex test code, which measures the times for computation and for data copies to/from the device separately.
In the more complex code one can explicitly observe the timings for copying to/from the device (triggered by OpenACC/OpenMP directives). But when using manually allocated arrays and the option -stdpar=gpu, the explicit copying times disappear, while the timings of the parallel loops increase by an equivalent amount.

So, in summary, it seems that the compiler option -stdpar=gpu disables manual data handling for manually allocated arrays and instead generates implicit data copy operations, even when they are not necessary.

Is there a reason for this behaviour? Is it intended, or is it a compiler problem?

Thanks a lot for your support and regards,

Rene’


Hi Rene,

When using C++ or Fortran Standard Language Parallelism (i.e. -stdpar), CUDA Unified Memory is enabled by default (i.e. -gpu=managed is set). UM is only available for allocated heap memory, which is why the data is present as soon as you allocate it. However, when using static memory, the memory is not present on the device, hence the error.

To disable UM when using -stdpar, add the flag “-gpu=nomanaged”.

% nvc++ -DALLOC -cpp -stdpar=gpu -acc=gpu -Minfo=all -fast test_simple.cu -o test_c_simple -gpu=nomanaged
main:
     19, Memory set idiom, loop replaced by call to __c_mset4
     23, Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop containing reductions
     29, Generating present(x[:])
         Generating NVIDIA GPU code
         29, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     34, Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop containing reductions
% ./test_c_simple
avg(x): 1.000000
hostptr=0x14d5d1e07010,stride=1,size=1,extent=-1,eltsize=4,name=x[:],flags=0x200=present,async=-1,threadid=1
FATAL ERROR: data in PRESENT clause was not found on device 1: name=x[:] host:0x14d5d1e07010
 file:.../test_simple.cu main line:29

Hope this helps,
Mat

Hi Mat,

thanks for the information. At least for Fortran I can confirm that disabling UM changes the behaviour and that I can measure copying and execution times separately once again. My PSTL code aborts with the following error message when using manually allocated arrays with UM disabled:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

I will investigate this and let you know.

Thanks a lot for your help.

STDPAR needs UM enabled (or the data managed via OpenACC data directives); otherwise the code is accessing host addresses and will give this error.

Hi Mat,

sorry for the late reply. Meanwhile I found time to implement and run some further tests. I am experiencing the following behaviour:

Fortran do concurrent code seems to work properly with UM enabled or disabled. Even if I do not preload the data with OpenACC, the results are correct, but the timing is slower due to an implicit copy.
Is this expected? From your previous answer I would have expected STDPAR code without UM and without preloading to abort. Or does this only happen for PSTL and not for Fortran do concurrent code?

On the other hand, my PSTL code with UM disabled aborts even when the data is preloaded with OpenACC.

This is my C test code:

#include <stdio.h>
#include <stdlib.h>  /* malloc, free */
#include <sys/time.h>
#include <algorithm>
#include <execution>
#include <ranges>

int main(void)
{

    const int n = 10000000;
    float *x, *y;
    float a = 2.0;
    float avg;
    int i;

    x = (float*) malloc(n * sizeof(float));
    y = (float*) malloc(n * sizeof(float));

    for (i = 0; i < n; i++)
    {
       x[i]  = 2.0;
       y[i]  = 1.0;
    }

    auto i_range = std::ranges::iota_view{0,n};
    auto i_begin_ = std::ranges::begin(i_range);

#pragma acc enter data copyin(y,x)

    std::for_each_n(std::execution::par_unseq, i_begin_, n, [=](int i)
    {
      y[i] = a * x[i] + y[i];
    });

#pragma acc exit data copyout(y)

    avg = 0.0;
    for (i = 0; i < n; i++)
       avg += y[i];
    avg = avg/n;

    printf("PSTL loop:                %f\n", avg);

    free(x);
    free(y);
}

When compiled without UM, using the compile command

nvc++     --gcc-toolchain=/opt/bm/gcc-11.1.0 -std=c++20 -DALLOC      -acc=gpu -stdpar=gpu -Minfo=all -fast test_simple_pstl.cu  -gpu=nomanaged -o test_simple_c_pstl

the code aborts with the following error message:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

When compiling it with UM, using the compile command

nvc++     --gcc-toolchain=/opt/bm/gcc-11.1.0 -std=c++20 -DALLOC      -acc=gpu -stdpar=gpu -Minfo=all -fast test_simple_pstl.cu  -o test_simple_c_pstl

the code runs properly.

If I understand your advice correctly, this should not happen since the data is already preloaded via OpenACC. I am not that familiar with PSTL yet; perhaps the root cause is that I am using PSTL incorrectly here.

Thanks a lot for your support!

For our implementation,

DO CONCURRENT is built on top of OpenACC, so it will do the implicit data copy when entering each compute region if the data is not managed by UM or via data directives.
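
For example, a do concurrent loop can reuse data placed on the device with OpenACC data directives. A minimal sketch (my illustration, assuming compilation with nvfortran -stdpar=gpu -acc=gpu -gpu=nomanaged):

program dc_acc
    implicit none
    integer, parameter :: n = 1000000
    real, allocatable :: x(:)
    integer :: i

    allocate(x(n))
    x = 1.0

!$acc enter data copyin(x)      ! make x present on the device
    do concurrent (i = 1:n)     ! offloaded by -stdpar=gpu, reuses the present copy
       x(i) = x(i) + 1.0
    end do
!$acc exit data copyout(x)      ! copy the result back and release the device copy

    print *, "avg(x):", sum(x)/n
    deallocate(x)
end program dc_acc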

C++ STDPAR is built on top of Thrust, hence when not using UM, device pointers must be used. These can be pointers allocated by cudaMalloc or acc_malloc.

Here’s an example:

% cat test.cpp
#include <stdio.h>
#include <stdlib.h>  /* malloc, free */
#include <sys/time.h>
#include <algorithm>
#include <execution>
#include <ranges>
#include <openacc.h>

int main(void)
{

    const int n = 10000000;
    float *xDev, *yDev, *x, *y;
    float a = 2.0;
    float avg;
    int i;

    xDev = (float*) acc_malloc(n * sizeof(float));
    yDev = (float*) acc_malloc(n * sizeof(float));
    y = (float*) malloc(n * sizeof(float));
    x = (float*) malloc(n * sizeof(float));

    for (i = 0; i < n; i++)
    {
       x[i]  = 2.0;
       y[i]  = 1.0;
    }
    acc_memcpy_to_device(yDev,y,sizeof(float)*n);
    acc_memcpy_to_device(xDev,x,sizeof(float)*n);

    auto i_range = std::ranges::iota_view{0,n};
    auto i_begin_ = std::ranges::begin(i_range);

    std::for_each_n(std::execution::par_unseq, i_begin_, n, [=](int i)
    {
      yDev[i] = a * xDev[i] + yDev[i];
    });

    acc_memcpy_from_device(y,yDev,sizeof(float)*n);
    avg = 0.0;
    for (i = 0; i < n; i++)
       avg += y[i];
    avg = avg/n;

    printf("PSTL loop:                %f\n", avg);

    free(y);
    free(x);
    acc_free(yDev);
    acc_free(xDev);
}
% nvc++ -std=c++20 -acc=gpu -stdpar=gpu -fast test.cpp -gpu=nomanaged -o test_simple_c_pstl
% ./test_simple_c_pstl
PSTL loop:                5.000000
% nvc++ -std=c++20 -acc=gpu -stdpar=gpu -fast test.cpp -o test_simple_c_pstl
% ./test_simple_c_pstl
PSTL loop:                5.000000

Hi Matt,

thanks a lot for your valuable input. I have tested PSTL without UM using device pointers and can confirm that it works.

I have also implemented this approach in my more complex program, which compares PSTL, OpenACC, OpenMP and CUDA performance.
In this program the PSTL code now also works properly when UM is disabled, but the following OpenACC code aborts with this message:

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Is it also necessary to change anything within the OpenACC code (in C) when disabling UM?
Should a code fragment like this still work without any explicit data copying when UM is disabled?

#pragma acc parallel loop
    for (i = 0; i < n; i++)
       y1_acc[i] = a * x1_acc[i] + y1_acc[i];

Thanks a lot in advance for your advice.

As to why you’re getting the illegal memory address error, I don’t know. Can you provide a small reproducer?

I would expect the compiler to add an implicit data copy of the arrays with the size matching the loop trip count. What do the compiler feedback messages say?

Here’s what I would expect:

% nvc test.c -acc -Minfo=accel -V23.3
test1:
     12, Generating implicit firstprivate(n,i)
         Generating NVIDIA GPU code
         14, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     12, Generating implicit copy(y1_acc[:n]) [if not already present]
         Generating implicit copyin(x1_acc[:n]) [if not already present]

For more complex data structures, like multi-dimensional arrays, structs with dynamically allocated data members, etc., the compiler won’t be able to add the data movement implicitly. Granted, even with single-dimensional arrays, C uses unbounded pointers rather than real arrays like Fortran, so the compiler is just guessing that the size of the array matches the loop trip count. Hence it’s typically better to include an explicit copy clause:

#pragma acc parallel loop copyin(x1_acc[:n]) copy(y1_acc[:n])

The problem with this is that the data is going to be copied each time the kernel is entered. If you enter it more than once, or if the data is used in multiple kernels, then you’ll have a lot of extra, unnecessary data movement.

My recommended strategy is to take a top-down approach to data management. Basically hoist the data directives to the same spot where the data is allocated, and then use update directives to synchronize the data. This way the lifetime and scope of the device data matches the host, and you have direct control over when the data movement occurs.

For example, consider the following:

#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

void test1(int n, int *y1_acc, int *x1_acc) {
#pragma acc parallel loop copyin(x1_acc[:n]) copy(y1_acc[:n])
   for(int i=0; i<n; ++i) {
      y1_acc[i] = x1_acc[i] + y1_acc[i];
   }

   return;
}

int main() {
  int n=512;
  int *y1_acc, *x1_acc;
  y1_acc = (int*) malloc(n*sizeof(int));
  x1_acc = (int*) malloc(n*sizeof(int));
#pragma acc enter data create(y1_acc[:n], x1_acc[:n])

   for(int i=0; i<n; ++i) {
      y1_acc[i] = i;
      x1_acc[i] = i;
   }
#pragma acc update device(y1_acc[:n], x1_acc[:n])
  for (int ntimes=0; ntimes < 100; ++ntimes) {
      test1(n,y1_acc,x1_acc);
  }
#pragma acc update self(y1_acc[:n])
  for(int i=5; i<10; ++i) {
     printf("%d:%d\n",i,y1_acc[i]);
  }

#pragma acc exit data delete(y1_acc,x1_acc)
  free(x1_acc);
  free(y1_acc);
  exit(0);
}

Here, I’m managing the data at the same points where it’s allocated and freed, and only updating when needed. Setting “NV_ACC_TIME=1” in my environment to get a basic profile, we see the data is only copied 3 times: x and y to the GPU, and then y back to the host.

% nvc test.c -acc -Minfo=accel -V23.3; a.out
test1:
      5, Generating copy(y1_acc[:n]) [if not already present]
         Generating copyin(x1_acc[:n]) [if not already present]
         Generating implicit firstprivate(n,i)
         Generating NVIDIA GPU code
          7, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
main:
     18, Generating enter data create(y1_acc[:n],x1_acc[:n])
     24, Generating update device(x1_acc[:n],y1_acc[:n])
     28, Generating update self(y1_acc[:n])
     35, Generating exit data delete(y1_acc[:1],x1_acc[:1])
5:505
6:606
7:707
8:808
9:909

Accelerator Kernel Timing data
/home/mcolgrove/tmp/test.c
  test1  NVIDIA  devicenum=0
    time(us): 205
    5: compute region reached 100 times
        5: kernel launched 100 times
            grid: [4]  block: [128]
             device time(us): total=205 max=4 min=2 avg=2
            elapsed time(us): total=2,086 max=556 min=14 avg=20
    5: data region reached 200 times
/home/mcolgrove/tmp/test.c
  main  NVIDIA  devicenum=0
    time(us): 27
    18: data region reached 1 time
    24: update directive reached 1 time
        24: data copyin transfers: 2
             device time(us): total=11 max=7 min=4 avg=5
    28: update directive reached 1 time
        28: data copyout transfers: 1
             device time(us): total=16 max=16 min=16 avg=16
    35: data region reached 1 time

Now if I comment out the acc pragmas in main so the data movement occurs at the point of the kernel, then it’s doing 300 copies, 3 for each kernel call:

% nvc test.c -acc -Minfo=accel -V23.3 ; a.out
test1:
      5, Generating copy(y1_acc[:n]) [if not already present]
         Generating copyin(x1_acc[:n]) [if not already present]
         Generating implicit firstprivate(n,i)
         Generating NVIDIA GPU code
          7, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
5:505
6:606
7:707
8:808
9:909

Accelerator Kernel Timing data
/home/mcolgrove/tmp/test.c
  test1  NVIDIA  devicenum=0
    time(us): 1,952
    5: compute region reached 100 times
        5: kernel launched 100 times
            grid: [4]  block: [128]
             device time(us): total=208 max=3 min=2 avg=2
            elapsed time(us): total=2,165 max=531 min=16 avg=21
    5: data region reached 200 times
        5: data copyin transfers: 200
             device time(us): total=838 max=8 min=3 avg=4
        9: data copyout transfers: 100
             device time(us): total=906 max=16 min=8 avg=9

Hi Matt,

once again, a very big thank you from my side for your detailed answers. The hint that in C code, in contrast to Fortran, the dimensions to copy have to be specified was very helpful. Of course this makes sense!

I have one additional question in this regard. Here is the corresponding test code:

#include <stdio.h>
#include <stdlib.h>  /* malloc, free */

int main(void)
{

    const int n = 10000000;
    float *x, *y;
    float a = 2.0;
    float avg;
    int i;

    x = (float*) malloc(n * sizeof(float));
    y = (float*) malloc(n * sizeof(float));

    for (i = 0; i < n; i++)
    {
       x[i]  = 2.0;
       y[i]  = 1.0;
    }

#pragma acc parallel loop
//#pragma acc parallel loop copy(y[0:n]) copyin(x[0:n])
#pragma omp target teams distribute parallel for
//#pragma omp target teams distribute parallel for map(tofrom: y[0:n]) map(to: x[0:n])
    for (i = 0; i < n; i++)
       y[i] = a * x[i] + y[i];

    avg = 0.0;
    for (i = 0; i < n; i++)
       avg += y[i];
    avg = avg/n;
    fprintf(stderr, "CHECKSUM: %f\n", avg);

    free(x);
    free(y);
}

I compile this code with the following commands:

nvc++ -acc=gpu -Minfo -fast test.cu -gpu=nomanaged -o saxpy_noum_acc
nvc++ -mp=gpu  -Minfo -fast test.cu -gpu=nomanaged -o saxpy_noum_mp

The OpenACC code runs properly, but the OpenMP code aborts with the message

Fatal error: expression 'HX_CU_CALL_CHECK(p_cuStreamSynchronize(stream[dev]))' (value 1) is not equal to expression 'HX_SUCCESS' (value 0)

While the compiler diagnostic messages report an implicit data copy for the OpenACC compilation

     25, Generating implicit copy(y[:10000000]) [if not already present]
         Generating implicit copyin(x[:10000000]) [if not already present]

there is no corresponding message for the OpenMP compilation.

When replacing the #pragma lines by the commented-out lines with explicit data copy clauses, both the OpenACC and the OpenMP code work properly.

Is it expected behaviour that OpenMP does not generate implicit copy operations here? (When compiling with the UM setup, the code works properly.)

Thanks a lot and regards,

Rene’

Correct. OpenACC will do implicit data movement when the size of the array is known. With OpenMP this needs to be done explicitly.
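
For reference, the explicit mapping can go either directly on the construct (the map clauses you have commented out) or into unstructured target data directives, mirroring the OpenACC enter/exit data pattern. A minimal sketch, assuming -mp=gpu -gpu=nomanaged and the variables from your example:

#pragma omp target enter data map(to: x[0:n], y[0:n])   /* copy both arrays to the device */

#pragma omp target teams distribute parallel for
    for (i = 0; i < n; i++)
       y[i] = a * x[i] + y[i];

#pragma omp target exit data map(from: y[0:n]) map(delete: x[0:n])   /* copy y back, drop x */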

Hi Matt, once again thanks for the quick answer. I think that, at least for the moment, I have no more questions. My test codes are running properly now with the different offload models and different kinds of memory usage (stack, heap, UM).
Thanks for all your advice, I have learned a lot. If you like you can close this ticket.

