Dear NVIDIA-HPC team,
the following test code (provided in C and Fortran) shows an unexpected behaviour when compiled with option -stdpar=gpu.
The OpenACC directive specifies array x as present, but the array is not copied to the device beforehand. Therefore the code should abort.
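#include <stdio.h>
#include <stdlib.h>        /* malloc, free */
#include <cuda_runtime.h>  /* cudaGetLastError, cudaGetErrorString */

int main(void)
{
    const int n = 1000000;
#ifndef ALLOC
    float x[n];            /* automatic array */
#else
    float* x;              /* manually allocated array */
#endif
    float avg_i;
    int i;

#ifdef ALLOC
    x = (float*) malloc(n * sizeof(float));
#endif
    for (i = 0; i < n; i++)
        x[i] = 1.0;

    /* average on the host before the device loop */
    avg_i = 0.0;
    for (i = 0; i < n; i++)
        avg_i += x[i];
    avg_i /= n;
    printf("avg(x): %f\n", avg_i);

    /* x is declared present, but it was never copied to the device */
    #pragma acc parallel loop present(x)
    for (i = 0; i < n; i++)
        x[i] = x[i] + 1.0;
    printf("cudaError: %s\n", cudaGetErrorString(cudaGetLastError()));

    /* average on the host after the device loop */
    avg_i = 0.0;
    for (i = 0; i < n; i++)
        avg_i += x[i];
    avg_i /= n;
    printf("avg(x): %f\n", avg_i);

#ifdef ALLOC
    free(x);
#endif
}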
The code behaves correctly when compiled with the following options:
nvc++ -cpp -acc=gpu -Minfo=all -fast test_simple.cu -o test_c_simple
It aborts (as expected, since a present clause is used without a previous data copy) with the following error message:
FATAL ERROR: data in PRESENT clause was not found on device 1: name=x[:] host:0x2afdf9389010 file:/home/rputtin/test/OpenACC/saxpy/test_simple.cu main line:29
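For comparison, the abort would not be expected if x were placed on the device before the loop, for example with an enclosing data region. The following is only a minimal sketch of that variant and is not part of the test code; the function names add_one_present and average are hypothetical. When compiled without OpenACC support, the pragmas are ignored and the loop simply runs on the host.

```c
/* Hypothetical helper (not in the original test): increments every element
 * of x inside a data region, so the present clause on the loop is satisfied. */
void add_one_present(float *x, int n)
{
    int i;
    /* copy: allocate x on the device, copy in at entry, copy out at exit */
    #pragma acc data copy(x[0:n])
    {
        /* x was made present by the enclosing data region */
        #pragma acc parallel loop present(x[0:n])
        for (i = 0; i < n; i++)
            x[i] = x[i] + 1.0f;
    }
}

/* Host-side average, computed the same way as in the test code. */
float average(const float *x, int n)
{
    float sum = 0.0f;
    int i;
    for (i = 0; i < n; i++)
        sum += x[i];
    return sum / n;
}
```

In this variant the data movement is requested explicitly by the copy clause, which is the behaviour the present clause in the test code is meant to rely on.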
In order to compile the code with manually allocated arrays, the preprocessor macro ALLOC can be used:
nvc++ -DALLOC -cpp -acc=gpu -Minfo=all -fast test_simple.cu -o test_c_simple
With these options the code still behaves correctly (and aborts).
But when compiler option -stdpar=gpu is added, the code finishes without aborting:
nvc++ -DALLOC -cpp -stdpar=gpu -acc=gpu -Minfo=all -fast test_simple.cu -o test_c_simple
The code finishes properly and produces correct results. This seems to imply that the array x is copied to and from the device although the present clause is set.
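One way to check this assumption would be to ask the OpenACC runtime whether x is present right before the parallel loop. The following is a minimal sketch, assuming the standard OpenACC routine acc_is_present from openacc.h; the wrapper name report_present is hypothetical, and it returns -1 when the code is built without OpenACC.

```c
#include <stdio.h>
#include <stddef.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical diagnostic: returns nonzero if the runtime reports the
 * buffer as present on the current device, 0 if not, and -1 if the code
 * was compiled without OpenACC (no runtime to query). */
int report_present(const char *name, void *ptr, size_t bytes)
{
#ifdef _OPENACC
    int present = acc_is_present(ptr, bytes);
#else
    int present = -1;
    (void)ptr;
    (void)bytes;
#endif
    printf("%s present on device: %d\n", name, present);
    return present;
}
```

Calling report_present("x", x, n * sizeof(float)) just before the parallel loop in each of the three builds would show whether the runtime considers x present when -stdpar=gpu is added.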
The same behaviour can also be observed with the corresponding Fortran code:
program main
  use cudafor
  implicit none
  integer, parameter :: n = 1000000
#ifndef ALLOC
  real :: x(n)
#else
  real, allocatable :: x(:)
#endif
  integer :: i

#ifdef ALLOC
  allocate(x(n))
#endif
  do i = 1, n
    x(i) = 1.0
  end do
  print *, "avg(x):", sum(x)/n

  !$acc parallel loop present(x)
  do i = 1, n
    x(i) = x(i) + 1.0
  end do
  !$acc end parallel loop
  print *, "cudaError:", cudaGetErrorString(cudaGetLastError())
  print *, "avg(x):", sum(x)/n

#ifdef ALLOC
  deallocate(x)
#endif
end program
This simplified test code is an excerpt from a more complex test code, which measures the times for computation and for data copy to/from the device separately.
In the more complex test code one can explicitly observe the timings for copying to/from the device (triggered by OpenACC/OpenMP directives). But when manually allocated arrays are used together with option -stdpar=gpu, the explicit copying times disappear and the timings of the parallel loops increase by an equivalent amount.
So, in summary, it seems that compiler option -stdpar=gpu disables manual data handling for manually allocated arrays and instead generates implicit data copy operations, even when they are not necessary.
Is there a reason for this behaviour? Is this intended, or is it a compiler problem?
Thanks a lot for your support and regards,
Rene’