OpenACC create and pcreate clauses need host allocation?

Endre · February 2, 2014, 5:17pm

Regarding the create clause in the data construct the OpenACC v2.0 specification says:

…On a data construct or compute construct, the data is allocated in device memory upon entry to the region, and deallocated upon exit from the region. …

This suggests that it should be enough to declare a pointer on the host and call the create clause on it to allocate memory on the accelerator. But, this clause only seem to work if the data on the host side is also allocated, regardless of the fact that the size of allocation is specified in the squared brackets. The code looks the following:

inline void adi_acc(float lambda, float *u, float *du, float *ax, float *bx, float *cx, int nx, int ny, int nz ...) {
  int    i, j, k;
  int n = nx*ny*nz;

  #pragma acc data present(u[n], du[n], ax[n], bx[n], cx[n])
  {
    #pragma acc kernels loop collapse(3) independent
    for(k=0; k<NZ; k++) {
      for(j=0; j<NY; j++) {
        for(i=0; i<NX; i++) {
           ...     
        }
      }
    }
    cuda_library_wrapper(u,du,ax,bx,cx);
  }
}

int main(int argc, char* argv[]) { 
  ...
  float *h_u, *h_du, *h_ax, *h_bx, *h_cx;
  int n = NX*NY*NZ;

  h_u  = (float *)malloc(sizeof(float)*n);
  //h_du = (float *)malloc(sizeof(float)*n);
  //h_ax = (float *)malloc(sizeof(float)*n);
  //h_bx = (float *)malloc(sizeof(float)*n);
  //h_cx = (float *)malloc(sizeof(float)*n);
  ...
  acc_init(acc_device_nvidia);
  acc_set_device_num(0,acc_device_nvidia);
  ...
  #pragma acc data pcopy(h_u[n]) create(h_du[n], h_ax[n], h_bx[n], h_cx[n])
  {
    elapsed_time(&timer2);
    for(it=0; it<iter; it++) {
      adi_acc(lambda, h_u, h_du, h_ax, h_bx, h_cx, nx, ny, nz, ...);
    elapsed_total = elapsed_time(&timer2);
  }  
  free(h_u);
  free(h_du);
  free(h_ax);
  free(h_bx);
  free(h_cx);
  acc_shutdown(acc_device_nvidia);
  exit(0);
}

So, when the host memory allocation lines are put into comments, I get the following error message:

FATAL ERROR: variable data clause is partially present on the GPU: name=h_ax
file:/home/endre/workspace/adi/src/adi_acc_libtrid.c main line:285

where the line 285 is the one with the create() clause:

#pragma acc data pcopy(h_u[n]) create(h_du[n], h_ax[n], h_bx[n], h_cx[n])

When I run the code with cuda-memcheck, It also reports error concerning CUDA Context. Although this error seems to be regardless of the create() clause:

========= Program hit error 201 on CUDA API call to cuCtxAttach
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/libcuda.so (cuCtxAttach + 0x182) [0xd9af2]
========= Host Frame:./adi_acc_libtrid [0x1a755]

FATAL ERROR: variable data clause is partially present on the GPU: name=h_ax
file:/home/endre/workspace/adi/src/adi_acc_libtrid.c main line:285
========= ERROR SUMMARY: 1 error

Note: I link against an nvcc-compiled wrapper function.

Used config: Red Hat v6, PGI 14.1, NVIDIA K20c

The question is:

Is it really needed to allocate memory on the host as well?

Any help is much appreciated!

MatColgrove · February 3, 2014, 7:06pm

Hi endre,

Is it really needed to allocate memory on the host as well?

With a “create” clause, you do still need an allocated host copy. The reason being that you can use the “update” directive on the device array and thus need an associated host array.

What you want instead is to use “acc_malloc” to allocate the arrays only on the device and then use a “deviceptr” clause instead of “create” and "present to indicate that these are device pointers.

Also, since you’re passing “h_u” to a CUDA wrapper, you might want to use the “host_data” directive so that h_u’s device pointer is passed into “adi_acc”. Something like the following:

inline void adi_acc(float lambda, float *u, float *du, float *ax, float *bx, float *cx, int nx, int ny, int nz ...) { 
  int    i, j, k; 
  int n = nx*ny*nz; 

  #pragma acc data deviceptr(u[n], du[n], ax[n], bx[n], cx[n]) 
  { 
    #pragma acc kernels loop collapse(3) independent 
    for(k=0; k<NZ; k++) { 
      for(j=0; j<NY; j++) { 
        for(i=0; i<NX; i++) { 
           ...      
        } 
      } 
    } 
    cuda_library_wrapper(u,du,ax,bx,cx); 
  } 
} 

int main(int argc, char* argv[]) { 
  ... 
  float *h_u, *h_du, *h_ax, *h_bx, *h_cx; 
  int n = NX*NY*NZ; 

  h_u  = (float *)malloc(sizeof(float)*n); 
  h_du = (float *)acc_malloc(sizeof(float)*n); 
  h_ax = (float *)acc_malloc(sizeof(float)*n); 
  h_bx = (float *)acc_malloc(sizeof(float)*n); 
  h_cx = (float *)acc_malloc(sizeof(float)*n); 
  ... 
  acc_init(acc_device_nvidia); 
  acc_set_device_num(0,acc_device_nvidia); 
  ... 
  #pragma acc data pcopy(h_u[n]) deviceptr(h_du[n], h_ax[n], h_bx[n], h_cx[n]) 
  { 
   #pragma acc host_data use_device(h_u) 
   {
    elapsed_time(&timer2); 
    for(it=0; it<iter; it++) { 
      adi_acc(lambda, h_u, h_du, h_ax, h_bx, h_cx, nx, ny, nz, ...); 
    elapsed_total = elapsed_time(&timer2); 
  } 
  } 
  free(h_u); 
  acc_free(h_du); 
  acc_free(h_ax); 
  acc_free(h_bx); 
  acc_free(h_cx);

Hope this helps,
Mat

Endre · February 4, 2014, 3:21pm

Hi Mat,

Thanks for the help. It works now. Although I would make a correction: deviceptr() only needs/accepts pointers without the size specifcation with squared brackets.

I’m still facing the issue of “Program hit error 201 on CUDA API call to cuCtxAttach”. It might only be an issue with cuda-memcheck and this error might not even occur in the background when I’m running my code without memcheck. Is it possible that this issue has something to do with PGI and cuda-memcheck incompatibility? I tried to compile my code by commenting every accelerator execution parts (to avoid any memory conflict) and also tried to run a very simple OpenACC code with cuda-memcheck, and both gave the same error.

Regards,
Endre

MatColgrove · February 4, 2014, 4:50pm

Hi endre,

The Error “201” is expected. The OpenACC runtime is testing if a context has already been created by calling cuCtxAttach. If the call fails, a new context will be created.

Mat

Topic		Replies	Views
How to create data on device Legacy PGI Compilers	2	4013	December 25, 2014
Performance of OpenACC data create Legacy PGI Compilers	8	6452	June 24, 2013
Avoid reallocating memory on the GPU. Legacy PGI Compilers	11	5745	January 28, 2013
Acc_malloc() in Fortran to avoid host allocation nvc, nvc++ and nvfortran	4	1044	August 8, 2024
OpenAcc not allocating memory on GPU Legacy PGI Compilers	3	2803	August 1, 2012
OpenACC C/CUDA error Legacy PGI Compilers	2	5007	October 30, 2013
Strange run-time error: variable data clause is partially... Legacy PGI Compilers	2	6899	August 7, 2012
Where to use this OpenACC pragma? Legacy PGI Compilers	1	960	March 29, 2022
Simple (?) question on OpenACC host_data usage Legacy PGI Compilers	1	2914	January 7, 2014
Openacc fortran pointer multi-dimension array Legacy PGI Compilers	2	791	June 9, 2023

OpenACC create and pcreate clauses need host allocation?

========= Program hit error 201 on CUDA API call to cuCtxAttach ========= Saved host backtrace up to driver entry point at error ========= Host Frame:/usr/lib64/libcuda.so (cuCtxAttach + 0x182) [0xd9af2] ========= Host Frame:./adi_acc_libtrid [0x1a755]

Related topics

========= Program hit error 201 on CUDA API call to cuCtxAttach
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/libcuda.so (cuCtxAttach + 0x182) [0xd9af2]
========= Host Frame:./adi_acc_libtrid [0x1a755]