OpenACC create and pcreate clauses need host allocation?

Regarding the create clause in the data construct the OpenACC v2.0 specification says:

…On a data construct or compute construct, the data is allocated in device memory upon entry to the region, and deallocated upon exit from the region. …

This suggests that it should be enough to declare a pointer on the host and call the create clause on it to allocate memory on the accelerator. But, this clause only seem to work if the data on the host side is also allocated, regardless of the fact that the size of allocation is specified in the squared brackets. The code looks the following:

inline void adi_acc(float lambda, float *u, float *du, float *ax, float *bx, float *cx, int nx, int ny, int nz ...) {
  int    i, j, k;
  int n = nx*ny*nz;

  #pragma acc data present(u[n], du[n], ax[n], bx[n], cx[n])
  {
    #pragma acc kernels loop collapse(3) independent
    for(k=0; k<NZ; k++) {
      for(j=0; j<NY; j++) {
        for(i=0; i<NX; i++) {
           ...     
        }
      }
    }
    cuda_library_wrapper(u,du,ax,bx,cx);
  }
}

int main(int argc, char* argv[]) { 
  ...
  float *h_u, *h_du, *h_ax, *h_bx, *h_cx;
  int n = NX*NY*NZ;

  h_u  = (float *)malloc(sizeof(float)*n);
  //h_du = (float *)malloc(sizeof(float)*n);
  //h_ax = (float *)malloc(sizeof(float)*n);
  //h_bx = (float *)malloc(sizeof(float)*n);
  //h_cx = (float *)malloc(sizeof(float)*n);
  ...
  acc_init(acc_device_nvidia);
  acc_set_device_num(0,acc_device_nvidia);
  ...
  #pragma acc data pcopy(h_u[n]) create(h_du[n], h_ax[n], h_bx[n], h_cx[n])
  {
    elapsed_time(&timer2);
    for(it=0; it<iter; it++) {
      adi_acc(lambda, h_u, h_du, h_ax, h_bx, h_cx, nx, ny, nz, ...);
    elapsed_total = elapsed_time(&timer2);
  }  
  free(h_u);
  free(h_du);
  free(h_ax);
  free(h_bx);
  free(h_cx);
  acc_shutdown(acc_device_nvidia);
  exit(0);
}

So, when the host memory allocation lines are put into comments, I get the following error message:


FATAL ERROR: variable data clause is partially present on the GPU: name=h_ax
file:/home/endre/workspace/adi/src/adi_acc_libtrid.c main line:285


where the line 285 is the one with the create() clause:

#pragma acc data pcopy(h_u[n]) create(h_du[n], h_ax[n], h_bx[n], h_cx[n])

When I run the code with cuda-memcheck, It also reports error concerning CUDA Context. Although this error seems to be regardless of the create() clause:

========= Program hit error 201 on CUDA API call to cuCtxAttach
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/libcuda.so (cuCtxAttach + 0x182) [0xd9af2]
========= Host Frame:./adi_acc_libtrid [0x1a755]

FATAL ERROR: variable data clause is partially present on the GPU: name=h_ax
file:/home/endre/workspace/adi/src/adi_acc_libtrid.c main line:285
========= ERROR SUMMARY: 1 error

Note: I link against an nvcc-compiled wrapper function.

Used config: Red Hat v6, PGI 14.1, NVIDIA K20c

The question is:

Is it really needed to allocate memory on the host as well?

Any help is much appreciated!

Hi endre,

Is it really needed to allocate memory on the host as well?

With a “create” clause, you do still need an allocated host copy. The reason being that you can use the “update” directive on the device array and thus need an associated host array.

What you want instead is to use “acc_malloc” to allocate the arrays only on the device and then use a “deviceptr” clause instead of “create” and "present to indicate that these are device pointers.

Also, since you’re passing “h_u” to a CUDA wrapper, you might want to use the “host_data” directive so that h_u’s device pointer is passed into “adi_acc”. Something like the following:

inline void adi_acc(float lambda, float *u, float *du, float *ax, float *bx, float *cx, int nx, int ny, int nz ...) { 
  int    i, j, k; 
  int n = nx*ny*nz; 

  #pragma acc data deviceptr(u[n], du[n], ax[n], bx[n], cx[n]) 
  { 
    #pragma acc kernels loop collapse(3) independent 
    for(k=0; k<NZ; k++) { 
      for(j=0; j<NY; j++) { 
        for(i=0; i<NX; i++) { 
           ...      
        } 
      } 
    } 
    cuda_library_wrapper(u,du,ax,bx,cx); 
  } 
} 

int main(int argc, char* argv[]) { 
  ... 
  float *h_u, *h_du, *h_ax, *h_bx, *h_cx; 
  int n = NX*NY*NZ; 

  h_u  = (float *)malloc(sizeof(float)*n); 
  h_du = (float *)acc_malloc(sizeof(float)*n); 
  h_ax = (float *)acc_malloc(sizeof(float)*n); 
  h_bx = (float *)acc_malloc(sizeof(float)*n); 
  h_cx = (float *)acc_malloc(sizeof(float)*n); 
  ... 
  acc_init(acc_device_nvidia); 
  acc_set_device_num(0,acc_device_nvidia); 
  ... 
  #pragma acc data pcopy(h_u[n]) deviceptr(h_du[n], h_ax[n], h_bx[n], h_cx[n]) 
  { 
   #pragma acc host_data use_device(h_u) 
   {
    elapsed_time(&timer2); 
    for(it=0; it<iter; it++) { 
      adi_acc(lambda, h_u, h_du, h_ax, h_bx, h_cx, nx, ny, nz, ...); 
    elapsed_total = elapsed_time(&timer2); 
  } 
  } 
  free(h_u); 
  acc_free(h_du); 
  acc_free(h_ax); 
  acc_free(h_bx); 
  acc_free(h_cx);

Hope this helps,
Mat

Hi Mat,

Thanks for the help. It works now. Although I would make a correction: deviceptr() only needs/accepts pointers without the size specifcation with squared brackets.

I’m still facing the issue of “Program hit error 201 on CUDA API call to cuCtxAttach”. It might only be an issue with cuda-memcheck and this error might not even occur in the background when I’m running my code without memcheck. Is it possible that this issue has something to do with PGI and cuda-memcheck incompatibility? I tried to compile my code by commenting every accelerator execution parts (to avoid any memory conflict) and also tried to run a very simple OpenACC code with cuda-memcheck, and both gave the same error.

Regards,
Endre

Hi endre,

The Error “201” is expected. The OpenACC runtime is testing if a context has already been created by calling cuCtxAttach. If the call fails, a new context will be created.

  • Mat