The OpenACC quick reference guide says the following about async on the kernels construct:
"The kernels region executes asynchronously with the host."
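To make sure we are talking about the same thing, this is how I read that sentence, as a minimal self-contained sketch (saxpy_async, the loop body and the placement of the explicit wait are only for illustration and are not part of the code below):

/* Sketch of my understanding of the async clause: the kernels region is
 * launched asynchronously, the host continues immediately, and an explicit
 * wait is needed before the device results are used on the host. */
void saxpy_async(int n, float a, float *restrict x, float *restrict y)
{
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        /* Launch the kernels region asynchronously; the host thread returns
           from the construct without waiting for the device. */
        #pragma acc kernels loop independent async
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];

        /* ... unrelated host work could overlap with the device here ... */

        /* Block until the asynchronous region has finished, so that the data
           region can copy y back to the host safely. */
        #pragma acc wait
    }
}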
Asynchronous execution seems to work with PGI 13.6 (-ta=nvidia,cuda5.0,sm35), and I can measure the execution time of the kernels region. But it does not work with 13.9 or 14.1 using the -ta=nvidia,cuda5.5,sm35 options:
inline void adi_acc(float lambda, float *u, float *du, float *ax, float *bx, float *cx, int nx, int ny, int nz, ...) {
    int i, j, k;
    int n = nx*ny*nz;

    #pragma acc data present(u[n], du[n], ax[n], bx[n], cx[n])
    {
        #pragma acc kernels loop collapse(3) independent async
        for (k = 0; k < NZ; k++) {
            for (j = 0; j < NY; j++) {
                for (i = 0; i < NX; i++) {
                    ...
                }
            }
        }

        cuda_library_wrapper(u, du, ax, bx, cx);

        #pragma acc kernels loop collapse(2) independent private(aa, bb, cc, dd, base, ind, c2, d2) async
        for (k = 0; k < nz; k++) {
            for (i = 0; i < NX; i++) {
                ...
            }
        }
    }
}
int main(int argc, char* argv[]) {
    ...
    float *h_u, *h_du, *h_ax, *h_bx, *h_cx;
    int n = NX*NY*NZ;

    h_u  = (float *)malloc(sizeof(float)*n);
    h_du = (float *)malloc(sizeof(float)*n);
    h_ax = (float *)malloc(sizeof(float)*n);
    h_bx = (float *)malloc(sizeof(float)*n);
    h_cx = (float *)malloc(sizeof(float)*n);
    ...
    acc_init(acc_device_nvidia);
    acc_set_device_num(0, acc_device_nvidia);
    ...
    #pragma acc data pcopy(h_u[n]) create(h_du[n], h_ax[n], h_bx[n], h_cx[n])
    {
        elapsed_time(&timer2);
        for (it = 0; it < iter; it++) {
            adi_acc(lambda, h_u, h_du, h_ax, h_bx, h_cx, nx, ny, nz, ...);
            elapsed_total = elapsed_time(&timer2);
        }

        free(h_u);
        free(h_du);
        free(h_ax);
        free(h_bx);
        free(h_cx);

        acc_shutdown(acc_device_nvidia);
        exit(0);
    }
}
The error I get is:
Segmentation fault (core dumped)
Running the binary under cuda-memcheck gives:
========= CUDA-MEMCHECK
Grid dimensions: 256 x 256 x 256
========= Program hit error 201 on CUDA API call to cuCtxAttach
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/libcuda.so (cuCtxAttach + 0x182) [0xd9af2]
========= Host Frame:./adi_acc_libtrid [0x1a615]
========= Error: process didn't terminate successfully
========= Internal error (20)
========= No CUDA-MEMCHECK results found
Configuration: Red Hat 6, NVIDIA K20c.
The question is:
Why might an async "kernel call" result in a segmentation fault when the successive kernels actually run in order in the same CUDA stream?
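In case it matters, this is roughly the kind of host-side check I have in mind for the queue-to-stream mapping (a sketch only; acc_get_cuda_stream is from the OpenACC/CUDA interoperability API as declared in openacc.h, and the queue number 1 is just an example, not necessarily the queue the compiler picks for a bare async clause):

#include <stdio.h>
#include <openacc.h>

/* Print the CUDA stream handles behind two OpenACC activity queues so they
 * can be compared with what the external CUDA library uses. */
void print_acc_streams(void)
{
    void *q0_stream = acc_get_cuda_stream(0);   /* async queue 0 */
    void *q1_stream = acc_get_cuda_stream(1);   /* example async queue 1 */

    printf("queue 0 -> CUDA stream %p\n", q0_stream);
    printf("queue 1 -> CUDA stream %p\n", q1_stream);
}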
Any help is much appreciated!