Oddity in OpenACC

I would be grateful if anyone has any clues about this, particularly whether it is an install problem or my poor OpenACC code.

Here’s the OpenACC bit:

#pragma acc data copy(arrC)
#pragma acc kernels
  for(j=0;j<sz;j++){
    for (i=0;i<sz;i++){
      arrC[j][i] = arrA[j][i]*alpha + arrB[j][i];
    }
  }

arrA, arrB, and arrC are all sz * sz arrays, where sz = 10, so nothing huge.

The compiler generates what I’d expect:

pgcc  -o basic basic.c -Minfo=accel,time  -acc -ta=nvidia
main:
     35, Generating copy(arrC[:][:])
     36, Generating copyin(arrA[:10][:10])
         Generating copyin(arrB[:10][:10])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     37, Loop is parallelizable
     38, Loop is parallelizable
         Accelerator kernel generated
         37, #pragma acc loop gang, vector(3) /* blockIdx.y threadIdx.y */
         38, #pragma acc loop gang, vector(10) /* blockIdx.x threadIdx.x */
             CC 1.3 : 13 registers; 116 shared, 12 constant, 0 local memory bytes; 25% occupancy
             CC 2.0 : 15 registers; 8 shared, 124 constant, 0 local memory bytes; 16% occupancy
  Timing stats:
    init                    50 millisecs    74%
    expand                  17 millisecs    25%
    Total time              67 millisecs

The vector sizes are quite short but this is a toy example so no problems there.
However, when I run the code I get:

./basic 
call to EventSynchronize returned error 700: Launch failed
CUDA driver version: 4010

After a number of attempts, I sometimes get it to run, but the output array (arrC) is unchanged (I set it to 0 before the accelerator region).

Interestingly, unsetting PGI_ACC_TIME changes this error to:

call to cuMemFree returned error 700: Launch failed
CUDA driver version: 4010

A few more attempts do eventually get it to run, but still with bad output.

-Nick.

I should have added to the previous post that I’m compiling with the 12.3 compiler.

Hi Nickaj,

In OpenACC, arrays are expected to be contiguous, so if arrC is a pointer to a pointer, that would cause your program to abort abnormally.
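
For illustration, here is a sketch of the two layouts (made-up names, not taken from your code):

#include <stdlib.h>

int main(void)
{
    int j, sz = 10;

    /* Contiguous: one block of sz*sz doubles. This is safe for OpenACC
       data clauses such as copy(flat[0:sz*sz]). */
    double *flat = (double *) malloc(sizeof(double) * sz * sz);
    flat[3*sz + 4] = 0.0;    /* element (3,4) of the logical 2-D array */

    /* NOT contiguous: an array of row pointers. Each row is a separate
       allocation, so the device cannot treat it as one 2-D block. */
    double **rows = (double **) malloc(sizeof(double *) * sz);
    for (j = 0; j < sz; j++)
        rows[j] = (double *) malloc(sizeof(double) * sz);

    for (j = 0; j < sz; j++) free(rows[j]);
    free(rows);
    free(flat);
    return 0;
}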

Can you post a full reproducing example which also includes how your arrays are declared?

Thanks,
Mat

Hi Nickaj,

FYI, we got another report of a similar issue, and this one does look like a compiler error in the beta OpenACC support. Still, would you mind posting a bit more of your code so I can confirm that it’s the same issue?

Thanks,
Mat

Here’s my complete code:

#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>


int main(int argc, char *argv[])
{

  int i = 0;
  int j = 0;
  int sz = 10;

  double arrA[sz][sz];
  double arrB[sz][sz];
  double arrC[sz][sz];
  for(j=0;j<sz;j++){
    for (i=0;i<sz;i++){
      arrA[j][i] = 1;
      arrB[j][i] = 2;
      arrC[j][i] = 0;
    }
  }

  double alpha = 0.5;

#pragma acc data copy(arrC[:10][:10])
#pragma acc kernels
  for(j=0;j<sz;j++){
    for (i=0;i<sz;i++){
      arrC[j][i] = arrA[j][i]*alpha + arrB[j][i];
    }
  }

  for(j=0;j<2;j++){
    for(i=0;i<10;i++){
      printf("arrC[%d][%d] = %lf\n", j, i, arrC[j][i]);
    }
  }

  return 0;
}

Yep, same problem. The compiler is tripping over the use of VLAs (variable-length arrays). Changing them to fixed-size or malloc’d arrays will work around the problem; a sketch of the malloc’d variant is below, after the fixed-size test.

Since this just came in last Friday, I’m not sure we’ll get it fixed by the 12.4 release, but we’ll try.

  • Mat
% cat test.c 

#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

#define N 10

int main(int argc, char *argv[])
{

  int i = 0;
  int j = 0;
  int sz = N;

  double arrA[N][N];
  double arrB[N][N];
  double arrC[N][N];
  for(j=0;j<sz;j++){
    for (i=0;i<sz;i++){
      arrA[j][i] = 1;
      arrB[j][i] = 2;
      arrC[j][i] = 0;
    }
  }

  double alpha = 0.5;

#pragma acc data copy(arrC[:10][:10])
#pragma acc kernels
  for(j=0;j<sz;j++){
    for (i=0;i<sz;i++){
      arrC[j][i] = arrA[j][i]*alpha + arrB[j][i];
    }
  }

  for(j=0;j<2;j++){
    for(i=0;i<10;i++){
      printf("arrC[%d][%d] = %lf\n", j, i, arrC[j][i]);
    }
  }

  return 0;
} 
% pgcc -acc test.c -V12.3
% a.out
arrC[0][0] = 2.500000
arrC[0][1] = 2.500000
arrC[0][2] = 2.500000
arrC[0][3] = 2.500000
arrC[0][4] = 2.500000
arrC[0][5] = 2.500000
arrC[0][6] = 2.500000
arrC[0][7] = 2.500000
arrC[0][8] = 2.500000
arrC[0][9] = 2.500000
arrC[1][0] = 2.500000
arrC[1][1] = 2.500000
arrC[1][2] = 2.500000
arrC[1][3] = 2.500000
arrC[1][4] = 2.500000
arrC[1][5] = 2.500000
arrC[1][6] = 2.500000
arrC[1][7] = 2.500000
arrC[1][8] = 2.500000
arrC[1][9] = 2.500000
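
And here is a sketch of the malloc’d variant (untested against this exact compiler build; it uses one flat, contiguous allocation per array and indexes it by hand):

#include <stdio.h>
#include <stdlib.h>

#define N 10

int main(int argc, char *argv[])
{
    int i, j, sz = N;

    /* Each array is a single contiguous block, which OpenACC can copy. */
    double *arrA = (double *) malloc(sizeof(double) * sz * sz);
    double *arrB = (double *) malloc(sizeof(double) * sz * sz);
    double *arrC = (double *) malloc(sizeof(double) * sz * sz);
    double alpha = 0.5;

    for (j = 0; j < sz; j++) {
        for (i = 0; i < sz; i++) {
            arrA[j*sz + i] = 1;
            arrB[j*sz + i] = 2;
            arrC[j*sz + i] = 0;
        }
    }

#pragma acc data copy(arrC[0:sz*sz]) copyin(arrA[0:sz*sz], arrB[0:sz*sz])
#pragma acc kernels
    for (j = 0; j < sz; j++) {
        for (i = 0; i < sz; i++) {
            arrC[j*sz + i] = arrA[j*sz + i]*alpha + arrB[j*sz + i];
        }
    }

    for (j = 0; j < 2; j++) {
        for (i = 0; i < sz; i++) {
            printf("arrC[%d][%d] = %lf\n", j, i, arrC[j*sz + i]);
        }
    }

    free(arrA); free(arrB); free(arrC);
    return 0;
}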

It seems I encountered the same problem: I got the following error:

call to cuMemFree returned error 700: Launch failed

It occurs when I use the “async(1)” clause (even if I put a “!$acc wait(1)” directive just after every parallel region). And yes, I use arrays with the ALLOCATABLE attribute.

Hi Maxim,

nickaj’s problem has to do with C99 VLAs, so your issue is most likely unrelated. An error 700 “Launch failed” is a generic error meaning that your kernel failed for some reason. Can you please give more detail about the error, as well as a reproducing example?

Thanks,
Mat

Hi Mat,

I am still trying to nail down the issue to be able to produce a small reproducing example.

I managed to catch the error at an earlier stage.

Consider this pseudocode:

!$acc data ...clauses go here...
DO i = 1, n
!$acc parallel
  ...several parallelizable cycles go here...
!$acc end parallel
  DO j = 1, n
!$acc parallel
    ...other parallelizable cycles go here; each iteration j+1 depends on the previous one, j...
!$acc end parallel
  END DO
END DO
!$acc end data

This code works. The problem is that it executes a lot of parallel regions synchronously, waiting on the HOST for each region (kernel) to complete and only then scheduling the next one. It is slow. It would be highly desirable to push all the kernels to the GPU and wait only for the final one to complete.
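
Roughly, the pattern I am after looks like this (sketched in C-style OpenACC, since the thread already uses C; my real code is Fortran and the names here are made up):

#include <stdio.h>

int main(void)
{
    int i, n = 1000;
    double a[1000], b[1000];

    for (i = 0; i < n; i++) a[i] = i;

#pragma acc data copy(a[0:n]) create(b[0:n])
    {
        /* Both kernels go on async queue 1. Work on the same queue runs
           in order on the device, so no host-side wait is needed between
           them even though the second kernel reads b. */
#pragma acc parallel loop async(1)
        for (i = 0; i < n; i++) b[i] = 2.0 * a[i];

#pragma acc parallel loop async(1)
        for (i = 0; i < n; i++) a[i] = b[i] + 1.0;

        /* One synchronization point, after all kernels are queued. */
#pragma acc wait(1)
    }

    printf("a[10] = %f\n", a[10]);    /* expect 21.0 */
    return 0;
}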

My very first step was to mark the first parallel region (in the outer loop) with an “async” clause and then put a “!$acc wait” directive just after the region. I perfectly understand that such code cannot be expected to run any faster than the original; I did it just to check whether async is supported. Here is the new code:

!$acc data ...clauses go here...
DO i = 1, n
!$acc parallel async
  ...several parallelizable cycles go here...
!$acc end parallel
!$acc wait
  DO j = 1, n
!$acc parallel
    ...other parallelizable cycles go here; each iteration j+1 depends on the previous one, j...
!$acc end parallel
  END DO
END DO
!$acc end data

It always fails with the error “call to cuMemcpyHtoD returned error 1: Invalid value”. I traced the code: the error occurs at random iterations of the outer and inner loops, and so I wonder whether async/wait is working or not…

P.S. There are no data memory transfers that would correspond to the error encountered. I attribute this error to the transfer of some data required to launch the kernel (parameters, etc.).

I traced the code: the error occurs at random iterations of the outer and inner loops, and so I wonder whether async/wait is working or not…

Async support is very new, so it’s possible that there are problems. Though without an example it’s very difficult to tell what’s wrong. Note: if the code is too long or you don’t want it posted on a public forum, please send it to PGI customer service (trs@pgroup.com) and ask them to forward it to me.

!$acc parallel
…other parallelizable cycles go here; each iteration j+1 depends on the previous one, j…
!$acc end parallel

If there is a backward dependency, the code is not parallel. What do the -Minfo messages tell you about this loop?
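
For example (a made-up minimal case), a loop of this shape is inherently sequential, and -Minfo should report a loop-carried dependence instead of “Loop is parallelizable”:

#include <stdio.h>

int main(void)
{
    int i, n = 10;
    double a[10];
    a[0] = 0.0;

#pragma acc kernels
    for (i = 1; i < n; i++)
        a[i] = a[i-1] + 1.0;    /* iteration i needs the result of i-1 */

    printf("a[%d] = %f\n", n - 1, a[n - 1]);
    return 0;
}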

  • Mat

Note: if the code is too long or you don’t want it posted on a public forum, please send it to PGI customer service (trs@pgroup.com) and ask them to forward it to me.

Mat, it took me several weeks to build and then run the code. It is a large project with a lot of prerequisites to compile and run. I don’t think it is a good idea to use the whole project as a reproducing example.

If there is a backwards dependency, the code is not parallel.

I know. That is why I put the second OpenACC parallel directive INSIDE the inner loop. There is actually yet another loop (a third one) inside that second OpenACC parallel directive.

I don’t think it is a good idea to use the whole project as a reproducing example.

Ok. If it is a problem with the compiler, then hopefully the problem is also seen elsewhere so we can get it fixed in a future version. If it’s a problem with your code, then unfortunately there is not much we can do without an example.

  • Mat

Hello,

Not sure if it is related, but I get the same error running WRF 3.2.1 for the PGI Accelerator, using PGI 12.3:

taskid: 0 hostname: cn15
 Namelist dfi_control not found in namelist.input. Using registry defaults for variables in dfi_control
 Namelist tc not found in namelist.input. Using registry defaults for variables in tc
 Namelist scm not found in namelist.input. Using registry defaults for variables in scm
 Namelist fire not found in namelist.input. Using registry defaults for variables in fire
  Ntasks in X             1 , ntasks in Y             3
 --- NOTE: sst_update is 0, setting io_form_auxinput4 = 0 and auxinput4_interval = 0 for all domains
 --- NOTE: grid_fdda is 0 for domain      1, setting gfdda interval and ending time to 0 for that domain.
 --- NOTE: both grid_sfdda and pxlsm_soil_nudge are 0 for domain      1, setting sgfdda interval and ending time to 0 for that domain.
 --- NOTE: obs_nudge_opt is 0 for domain      1, setting obs nudging interval and ending time to 0 for that domain.
 --- NOTE: num_soil_layers has been set to      5
 WRF V3.2.1 MODEL
  *************************************
  Parent domain
  ids,ide,jds,jde             1          425            1          300
  ims,ime,jms,jme            -4          430           -4          107
  ips,ipe,jps,jpe             1          425            1          100
  *************************************
 DYNAMICS OPTION: Eulerian Mass Coordinate
  alloc_space_field: domain             1 ,     911328204  bytes allocated
  RESTART run: opening wrfrst_d01_2001-10-24_03_00_00 for reading
Timing for processing lateral boundary for domain        1:    0.07996 elapsed seconds.
 WRF NUMBER OF TILES FROM OMP_GET_MAX_THREADS =   1
 WRF TILE   1 IS      1 IE    425 JS      1 JE    100
 WRF NUMBER OF TILES =   1
call to cuMemFree returned error 700: Launch failed
CUDA driver version: 4020

I’m guessing the Accelerator version of WRF I’m trying to run is exactly the code developed by Michael & Craig, which was working for them with earlier versions of the compiler.


  • Dima.

Hi Dima,

I believe this is a known issue where we unfortunately broke the Accelerator version of this code when implementing OpenACC. I know our team is working on fixing it, but I don’t know the details. Craig knows more about it, but he’s on vacation this week. I’ll ask him about it when he gets back.

  • Mat

Hi,

Did you solve the issue? Does the WRF 3.4 OpenACC version work with the PGI compiler? Can it be run on GPUs?

Thank you,
Cristina

Hi Cristina,

I was fairly certain that this had been fixed a while ago, but just to be sure I tried PGI 15.10 with WRF 3.4 and it worked fine.

  • Mat