EventSynchronize returned error 700

Hi,

Here is a small example code:

      program testsum

      implicit none
      integer::i,j,k,lj,rj
      integer::a(10,3),suma(5,3)
      
!$acc data region local(a,suma)
      do k=1,500

!$acc region
         do i=1,10
            a(i,1)=i
            a(i,2)=i*2
            a(i,3)=i*3
         end do
!$acc end region

!$acc region
         do j=1,5
            lj=(j-1)*2 + 1
            rj=j*2
            suma(j,1)=sum(a(lj:rj,1))
            suma(j,2)=sum(a(lj:rj,2))
            suma(j,3)=sum(a(lj:rj,3))
         end do
!$acc end region

      end do
!$acc updateout(suma)
!$acc end data region

      write(*,*)suma(:,1)
      end program

It fails to run:

call to EventSynchronize returned error 700: Launch failed
CUDA driver version: 2030

Accelerator Kernel Timing data
/mnt/home/../test.f
  testsum
    18: region entered 138 times
        time(us): total=4499 init=14 region=4485
                  kernels=1499 data=0
        w/o init: total=4485 max=136 min=31 avg=32
        19: kernel launched 138 times
            grid: [1]  block: [5]
            time(us): total=1499 max=12 min=10 avg=10
/mnt/home/../test.f
  testsum
    10: region entered 138 times
        time(us): total=4142 init=19 region=4123
                  kernels=1109 data=0
        w/o init: total=4123 max=134 min=28 avg=29
        11: kernel launched 138 times
            grid: [1]  block: [10]
            time(us): total=1109 max=13 min=7 avg=8
/mnt/home/../test.f
  testsum
    7: region entered 1 time
        time(us): init=696724

Can someone explain this error?

Thanks.

Hi WENYANG LIU,

Thanks for the report. I’ll need to pass this one on to our compiler engineers for further investigation (I added TPR#17300).

In the meantime, one workaround is to put each sum in its own loop. For example:

% cat test2.f90 

      program testsum

      implicit none
      integer::i,j,k,lj,rj
      integer::a(10,3),suma(5,3)

       suma=0     
!$acc data region local(a,suma)
      do k=1,50

!$acc region
         do i=1,10
            a(i,1)=i
            a(i,2)=i*2
            a(i,3)=i*3
         end do
!$acc end region

!$acc region
         do j=1,5
            lj=(j-1)*2 + 1
            rj=j*2
            suma(j,1)=sum(a(lj:rj,1))
         end do
!$acc end region

!$acc region
         do j=1,5
            lj=(j-1)*2 + 1
            rj=j*2
            suma(j,2)=sum(a(lj:rj,2))
         end do
!$acc end region

!$acc region
         do j=1,5
            lj=(j-1)*2 + 1
            rj=j*2
            suma(j,3)=sum(a(lj:rj,3))
         end do
!$acc end region

      end do
!$acc updateout(suma)
!$acc end data region

      write(*,*)suma(:,1)
      write(*,*)suma(:,2)
      write(*,*)suma(:,3)
      end program 
% pgf90 -ta=nvidia -Minfo=accel test2.f90 -V10.9 ; a.out
testsum:
      9, Generating local(suma(:,:))
         Generating local(a(:,:))
     12, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     13, Loop is parallelizable
         Accelerator kernel generated
         13, !$acc do parallel, vector(10)
             CC 1.0 : 6 registers; 20 shared, 16 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 6 registers; 20 shared, 16 constant, 0 local memory bytes; 25 occupancy
     20, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     21, Loop is parallelizable
         Accelerator kernel generated
         21, !$acc do parallel, vector(5)
             CC 1.0 : 6 registers; 20 shared, 28 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 6 registers; 20 shared, 28 constant, 0 local memory bytes; 25 occupancy
     24, Loop is parallelizable
     28, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     29, Loop is parallelizable
         Accelerator kernel generated
         29, !$acc do parallel, vector(5)
             CC 1.0 : 6 registers; 20 shared, 28 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 6 registers; 20 shared, 28 constant, 0 local memory bytes; 25 occupancy
     32, Loop is parallelizable
     36, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     37, Loop is parallelizable
         Accelerator kernel generated
         37, !$acc do parallel, vector(5)
             CC 1.0 : 6 registers; 20 shared, 28 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 6 registers; 20 shared, 28 constant, 0 local memory bytes; 25 occupancy
     40, Loop is parallelizable
     45, Generating !$acc update host(suma(:,:))
            3            7           11           15           19
            6           14           22           30           38
            9           21           33           45           57

Best Regards,
Mat

Hello,

I am getting a similar problem for this code:

	

#define M     256*4
#define N     256*4
#define K     256*4

int t1, t2, t3;
	#pragma acc region copyin( PGI_OPT_C[0:N-1 ][0:N-1], PGI_OPT_B[0:N-1 ][0:N-1], PGI_OPT_A[0:N-1 ][0:N-1] )//, copyout( PGI_OPT_C[ 0:N*N-1  ] ) 
	{ 
	  for (t1=0;t1<=N-1;t1++) 
	  {
	    for (t2=0;t2<=N-1;t2++) 
	    {
	      for (t3=0;t3<=N-1;t3++) 
	      {
		 PGI_OPT_C[ t1*N + t2]= PGI_OPT_C[ t1*N + t2 ] + PGI_OPT_A[ t1*N + t3 ]*PGI_OPT_B[ t3*N + t2];
		 //PGI_OPT_C[ t1][ t2] = PGI_OPT_C[ t1][ t2 ] + PGI_OPT_A[ t1 ][ t3 ]*PGI_OPT_B[ t3][ t2 ];
		//PGI_OPT_C[ t1 ] = PGI_OPT_C[ t1 ]*2 ;
	      }
	    }
	  }	
	}

Oddly enough, this only happens when M, N, and K are defined with sizes greater than 1024.

Also, if I set the size to 256*256, I get a compilation error:

PGC-F-0000-Internal compiler error. pgnvd job exited with nonzero status code       0 (main.c: 168)

Any thoughts on this?

Hi sameer.asal,

Can you please post, or send to PGI Customer Service (trs@pgroup.com), a complete reproducing example? I don’t see anything unusual in your code snippet, so I will need the complete example to determine the problem. It would also help to know what type of device you are using and which compiler version.

“700” is a generic error and is most likely unrelated to Wenyang Liu’s issue. I’ve seen it most often when there is a seg fault while copying data to the device. Check that your array bounds match the copy clause.
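
As a rough illustration of that check (a sketch, not your actual code): the arrays in the snippet are indexed as flat 1D arrays, so the copy clauses would describe the same flat range. The function name `matmul_flat` and its parameters are made up for this example, and the bounds assume the `[lower:upper]` form used in the post above (OpenACC instead uses `[start:length]`). Without accelerator flags the pragma is simply ignored and the loop runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: C = C + A*B for n x n matrices stored as
 * flat 1D arrays of n*n floats (row-major). */
void matmul_flat(int n, const float *restrict a,
                 const float *restrict b, float *restrict c)
{
    assert(n > 0);

    /* Assumption: [lower:upper] bounds, as in the original snippet.
     * Each clause covers exactly the n*n elements that are indexed
     * below, and c uses copy because it is both read and written. */
#pragma acc region copyin(a[0:n*n-1], b[0:n*n-1]) copy(c[0:n*n-1])
    {
        for (int t1 = 0; t1 < n; t1++) {
            for (int t2 = 0; t2 < n; t2++) {
                float sum = c[t1 * n + t2];
                for (int t3 = 0; t3 < n; t3++) {
                    sum += a[t1 * n + t3] * b[t3 * n + t2];
                }
                c[t1 * n + t2] = sum;
            }
        }
    }
}
```

If the clause described a smaller or differently shaped range than the loop actually touches, the kernel could read or write outside the device copy, which is one way to end up with a launch failure.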

- Mat

Hi

Here is sample code for a Jacobi iteration. It compiles fine but fails at run time. I am using PGI 12.3 and have tried 12.4 as well. n and m are 512.

Code snippet:
#pragma omp parallel for shared(Anew)
for (int i = 1; i < m; i++)
{
    Anew[0][i] = 0.f;
    Anew[n-1][i] = 0.f;
}
#pragma omp parallel for shared(Anew)
for (int j = 1; j < n; j++)
{
    Anew[j][0] = y0[j];
    Anew[j][m-1] = y0[j]*expf(-pi);
}

while ( error > tol && iter < iter_max )
{
    error = 0.f;

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
        }
    }

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            A[j][i] = Anew[j][i];
        }
    }

On execution:
./laplace2d_acc
Jacobi relaxation Calculation: 4096 x 4096 mesh
call to EventSynchronize returned error 700: Launch failed
CUDA driver version: 4020

Accelerator Kernel Timing data
/home/sunita/tests/openacc/parallel-forall-code-samples-78e5f9b/posts/002-openacc-example/step1/laplace2d.c
main
74: region entered 1 time
time(us): init=119031
data=29051

And if
#pragma acc data copy(A, Anew)
is added before
while ( error > tol && iter < iter_max )
{
error = 0.f;

then I get the following error:

Jacobi relaxation Calculation: 4096 x 4096 mesh
call to cuMemcpyDtoH returned error 700: Launch failed
CUDA driver version: 4020

Please let me know if there is a problem in the code.

Sunita

Hi Sunita,

OpenACC requires that arrays be contiguous in memory. Hence, arrays of pointers cannot be used. How are A and Anew declared?
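
To show what I mean by contiguous, here is a minimal sketch, not taken from your program. The names `alloc_grid`, `jacobi_sweep`, and the `IDX` macro are invented for this illustration. The whole grid lives in one allocation, so a single data clause in OpenACC's `[start:length]` form can describe it. The max and absolute value are written out by hand only to keep the sketch free of the math library.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical row-major index into an n x m grid stored in one block. */
#define IDX(j, i, m) ((size_t)(j) * (size_t)(m) + (size_t)(i))

/* One contiguous, zero-initialized block of n*m floats (not n separate
 * row allocations reached through an array of pointers). */
float *alloc_grid(int n, int m)
{
    assert(n > 0 && m > 0);
    return calloc((size_t)n * (size_t)m, sizeof(float));
}

/* One Jacobi sweep over the interior points; returns the largest change. */
float jacobi_sweep(int n, int m, const float *restrict a, float *restrict anew)
{
    float error = 0.0f;

    /* Assumption: OpenACC [start:length] sections covering the full block. */
#pragma acc kernels copyin(a[0:n*m]) copy(anew[0:n*m])
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            float v = 0.25f * (a[IDX(j, i + 1, m)] + a[IDX(j, i - 1, m)]
                             + a[IDX(j - 1, i, m)] + a[IDX(j + 1, i, m)]);
            float diff = v - a[IDX(j, i, m)];
            if (diff < 0.0f) diff = -diff;
            if (diff > error) error = diff;
            anew[IDX(j, i, m)] = v;
        }
    }
    return error;
}
```

An array declared as `float A[n][m]` is also one contiguous block. The case that does not work is something like a `float **` where each row is allocated separately.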

- Mat

Hi Mat

Thanks for the reply. A and Anew are declared as arrays, and memset is used to set their values:

float A[n][m];
float Anew[n][m];
float y0[n];

memset(A, 0, n * m * sizeof(float));

Sunita

Hi Sunita,

The declarations look fine, so I tried to create an example from what you posted, but my example ran correctly. Can you please post a complete reproducing example?

Thanks,
Mat