error 702: Launch timeout happens non-deterministically

(I’m using PGI 12.8 on 64-bit Linux, with a GeForce GTX 280 and CUDA 4.1)

I’m doing some experiments with OpenACC, and this is puzzling me:

I had the following code to perform matrix multiplications:

typedef float ff;

void mmul(const ff* restrict a,
          const ff* restrict b,
          ff* restrict c,
          const int n) {
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
{

#pragma acc region
{

#pragma acc loop independent vector(16)
  for (int i = 0; i < n; ++i) {
#pragma acc loop independent vector(16)
    for (int j = 0; j < n; ++j) {
      ff sum = 0;
      for (int k = 0; k < n; ++k) {
        sum += a[i + n * k] * b[k + n * j];
      }
      c[i + n * j] = sum;
    }
  }

}
}
}
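
For reference, I call it from a simple driver along these lines (a minimal sketch with placeholder initialization; the exact values don’t matter):

#include <stdlib.h>

int main(void) {
    const int n = 1024;               /* the 1024x1024 case mentioned below */
    ff *a = malloc(n * n * sizeof(ff));
    ff *b = malloc(n * n * sizeof(ff));
    ff *c = malloc(n * n * sizeof(ff));
    for (int i = 0; i < n * n; ++i) { /* placeholder values */
        a[i] = 1.0f;
        b[i] = 1.0f;
    }
    mmul(a, b, c, n);
    free(a);
    free(b);
    free(c);
    return 0;
}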

The mmul code above runs well, but I’m looking to optimize it.
I then make a small transformation:

void mmul2(const ff* restrict a,
           const ff* restrict b,
           ff* restrict c,
           const int n) {

#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
{

#pragma acc region
{

  for (int is = 0; is < n; is += 32) {
#pragma acc loop independent
    for (int i = is; i < is+32; ++i) {
#pragma acc loop independent
      for (int j = 0; j < n; ++j) {
        ff sum = 0;
        for (int k = 0; k < n; ++k) {
          sum += a[i + n * k] * b[k + n * j];
        }
        c[i + n * j] = sum;
      }
    }
  }
}
}
}

I simply added an outer for loop; the iteration space remains essentially the same.

While this isn’t an optimization by itself, the result is very strange: about half of the time I run this code I get the following error:

call to ctxSynchronize/after/__pgi_cu_uploadx returned error 702: Launch timeout

The other half of the time it simply runs, in about 8 seconds (for 1024x1024 matrices).
For smaller matrices it always works, so I suppose there might be a timeout issue here.

I’m not worried about the performance here; I just want to understand this strange behaviour.

Hi lechat,

Let’s look at the compiler feedback messages for these two loops:

mmul:
     11, Generating copyout(c[0:n*n])
         Generating copyin(b[0:n*n])
         Generating copyin(a[0:n*n])
     14, Generating present_or_copyout(c[0:n*n])
         Generating present_or_copyin(a[0:n*n])
         Generating present_or_copyin(b[0:n*n])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     18, Loop is parallelizable
     20, Loop is parallelizable
         Accelerator kernel generated
         18, #pragma acc loop gang, vector(16) /* blockIdx.x threadIdx.x */
         20, #pragma acc loop gang, vector(16) /* blockIdx.y threadIdx.y */
             CC 1.0 : 20 registers; 64 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 22 registers; 0 shared, 80 constant, 0 local memory bytes
     22, Loop is parallelizable
mmul2:
     38, Generating copyin(b[0:n*n])
         Generating copyin(a[0:n*n])
         Generating copy(c[0:n*n])
     41, Generating present_or_copy(c[0:n*n])
         Generating present_or_copyin(a[0:n*n])
         Generating present_or_copyin(b[0:n*n])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     44, Complex loop carried dependence of '*(c)' prevents parallelization
         Loop carried dependence of '*(c)' prevents parallelization
         Loop carried backward dependence of '*(c)' prevents vectorization
         Complex loop carried dependence of '*(b)' prevents parallelization
         Complex loop carried dependence of '*(a)' prevents parallelization
         Accelerator kernel generated
         44, CC 1.0 : 20 registers; 64 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 22 registers; 0 shared, 80 constant, 0 local memory bytes
         46, #pragma acc loop vector(32) /* threadIdx.x */
         Loop is parallelizable
     48, Loop is parallelizable
     50, Loop is parallelizable
main:
     89, Generating present_or_copyin(B[0:size][0:size])
         Generating present_or_copyin(A[0:size][0:size])

For the first loop nest, you get a nice 2D gang (grid) with a 2D vector (thread block). For the second, however, the loop-carried dependence (the compiler can’t prove that the computed array indices are independent) leaves only a single gang with a single 1D vector.
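
In CUDA terms, the schedule reported for the first version corresponds to the launch configuration sketched below (an illustration of the equivalence, compiled as CUDA C++ with nvcc; not actual generated code):

#include <cuda_runtime.h>   /* for dim3 */

/* "gang, vector(16)" on both loops maps each 16x16 tile of the
   iteration space to one thread block; for n = 1024 that is a
   64x64 grid of 16x16 blocks. */
void show_schedule(int n) {
    dim3 block(16, 16);           /* vector(16) x vector(16) */
    dim3 grid(n / 16, n / 16);    /* one gang per 16x16 tile */
    (void)block; (void)grid;      /* illustration only */
}

To fix the second version, you need to add “independent” to the outer loop and add some schedule clauses: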

#pragma acc region
{

#pragma acc loop independent gang 
  for (int is = 0; is < n; is += 32) {
#pragma acc loop independent vector (32) 
    for (int i = is; i < is+32; ++i) {
#pragma acc loop independent vector (16)
      for (int j = 0; j < n; ++j) {
        ff sum = 0;
        for (int k = 0; k < n; ++k) {
          sum += a[i + n * k] * b[k + n * j];
        }
        c[i + n * j] = sum;
      }
    }
  }
}


“For smaller matrices it always works, so I suppose there might be a timeout issue here.”

Most likely X is killing your run. Is your GTX280 attached to a monitor?
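
You can check whether the watchdog applies to a given card by querying its kernelExecTimeoutEnabled property through the CUDA runtime; a minimal standalone sketch (compile with nvcc):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        /* The watchdog can only kill kernels on devices where this
           flag is set (typically the one driving a display). */
        printf("device %d (%s): watchdog %s\n", d, prop.name,
               prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
    }
    return 0;
}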

Hope this helps,
Mat

Yes, it helps. Thanks Mat.

I have the same problem, but my devices are not attached to a display.

My CUDA devices are:

  • GTX 580
  • GTX 460
  • TESLA C2075

My code:

/* Spanish identifiers: numFilas = number of rows, numColumnas = number
   of columns; resultado receives m1 * m2^T, accumulated in acumulador. */
#pragma acc data copyin(m1[0:numFilas1][0:numColumnas1], m2[0:numFilas2][0:numColumnas2]) copyout(resultado[0:numFilas1][0:numFilas2])
{
    int i, j;

    #pragma omp parallel for default(shared)
    #pragma acc kernels
    for (i = 0; i < numFilas1; i++)
    {
        #pragma omp parallel for
        #pragma acc loop
        for (j = 0; j < numFilas2; j++)
        {
            int k = 0;
            real_t acumulador = 0;

            for (k = 0; k < numColumnas1; k++)
                acumulador += m1[i][k] * m2[j][k];
            resultado[i][j] = acumulador;
        }
    }
}

My code with your suggested changes:

#pragma acc data copyin(m1[0:numFilas1][0:numColumnas1], m2[0:numFilas2][0:numColumnas2]) copyout(resultado[0:numFilas1][0:numFilas2])
{
    #pragma acc region
    {
        int i, j;

        #pragma omp parallel for default(shared)
        #pragma acc loop independent
        for (i = 0; i < numFilas1; i++)
        {
            #pragma omp parallel for
            #pragma acc loop independent
            for (j = 0; j < numFilas2; j++)
            {
                int k = 0;
                real_t acumulador = 0;

                for (k = 0; k < numColumnas1; k++)
                    acumulador += m1[i][k] * m2[j][k];
                resultado[i][j] = acumulador;
            }
        }
    }
}

On the GTX 580 and GTX 460 the execution fails with the launch timeout; only on the Tesla device does it seem to finish, in about 40 seconds.

The test data I used were 5000x5000 matrices.
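
To keep the kernels off the display-capable cards, the run can be pinned to the Tesla through the OpenACC runtime before the first compute region. A minimal sketch, assuming the C2075 enumerates as device 2 (check the numbering with pgaccelinfo):

#include <openacc.h>

/* Assumption: the Tesla C2075 shows up as NVIDIA device 2 in
   pgaccelinfo; adjust the number to match the actual system. */
void pin_to_tesla(void) {
    acc_set_device_type(acc_device_nvidia);
    acc_set_device_num(2, acc_device_nvidia);
}

The same effect is available without code changes through the ACC_DEVICE_NUM environment variable.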