Missing branch target block - what does this mean?

Hi! What does this error mean?

#include<stdio.h>

void main(){
   int i,j,d=3;
   double p[10][10];
#pragma acc enter data create(p,d)
   for(i=0;i<10;i++) for(j=0;j<10;j++) p[i][j]=3.;
#pragma acc update device(p,d)
#pragma acc data     present( p[0:10][0:10],d )
{
#pragma acc region
{
  int i,j;  
  if( d == 3) {
    for(i=0;i<10;i++) for(j=0;j<10;j++) p[i][j]=0.;
  }
}
}
#pragma acc update host(p)
   printf("p[0][0]=%f\n",p[0][0]);
#pragma acc exit data delete(p)
}

$ pgcc -ta=tesla -Minfo=all -o i i.c
PGC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Missing branch target block (i.c: 11)
main:
6, Generating enter data create(d,p[:][:])
9, Generating update device(d,p[:][:])
   Generating present(p[:][:],d)
11, Accelerator scalar kernel generated
15, Conditional loop will be executed in scalar mode
20, Generating update host(p[:][:])
21, Generating exit data delete(p[:][:])
PGC/x86-64 Linux 15.5-0: compilation completed with severe errors
$

Also, the following code fails at run time with the error "Illegal address during kernel execution". Why?

#include<stdio.h>

void main(){
   int i,j,d;
#pragma acc enter data create(d)
   d=3;
#pragma acc update device(d)
#pragma acc data present(d)
{
#pragma acc region
{
  int ii,jj;
  for(ii=0;ii<10;ii++) for(jj=0;jj<10;jj++) d++;
}
}
#pragma acc update host(d)
  printf("d=%d\n",d);
#pragma acc exit data delete(d)
}

$ pgcc -ta=tesla -o i i.c
$ ./i
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

For the first post, this is a compiler error, so I added TPR#21733. A “kernels” (aka “region”) construct looks for parallelism, so it will try to offload the for loop but evaluate the if statement on the host. Since “d” is on the device, I think the compiler is getting confused here. You can work around it by not moving the scalar “d” to the device (i.e. remove it from the data directives).

Note that it’s typically better not to manage scalars yourself. By putting a scalar in a data region you’ve made it global and shared by all threads. Often scalars should be firstprivate, which is the default, and only in rare cases should they be shared.

If you want to keep “d” in a data region, this is a case where the “parallel” construct would be better. With “parallel”, you are telling the compiler where to offload to the device, so it will move the if statement to the device. However, by doing this you’ll severely limit the parallelism since you’ll only have one gang and only the inner loop scheduled as a vector. It would be better to move the compute construct (kernels or parallel) around the “for” statement and not around the if statement.


For the second issue, you’re not actually creating valid device code. Here are the compiler feedback messages:

% pgcc 06_16_15c.c -acc -Minfo=accel
main:
      5, Generating enter data create(d)
      8, Generating update device(d)
         Generating present(d)
     13, Accelerator restriction: scalar variable live-out from loop: d
         Accelerator scalar kernel generated
     14, Accelerator restriction: scalar variable live-out from loop: d
     15, Accelerator restriction: induction variable live-out from loop: d
     19, Generating update host(d)
     20, Generating exit data delete(d)

The problem is that “d” gets updated by all threads, which can lead to a race condition where some threads read an older copy of “d” before other threads have updated it. This is a case where you need to put “d” in a reduction clause so that correct device code is generated.

Here’s an example. I also took the liberty of removing the data management of “d”.

% cat 06_16_15d.c
#include<stdio.h>

int main(void){
   int d;
   int ii,jj;
   d=3;
#pragma acc kernels loop reduction(+:d)
   for(ii=0;ii<10;ii++) {
      for(jj=0;jj<10;jj++) {
         d++;
      }
   }
   printf("d=%d\n",d);
   return 0;
}
% pgcc 06_16_15d.c -acc -Minfo=accel
main:
      8, Loop is parallelizable
      9, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
          8, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
          9, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
             Sum reduction generated for d
% a.out
d=103

Hope this helps,
Mat