Missing branch target block - what does this mean?

barash · June 16, 2015, 9:14am

Hi! What does this error mean?

#include<stdio.h>

void main(){
   int i,j,d=3;
   double p[10][10];
#pragma acc enter data create(p,d)
   for(i=0;i<10;i++) for(j=0;j<10;j++) p[i][j]=3.;
#pragma acc update device(p,d)
#pragma acc data     present( p[0:10][0:10],d )
{
#pragma acc region
{
  int i,j;  
  if( d == 3) {
    for(i=0;i<10;i++) for(j=0;j<10;j++) p[i][j]=0.;
  }
}
}
#pragma acc update host(p)
   printf("p[0][0]=%f\n",p[0][0]);
#pragma acc exit data delete(p)
}

$ pgcc -ta=tesla -Minfo=all -o i i.c
PGC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Missing branch target block (i.c: 11)
main:
6, Generating enter data create(d,p[:][:])
9, Generating update device(d,p[:][:])
Generating present(p[:][:],d)
11, Accelerator scalar kernel generated
15, Conditional loop will be executed in scalar mode
20, Generating update host(p[:][:])
21, Generating exit data delete(p[:][:])
PGC/x86-64 Linux 15.5-0: compilation completed with severe errors
$

barash · June 16, 2015, 9:40am

Also, the following code generates the error "Illegal address during kernel execution". Why?

#include<stdio.h>

void main(){
   int i,j,d;
#pragma acc enter data create(d)
   d=3;
#pragma acc update device(d)
#pragma acc data present(d)
{
#pragma acc region
{
  int ii,jj;
  for(ii=0;ii<10;ii++) for(jj=0;jj<10;jj++) d++;
}
}
#pragma acc update host(d)
  printf("d=%d\n",d);
#pragma acc exit data delete(d)
}

$ pgcc -ta=tesla -o i i.c
$ ./i
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

MatColgrove · June 16, 2015, 4:08pm

For the first post, this is a compiler error, so I added TPR#21733. Because a “kernels” (aka “region”) construct looks for parallelism, it will try to offload the for loop but evaluate the if statement on the host. Since “d” is on the device, I think the compiler is getting confused here. You can work around it by not moving the scalar “d” to the device (i.e. remove it from the data directives).

Note that it’s typically better not to manage scalars. By putting a scalar in a data region you’ve made it global and shared by all threads. Scalars should usually be firstprivate, which is the default, and only in rare cases should they be shared.
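As a rough sketch of what leaving a scalar unmanaged looks like: the function below (fill_with_scalar is a made-up name, not from the original post) uses “d” inside a “parallel” construct without listing it in any data clause, so it is treated as firstprivate and each thread gets its own copy. The copyout clause on “p” is an assumption made to keep the sketch self-contained.

```c
#define N 10

/* Hypothetical helper: fills p with the value of d.
   d appears in no data clause, so inside the parallel construct it is
   firstprivate by default: each thread gets a private copy initialized
   from the host value. */
void fill_with_scalar(double p[N][N], int d)
{
    int i, j;
    /* Assumption: p is not already on the device, so copy it back here. */
#pragma acc parallel loop collapse(2) copyout(p[0:N][0:N])
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            p[i][j] = (double)d;
        }
    }
}
```

Built without OpenACC enabled, the pragma is ignored and the loops simply run on the host.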

If you want to keep “d” in a data region, this is a case where the “parallel” construct would be better. With “parallel”, you are telling the compiler where to offload to the device, so it will move the if statement to the device. However, by doing this you’ll severely limit the parallelism since you’ll only have one gang and only the inner loop scheduled as a vector. It would be better to move the compute construct (kernels or parallel) around the “for” statement and not around the if statement.
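A minimal sketch of that last suggestion, assuming “d” is left as an ordinary host scalar and “p” is moved by a data clause on the compute construct itself rather than by the enter data/update directives from the original post. The function name clear_if_three is made up for illustration.

```c
#define N 10

/* Hypothetical helper: zeroes p when d == 3.
   The test on d stays on the host; only the loop nest is offloaded,
   so the compiler never has to branch on a device-resident scalar. */
void clear_if_three(double p[N][N], int d)
{
    int i, j;
    if (d == 3) {
        /* Assumption: p is copied in and out here for simplicity. */
#pragma acc kernels loop copy(p[0:N][0:N])
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                p[i][j] = 0.;
            }
        }
    }
}
```

With the construct placed on the loop nest, the compiler is free to schedule both loops across gangs and vectors instead of generating a single scalar kernel.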

For the second issue, you’re not actually creating valid device code. Here are the compiler feedback messages:

% pgcc 06_16_15c.c -acc -Minfo=accel
main:
      5, Generating enter data create(d)
      8, Generating update device(d)
         Generating present(d)
     13, Accelerator restriction: scalar variable live-out from loop: d
         Accelerator scalar kernel generated
     14, Accelerator restriction: scalar variable live-out from loop: d
     15, Accelerator restriction: induction variable live-out from loop: d
     19, Generating update host(d)
     20, Generating exit data delete(d)

The problem is that “d” gets updated by all threads, which leads to a race condition: some threads may read an older copy of “d” before other threads have updated it. This is a case where you need to put “d” in a reduction clause so that correct device code is generated.

Here’s an example. I also took the liberty of removing the data management of “d”.

% cat 06_16_15d.c
#include<stdio.h>

void main(){
   int i,j,d;
  int ii,jj;
   d=3;
#pragma acc kernels loop reduction(+:d)
  for(ii=0;ii<10;ii++) {
        for(jj=0;jj<10;jj++) {
          d++;
  }}
  printf("d=%d\n",d);
}
% pgcc 06_16_15d.c -acc -Minfo=accel
main:
      8, Loop is parallelizable
      9, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
          8, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
          9, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
             Sum reduction generated for d
% a.out
d=103

Hope this helps,
Mat
