PGF90-F-0155-Compiler failed to translate accelerator region

Hi all,

I’m in the process of inlining a lot of code in hopes I can accelerate it effectively. On my first stab, I get this error though:

Stack dump:
0.      Running pass 'NVPTX DAG->DAG Pattern Instruction Selection' on function '@inner_iteration_acc_861_gpu'
pgnvd-Fatal-/opt/pgi/linux86-64/2013/cuda/5.0/nvvm/cicc TERMINATED by signal 11
Arguments to /opt/pgi/linux86-64/2013/cuda/5.0/nvvm/cicc
/opt/pgi/linux86-64/2013/cuda/5.0/nvvm/cicc -arch compute_20 -m64 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 /tmp/pgnvdCqWdu1rbSc9z.i -o /tmp/pgcudaforCQUduIPB8M83.ptx
PGF90-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (Pij_GPU_Acc.f90: 1)
PGF90/x86-64 Linux 13.10-0: compilation aborted

The Minfo messages are pretty straightforward - some stuff is accelerated, some can’t be, and the usual live-out/loop-carried-dependence/etc. messages are generated. I have never seen the message above, though. Any ideas as to what is likely causing it?

Thanks,
Rob

Hi Rob,

It seems I have seen this issue before. Try switching to CUDA 5.5.

Alexey

Hi Rob,

This is an error in the back-end CUDA 5.0 compiler. As Alexey suggests, try using CUDA 5.5 (-ta=nvidia,cuda5.5) to see if it has been fixed there.
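
For reference, the compile line would look something like this (keep whatever other flags you’re already using):

pgf90 -acc -ta=nvidia,cuda5.5 -Minfo=accel -c Pij_GPU_Acc.f90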

If not, please send PGI Customer Service (trs@pgroup.com) a reproducing example so we can report it to the CUDA team and possibly find you a workaround.

On a side note, 14.1 will have initial support for the OpenACC 2.0 “routine” directive, which will allow you to make routine calls instead of having to inline everything.
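
As a rough sketch of what that will look like (the routine name and arguments here are made up purely for illustration):

! Illustrative only: a sequential routine callable from device code
subroutine inner_sum(x1, x2, n, bar, foo)
!$acc routine seq
  integer, intent(in) :: x1, x2, n
  real, intent(in)    :: bar(n)
  real, intent(out)   :: foo
  integer :: j
  foo = 0.0
  do j = x1, x2
     foo = foo + bar(j)
  enddo
end subroutine inner_sum

You would then call inner_sum from inside an accelerated loop instead of inlining its body.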

Best Regards,
Mat

Hi Mat,

The error persists when I use CUDA 5.5. I’ll see if I can make up an example, but it may take some time to strip it down.

While I have your ear, can I ask a completely unrelated question? Sometimes I have loosely nested loops:

do i = 1, N

   !some code

   foo = 0
   do j = x1, x2
       foo = foo + bar(j)
   enddo
enddo

Now what I’ve tried to do to accelerate it is:

!$acc kernels
do i = 1, N   ! line 10

   !some code

   foo = 0
!$acc loop reduction(+:foo)
   do j = x1, x2    !line 30
       foo = foo + bar(j)
   enddo
enddo
!$acc end kernels

When I compile with -Minfo=accel, I get something like this:

    10, Loop is parallelizable
         Accelerator kernel generated
        10, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    30, Loop is parallelizable

But no kernel is generated at line 30. Does it just decide that, while the inner loop may be parallelizable, it’s better to only work with the outer loop? Or is there some scheduling that I have to do?

Thanks,
Rob

Hi Rob,

The default for “kernels” is to work on tightly nested loops, so you need to add a few more loop schedule clauses. Though, this might be a case where “parallel” is a better fit, since its default is for non-tightly nested loops like this one. Give this schedule a try:

!$acc parallel loop gang 
do i = 1, N   ! line 10 

   !some code 

   foo = 0 
!$acc loop vector reduction(+:foo) 
   do j = x1, x2    !line 30 
       foo = foo + bar(j) 
   enddo 
enddo
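
If you would rather keep “kernels”, explicitly scheduling both loops should give a similar result (untested sketch):

!$acc kernels
!$acc loop gang
do i = 1, N   ! line 10

   !some code

   foo = 0
!$acc loop vector reduction(+:foo)
   do j = x1, x2    !line 30
       foo = foo + bar(j)
   enddo
enddo
!$acc end kernels
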
- Mat

Hi Mat,

Thanks - it’s working now. I was initially using parallel regions, but encountered similar behaviour. For instance:

!$acc parallel
do i = 1, N   ! line 10 

   !some code 

   foo = 0 
!$acc loop vector reduction(+:foo) 
   do j = x1, x2    !line 30 
       foo = foo + bar(j) 
   enddo 
enddo 
!$acc end parallel

actually only generates a kernel for the inner loop, whereas ‘kernels’ generates a kernel only for the outer loop. The only thing I evidently hadn’t tried was ‘parallel loop’.

Returning to my compilation error, I’m afraid I cannot isolate the problem into a stand-alone test program (it goes away if I do). I’m also not permitted to distribute any source code. All I can tell you is that it looks something like this:

  integer :: nk
  integer :: n
  integer :: m
  real    :: b(45, 46)
  logical :: s

!$acc data create(b)
  b = 1.0
  nk = 30

  n = nk
  m = nk+1

!$acc kernels

  s = .FALSE.      

  r = 0
  if (n == 1) then                  
     if (b(1,1) == 0.) then
        s = .TRUE.
     else
        b(1,2:m) = b(1,2:m) / b(1,1)
     endif
     r = 1
  endif
!$acc end kernels
!$acc end data

(note that this doesn’t generate the error). I’ll admit that I sloppily threw “kernels” regions in on the first time around - I was able to get things to compile when I put the kernels statements around the b(1,2:m) = … line only.

Thanks,
Rob

Hi Rob,

That’s unfortunate about not being able to create a reproducing example. Without one, it’s very difficult to determine the cause, especially with a generic error like a seg fault. Hopefully, we can work around the error by moving the OpenACC directives around.

Looking at this code, I do see one coding error:

!$acc data create(b) 
  b = 1.0

Here you create the array on the device but then assign the values on the host, so the device copy remains uninitialized. To fix this, either swap the statements and use “copyin” instead of “create”, add an “update” directive, or put the assignment in a compute region:

b = 1.0
!$acc data copyin(b) 
.. or ..
!$acc data create(b) 
  b = 1.0 
!$acc update device(b)
.. or ..
!$acc data create(b) 
!$acc kernels
  b = 1.0 
!$acc end kernels

As for the “kernels” region, the compiler is correctly creating a sequential GPU kernel, given that there is an “if” statement. The OpenACC 2.0 spec, which we’re currently in the process of implementing, will let you nest compute regions to exploit dynamic parallelism. The typical use case for dynamic parallelism is to transfer control flow over to the GPU as a serial kernel, which then launches one or more parallel kernels. So in this case, you could add a second “kernels” directive around the array syntax.

Though, until nested compute regions are supported, you should just use a “kernels” directive around the inner array assignment and maintain synchronization of “b” between the host and device.
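
Until then, the workaround might look something like this (an untested sketch based on your snippet, keeping the “if” logic on the host and offloading only the array assignment):

  b = 1.0
!$acc data copyin(b)
  if (n == 1) then
     if (b(1,1) == 0.) then
        s = .TRUE.
     else
!$acc kernels
        b(1,2:m) = b(1,2:m) / b(1,1)
!$acc end kernels
!$acc update host(b)
     endif
     r = 1
  endif
!$acc end data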

- Mat