Support for associate syntax in OpenACC kernel

Hi,

I just found a compiler bug when using the Fortran 2003 associate construct inside the scope of an OpenACC kernel.

Here is a very simple example demonstrating the issue:

module mpoint
  type point
    real :: x, y, z
    real :: tmp
  end type point
end module mpoint

program main

  use mpoint

  implicit none

  integer, parameter :: n = 10
  real, allocatable :: array(:)
  !--------------------------------

  allocate(array(n))
  array(:) = 1.0

  call vecadd

  if (abs(array(2) - 9.0) > 0.01) then
     write(*,*) 'GPU result is wrong!'
  else
     write(*,*) 'Test passed!'
  end if

contains

  subroutine vecadd()
    integer :: i
    type(point) :: A

    associate( x => A%x, y => A%y, z => A%z, tmp => A%tmp )

    !$acc parallel loop
    do i = 1, n
       x = i
       y = i + 1
       z = i + 2
       tmp = x + y + z
       array(i) = tmp
    enddo

    !!$acc parallel loop
    !do i = 1, n
    !   A%x = i
    !   A%y = i + 1
    !   A%z = i + 2
    !   A%tmp = A%x + A%y + A%z
    !   array(i) = A%tmp
    !enddo

    end associate
  end subroutine vecadd

end program main

Compiled with nvfortran -acc -Minfo, it crashes at runtime:

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

However, if I instead use the loop that accesses the derived type components directly with the % operator (shown commented out above), it works.

In my understanding, the associate construct could be implemented by the compiler as simply as a textual substitution before any IR is generated. Based on this experiment, nvfortran clearly does something more complicated.

Thanks hyzhou. Though, I wouldn’t necessarily call this a bug, since not all Fortran constructs can be offloaded to a GPU. The problem is that the associate names become references back to a host variable, so they will be tricky to translate on the device. I added an RFE, TPR #29107, to see what, if anything, our engineers can do to get this working as you expect.

Note that for the commented-out parallel loop, be sure to privatize “A” so you don’t get a race condition. Also, “k” is uninitialized, so you’re likely to get an error.
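
For reference, the direct-access version with “A” explicitly privatized would look like this (a sketch based on the code above; untested):

    ! The commented-out loop from above, with "A" made private so
    ! each loop iteration gets its own copy of the derived type.
    !$acc parallel loop private(A)
    do i = 1, n
       A%x = i
       A%y = i + 1
       A%z = i + 2
       A%tmp = A%x + A%y + A%z
       array(i) = A%tmp
    enddo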

-Mat

I have edited the code to remove the k index, which was introduced when I tested nested loops.

Based on the compiler feedback, it seems that for the commented-out loop it automatically recognizes A as private. But to make sure it does, it’s definitely good to state that explicitly.

Thanks!

I have a more general question about the error message.

If I understand this correctly, it means that the generated code has an issue matching memory addresses between the CPU and the GPU? How can I quickly identify which variable is causing the problem?

> it seems like for the commented out loop it automatically recognizes A as private

Since “A” is an aggregate type, it will be implicitly shared. So without the “private(A)” clause, you should see a message like the following in the compiler feedback showing that it’s getting implicitly copied.

>     % nvfortran -acc -Minfo=accel assoc.F90
>     vecadd:
>          45, Generating Tesla code
>              46, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
>          45, Generating implicit copyout(array(1:128),a%tmp(:1)) [if not already present]
>              Generating implicit copyin(a) [if not already present]

> If I understand this correctly, this means that the machine code has an issue matching the memory addresses between CPU and GPU? How can I quickly identify which variable is causing this problem?

Illegal address errors are a generic error indicating that a bad address is being accessed. It’s similar to a seg fault on the host. A number of things can trigger it, such as accessing a host address on the device, not doing a deep copy of aggregate types with dynamic data members, accessing an array out of bounds, device stack or heap overflow, a misaligned address, etc.
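
As an illustration of the “missing deep copy” case, a sketch with a hypothetical type and variable names (untested): copying the aggregate only copies its storage, including the member’s array descriptor, not the data the member points to.

    ! A common cause of error 700: shallow copy of an aggregate
    ! with a dynamic data member.
    type grid
       real, allocatable :: v(:)
    end type grid
    type(grid) :: g

    allocate(g%v(n))
    g%v(:) = 1.0

    ! copyin(g) copies the descriptor of g%v, which still points
    ! at host memory, so dereferencing it on the device is an
    ! illegal address.
    !$acc parallel loop copyin(g)
    do i = 1, n
       array(i) = g%v(i)
    enddo

    ! One fix is to copy the member as well so it gets attached:
    ! !$acc parallel loop copyin(g, g%v)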

You can use cuda-gdb, though while we do provide DWARF information, the code is often highly optimized, so it’s not easy to debug. Plus, cuda-gdb isn’t great with Fortran code. Still, I have found it useful on some occasions, so I’ll usually start there.

In a larger code with many compute constructs, I’ll typically start by setting the environment variable “NV_ACC_NOTIFY=1” (or PGI_ACC_NOTIFY=1 if you’re using a PGI release). This prints each kernel launch so you can determine which one is causing the error. The kernels are named “routine_name_lineno”, so it’s easy to correlate back to the source. From there, I’ll inspect the routine for possible causes and comment out code, or use print statements, to better categorize what could be happening.
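
A typical session might look like this (file and binary names are illustrative):

    % nvfortran -acc -Minfo=accel assoc.F90 -o assoc
    % NV_ACC_NOTIFY=1 ./assoc
    (prints one line per kernel launch, with the routine name and
    line number, e.g. "... function=vecadd_45_gpu line=45 ...")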

Unfortunately, it’s very situational after that. Though, feel free to post questions if you get stuck.

-Mat
