Internal compiler error. load of zero symbol

Hello. I’m trying to compile a module that uses acc directives, but compilation fails with an internal compiler error.
Here is the module source (I’ve cut everything unrelated to the error from the full version):

      module FOURIN_acc
      use cudafor
      
      real, allocatable, device :: FSPC(:)
      integer, private :: NND2C=-1,NXC=-1,NSGNC = 0

      contains
      
                                            
      SUBROUTINE FOURTR(FIN,FOUT,TEMP,NPOINT,NLOG2)                     
      real, device :: FIN(NPOINT),FOUT(NPOINT),TEMP(NPOINT)                 
      NPP1=NPOINT+1                                                     

!$acc region 
      DO 3 K=1,KLIM                                                     
      DO 3 J=2,JLIM,2                        
!$acc do independent      
      DO I=1,2                                                       
      TEMP(JLIM*K+J-JLIM-1) = TEMP(JLIM*K+J-JLIM-1) +
     *  FIN(JLIM*K+J+I+NPOINT-JLIM-2)*FSPC(KLIM*J-2*KLIM+I) 
      EndDo
    3 CONTINUE                                                          
!$acc end region    
C         
!$acc region
      DO 35 K=1,KLIM                                                    
      DO 35 J=2,JLIM,2                                                  
!$acc do independent
      DO I=1,2                               
      F = FSPC(KLIM*J-2*KLIM+I+NPOINT)
      TEMP(JLIM*K+J-JLIM) = TEMP(JLIM*K+J-JLIM) +
     * FIN(JLIM*K+J+I+NPOINT-JLIM-2)*F
      End Do
   35 CONTINUE                                                          
!$acc end region                                    
                                                        
      RETURN                                                            
      END                                                               

      end module FOURIN_acc

Here is the compilation result:

pgf95 -Mcuda=cuda3.2,cc11 -ta=nvidia,cc11 -tp=amd64 -Minfo -c fourin.for 
PGF90-S-0000-Internal compiler error. load of zero symbol       0 (fourin.for: 38)
PGF90-S-0000-Internal compiler error. load of zero symbol       0 (fourin.for: 38)
PGF90-S-0000-Internal compiler error. load of zero symbol       0 (fourin.for: 38)
PGF90-S-0000-Internal compiler error. load of zero symbol       0 (fourin.for: 38)
fourtr:
     15, Complex loop carried dependence of 'temp' prevents parallelization
     16, Complex loop carried dependence of 'temp' prevents parallelization
     18, Loop is parallelizable
         Accelerator kernel generated
         15, !$acc do seq
         16, !$acc do seq
             Using register for 'fspc'
         18, !$acc do parallel, vector(2) ! blockidx%x threadidx%x
     26, Complex loop carried dependence of 'temp' prevents parallelization
     27, Complex loop carried dependence of 'temp' prevents parallelization
     29, Loop is parallelizable
         Accelerator kernel generated
         26, !$acc do seq
         27, !$acc do seq
             Using register for 'fspc'
         29, !$acc do parallel, vector(2) ! blockidx%x threadidx%x
  0 inform,   0 warnings,   4 severes, 0 fatal for fourtr

I’m using PGI Fortran 11.3.

Hi Senya,

Thanks for the report. I was able to recreate the problem here and have sent it on to our engineers for further investigation (TPR #17935).

The workaround is to remove the device attribute from the FSPC variable and instead use Accelerator data regions or the mirror directive to create a global device copy of FSPC.
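
As a rough illustration of the data region option, here is a minimal sketch. The routine name `accumulate`, its argument list, and the simplified loop body are hypothetical and not taken from the original module; the `copyin` and `copy` clauses follow the PGI Accelerator model as I understand it, so check them against the documentation for your compiler release.

```fortran
      subroutine accumulate(fin, temp, fspc, n)
      implicit none
      integer, intent(in) :: n
      real, intent(in) :: fin(n), fspc(n)
      real, intent(inout) :: temp(n)
      integer :: i
      ! fspc is an ordinary host array here (no device attribute).
      ! The data region creates device copies for its duration:
      ! fin and fspc are copied in, temp is copied in and back out.
!$acc data region copyin(fin, fspc) copy(temp)
!$acc region
      do i = 1, n
         temp(i) = temp(i) + fin(i)*fspc(i)
      end do
!$acc end region
!$acc end data region
      end subroutine accumulate
```

Several compute regions can be placed inside one data region, so the arrays are transferred once rather than once per kernel.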

  • Mat

But can I have an array that is allocated on the device without having a copy on the host?
Copying is a huge time loss for my task.

But can I have an array that is allocated on the device without having a copy on the host?

If you use the “mirror” directive, the compiler allocates a device copy of FSPC whenever FSPC is allocated, but it never copies data between the host and the device on its own. Data copies for mirrored variables must be done explicitly using the “update” directives.

So changing:

      module FOURIN_acc
      use cudafor
     
      real, allocatable, device :: FSPC(:)
      integer, private :: NND2C=-1,NXC=-1,NSGNC = 0

to

      module FOURIN_acc
     
      real, allocatable :: FSPC(:)
!$acc mirror(FSPC)
      integer, private :: NND2C=-1,NXC=-1,NSGNC = 0

will have the same effect as declaring it as a CUDA Fortran device allocatable.

Hope this helps,
Mat

Thanks.

Does memory get allocated on the host and the device simultaneously if I use mirror?
Can I pass mirror arrays as function/subroutine arguments?
If I write

!$acc region
      FSPC(1) = 1.0
      FSPC(2) = 0
...

will it assign values to the host or the device array?
How can I access single elements of FSPC on the device?

Sorry for so many questions, but I didn’t find good documentation that describes the mirror directive.

Hi Senya,

Does memory get allocated on the host and the device simultaneously if I use mirror?

Yes. When the variable is allocated, space is allocated in both host and device memory. They are not coherent, i.e. values contained in one are not automatically updated in the other. The “update” directive can be used to copy data between the two.
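
A minimal sketch of that pattern, assuming the host copy is filled first and then pushed to the device. The module name `fspc_host_init`, the routine `init_fspc`, and the dummy values are hypothetical, used only for illustration.

```fortran
      module fspc_host_init
      implicit none
      real, allocatable :: FSPC(:)
!$acc mirror(FSPC)
      contains
      subroutine init_fspc(n)
      integer, intent(in) :: n
      integer :: i
      ! Allocating the mirrored array allocates host and device
      ! storage; the two copies are not kept coherent.
      allocate(FSPC(n))
      ! Fill the host copy (placeholder values; real code might
      ! read them from a file).
      do i = 1, n
         FSPC(i) = real(i)
      end do
      ! Explicitly copy the host values to the device copy.
!$acc update device(FSPC)
      end subroutine init_fspc
      end module fspc_host_init
```

The reverse direction would use `!$acc update host(FSPC)` after a compute region has modified the device copy.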

Can I pass mirror arrays as function/subroutine arguments?

Yes, provided that the routines have an explicit interface or are contained in a module that is visible during compilation.
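
One way this might look is sketched below. It relies on the “reflected” directive from the PGI Accelerator model, which (as I understand it) tells the compiler that the actual argument already has a device copy at the call site; the thread itself does not mention it, so treat it as an assumption. The module `scale_mod` and routine `scale_on_device` are hypothetical names.

```fortran
      module scale_mod
      implicit none
      contains
      ! Being a module procedure gives callers an explicit interface.
      subroutine scale_on_device(a, s)
      real :: a(:)
      real, intent(in) :: s
      integer :: i
      ! Assumption: the caller passes an array that already has a
      ! device copy, for example a mirrored module array.
!$acc reflected(a)
!$acc region
      do i = 1, size(a)
         a(i) = s * a(i)
      end do
!$acc end region
      end subroutine scale_on_device
      end module scale_mod
```

A caller that uses both this module and the module declaring the mirrored FSPC could then write `call scale_on_device(FSPC, 2.0)`.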

Will it assign values to the host or the device array?

Since the assignment is within an ACC compute region, it will get assigned to the device copy. However, you do need some parallel loop within the region, else the compiler won’t generate a GPU kernel. Ideally you would initialize FSPC in a parallel loop. If you can’t, for example because FSPC is initialized from data in an external file, then it’s best to initialize the host copy and then use the “update” directive to initialize the device copy.
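
A minimal sketch of initializing the device copy directly in a parallel loop. The module name `fspc_device_init`, the routine `fill_fspc`, and the zero fill value are hypothetical.

```fortran
      module fspc_device_init
      implicit none
      real, allocatable :: FSPC(:)
!$acc mirror(FSPC)
      contains
      subroutine fill_fspc(n)
      integer, intent(in) :: n
      integer :: i
      allocate(FSPC(n))
      ! The loop gives the compiler something to parallelize, so a
      ! kernel is generated and the writes go to the device copy.
!$acc region
      do i = 1, n
         FSPC(i) = 0.0
      end do
!$acc end region
      ! Only needed if the host must see the values as well.
!$acc update host(FSPC)
      end subroutine fill_fspc
      end module fspc_device_init
```

If the host never reads FSPC, the final update can be dropped and no transfer takes place at all.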

How can I access single elements of FSPC on the device?

You can reference FSPC normally within a compute region and since it’s mirrored, the correct device copy is used.

  • Mat

Hi Senya,

FYI, I just verified that TPR #17935 has been fixed in the 11.7 release.

  • Mat