The following program gives a Fatal Error when launched on the host with ACC_DEVICE=HOST:
ACC_DEVICE=HOST acc_mirror_unified
Fatal Usage Error: __pgi_acc_mirrorall2 called before __pgi_cu_init
No problem on the device (GTX470):
ACC_DEVICE=NVIDIA acc_mirror_unified
X= 3.141500
Here is the sample code (extracted from the real code):
MODULE my_data
IMPLICIT NONE
real , allocatable, dimension(:) :: XA
!$acc mirror(XA)
END MODULE my_data
PROGRAM test_mirror_host
USE my_data
IMPLICIT NONE
INTEGER, PARAMETER :: NX=64
allocate (XA(NX))
!$acc region
XA = 3.1415
!$acc end region
!$acc update host(XA(NX:NX))
print *,"X=", XA(NX)
END PROGRAM test_mirror_host
Hmm, it doesn’t look like this has ever worked. I submitted a problem report (TPR#18938) and will have engineering see what they can do. The workaround would be to use a data region rather than mirror:
% cat acc_mirror1.f90
MODULE my_data
IMPLICIT NONE
real , allocatable, dimension(:) :: XA
!acc mirror(XA)
END MODULE my_data
PROGRAM test_mirror_host
USE my_data
IMPLICIT NONE
INTEGER, PARAMETER :: NX=64
allocate (XA(NX))
!$acc data region copyout(XA)
!$acc region
XA = 3.1415
!$acc end region
!$acc end data region
print *,"X=", XA(NX)
END PROGRAM test_mirror_host
% pgf90 -ta=host,nvidia -Minfo acc_mirror1.f90 -o acc_mirror_unified -V12.9 ; acc_mirror_unified
test_mirror_host:
7, PGI Unified Binary version for -tp=nehalem-64 -ta=host
18, Memory set idiom, loop replaced by call to __c_mset4
test_mirror_host:
7, PGI Unified Binary version for -tp=nehalem-64 -ta=nvidia
16, Generating copyout(xa(:))
17, Generating present_or_copyout(xa(:))
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
18, Loop is parallelizable
Accelerator kernel generated
18, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
CC 1.0 : 7 registers; 44 shared, 0 constant, 0 local memory bytes
CC 2.0 : 11 registers; 0 shared, 60 constant, 0 local memory bytes
X= 3.141500
OK for the data region (it was my previous version), but I’m in an optimization phase at this point …
The XA work buffer is used many times in the code, typically for halo exchanges with MPI …
So I try to allocate it once for the whole duration of the run …
I have already done that for the CPU part …
… but for the GPU part, if I put it in a “data region”, I think the GPU memory is allocated and freed every time the code enters and exits the data region, so a lot of time is lost …
… is that right?
… and the mirror clause is just what I need …
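To illustrate the concern above (a hypothetical sketch, not from the original post; the time loop and nsteps are placeholders): a structured data region inside a loop creates and destroys the device copy of XA on every pass, whereas mirror ties the device allocation to the lifetime of the host allocatable.

```fortran
! Hypothetical sketch: inside a time loop, a structured data region
! allocates and frees the device copy of XA on every iteration.
do step = 1, nsteps
   !$acc data region copy(XA)   ! device memory for XA allocated here ...
   !$acc region
   XA = XA + 1.0
   !$acc end region
   !$acc end data region        ! ... and freed here, every iteration
end do

! With mirror, the device copy lives as long as the host array:
!   allocate(XA(NX))    ! also allocates the device mirror
!   ... many regions and !$acc update calls, no realloc/free ...
!   deallocate(XA)      ! frees both host and device copies
```

Hoisting the data region outside the loop avoids the repeated allocation too, but mirror does it once for the whole program, which is what the halo-exchange buffer needs.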
Correct. You can still use mirror, but just not within a unified binary.
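As a sketch of what building without a unified binary looks like (the source file name here is hypothetical; the post above only shows the unified build with -ta=host,nvidia): target the NVIDIA device alone, and the mirror directive can stay in.

```shell
# Build for the NVIDIA target only -- no -ta=host,nvidia unified binary --
# so the mirror directive is usable. File name is a placeholder.
pgf90 -ta=nvidia -Minfo acc_mirror.f90 -o acc_mirror
./acc_mirror
```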