The following program gives a Fatal Error when launched on the host with ACC_DEVICE=HOST:
ACC_DEVICE=HOST acc_mirror_unified
Fatal Usage Error: __pgi_acc_mirrorall2 called before __pgi_cu_init
No problem on the device (GTX470):
ACC_DEVICE=NVIDIA acc_mirror_unified
X= 3.141500
Here is the sample code (extracted from the real code):
MODULE my_data
IMPLICIT NONE
real , allocatable, dimension(:) :: XA
!$acc mirror(XA)
END MODULE my_data
PROGRAM test_mirror_host
USE my_data
IMPLICIT NONE
INTEGER, PARAMETER :: NX=64
allocate (XA(NX))
!$acc region
XA = 3.1415
!$acc end region
!$acc update host(XA(NX:NX))
print *,"X=", XA(NX)
END PROGRAM test_mirror_host
Hmm, it doesn’t look like this has ever worked. I submitted a problem report (TPR#18938) and will have engineering see what they can do. The workaround would be to use a data region rather than mirror:
% cat acc_mirror1.f90
MODULE my_data
IMPLICIT NONE
real , allocatable, dimension(:) :: XA
!acc mirror(XA)
END MODULE my_data
PROGRAM test_mirror_host
USE my_data
IMPLICIT NONE
INTEGER, PARAMETER :: NX=64
allocate (XA(NX))
!$acc data region copyout(XA)
!$acc region
XA = 3.1415
!$acc end region
!$acc end data region
print *,"X=", XA(NX)
END PROGRAM test_mirror_host
% pgf90 -ta=host,nvidia -Minfo acc_mirror1.f90 -o acc_mirror_unified -V12.9 ; acc_mirror_unified
test_mirror_host:
7, PGI Unified Binary version for -tp=nehalem-64 -ta=host
18, Memory set idiom, loop replaced by call to __c_mset4
test_mirror_host:
7, PGI Unified Binary version for -tp=nehalem-64 -ta=nvidia
16, Generating copyout(xa(:))
17, Generating present_or_copyout(xa(:))
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
18, Loop is parallelizable
Accelerator kernel generated
18, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
CC 1.0 : 7 registers; 44 shared, 0 constant, 0 local memory bytes
CC 2.0 : 11 registers; 0 shared, 60 constant, 0 local memory bytes
X= 3.141500
OK for the data region (it was my previous version), but I’m in an optimization phase at this point …
The XA work buffer is used many times in the code, typically for halo exchanges with MPI …
So I try to allocate it once for the whole duration of the run …
I have already done that for the CPU part …
… but for the GPU part, if I put it in a “data region”, I think the GPU memory is allocated and freed every time the code enters and exits the data region, so a lot of time is lost …
… is that right?
… and the mirror clause is just what I need …
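To illustrate the concern above (a hypothetical sketch, not from the original post; the time loop and nsteps are placeholders): a structured data region inside a loop creates and destroys the device copy of XA on every pass, whereas mirror ties the device allocation to the lifetime of the host allocatable.

```fortran
! Hypothetical sketch: inside a time loop, a structured data region
! allocates and frees the device copy of XA on every iteration.
do step = 1, nsteps
   !$acc data region copy(XA)   ! device memory for XA allocated here ...
   !$acc region
   XA = XA + 1.0
   !$acc end region
   !$acc end data region        ! ... and freed here, every iteration
end do

! With mirror, the device copy lives as long as the host array:
!   allocate(XA(NX))    ! also allocates the device mirror
!   ... many regions and !$acc update calls, no realloc/free ...
!   deallocate(XA)      ! frees both host and device copies
```

Hoisting the data region outside the loop avoids the repeated allocation too, but mirror does it once for the whole program, which is what the halo-exchange buffer needs.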
Correct. You can still use mirror, but just not within a unified binary.
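As a sketch of what building without a unified binary looks like (the source file name here is hypothetical; the post above only shows the unified build with -ta=host,nvidia): target the NVIDIA device alone, and the mirror directive can stay in.

```shell
# Build for the NVIDIA target only -- no -ta=host,nvidia unified binary --
# so the mirror directive is usable. File name is a placeholder.
pgf90 -ta=nvidia -Minfo acc_mirror.f90 -o acc_mirror
./acc_mirror
```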