pgi13.3 PB , mirror/host data asynchronously updated ??

Hello .

A strange/random problem with a simple case of management of Module data with mirror declaration and allocation via a module routine .

No problem with pgi/13.1
Problem with pgi/13.3

Test done on a GeForce GTX 470 with cuda5.0

After the update host the result must by 100.0 ( randomly some time the answer could be OK )

 test_update
 BEFORE UPDATE AA=    0.000000    
 AFTER  UPDATE AA=    0.000000

If run with PGI_ACC_SYNCHRONOUS=1, all is OK again … but the code don’t use any async directive !

 
 PGI_ACC_SYNCHRONOUS=1 test_update
 BEFORE UPDATE AA=    0.000000    
 AFTER  UPDATE AA=    100.0000

The answer is also OK , if the allocate is done directly in the main program ( but not usable in a real application with hundred variable )


Compilation

 pgf90 -ta=nvidia test_update.f90 -o test_update

Source

 
MODULE MODD_DATA
IMPLICIT NONE
  INTEGER, PARAMETER :: Nvec=2048
  REAL , ALLOCATABLE, DIMENSION(:) :: AA
  !$acc mirror(AA)
CONTAINS

  SUBROUTINE ALLOC_DATA_MODULE()
    IMPLICIT NONE
    ALLOCATE( AA(Nvec) )
  END SUBROUTINE ALLOC_DATA_MODULE

  SUBROUTINE INIT_DATA(XTAB,XVAL)
    IMPLICIT NONE
    REAL , DIMENSION(:)              :: XTAB   
    !$acc reflected (XTAB)
    REAL                             :: XVAL

    !$acc kernels
    XTAB = XVAL
    !$acc end kernels
 
  END SUBROUTINE INIT_DATA
END MODULE MODD_DATA

PROGRAM TEST_ASYNC
USE MODD_DATA
  IMPLICIT NONE

  CALL ALLOC_DATA_MODULE()
!!$    ALLOCATE( AA(Nvec) )

  CALL INIT_DATA(AA, 10.0 )

  print*, "BEFORE UPDATE AA=" , AA(Nvec) ;  call flush(6)

  !$acc update host(AA)
  AA(Nvec) =   AA(Nvec) * 10.0
  print*, "AFTER  UPDATE AA=" , AA(Nvec) ; call flush(6)

END PROGRAM TEST_ASYNC

A+

Juan

Hi Juan,

I think what’s happening is that the update isn’t completing before the variable is being used in the calculation. However, I’m only able to reproduce this issue in 13.2. 13.3 seems ok. Granted, I was getting intermittent correct answers with 13.2 until I initialize the host copy, so 13.3 may be a false positive as well, but I’d like you to please double check.

Thanks,
Mat


% cat test_update.f90 
MODULE MODD_DATA
IMPLICIT NONE
  INTEGER, PARAMETER :: Nvec=2048
  REAL , ALLOCATABLE, DIMENSION(:) :: AA
  !$acc mirror(AA)
CONTAINS

  SUBROUTINE ALLOC_DATA_MODULE()
    IMPLICIT NONE
    ALLOCATE( AA(Nvec) )
  END SUBROUTINE ALLOC_DATA_MODULE

  SUBROUTINE INIT_DATA(XTAB,XVAL)
    IMPLICIT NONE
    REAL , DIMENSION(:)              :: XTAB   
    !$acc reflected (XTAB)
    REAL                             :: XVAL

    !$acc kernels
    XTAB = XVAL
    !$acc end kernels
 
  END SUBROUTINE INIT_DATA
END MODULE MODD_DATA

PROGRAM TEST_ASYNC
USE MODD_DATA
  IMPLICIT NONE
  CALL ALLOC_DATA_MODULE()
!!$    ALLOCATE( AA(Nvec) )

  CALL INIT_DATA(AA, 10.0 )
  AA=10.0
  !acc update host(AA)  << Moving the host update here works around the issue
  print*, "BEFORE UPDATE AA=" , AA(Nvec) ;  call flush(6)
  
  !$acc update host(AA)
  !acc wait   << This seems to also work around the issue in 13.2

  AA(Nvec) =   AA(Nvec) * 10.0
  print*, "AFTER  UPDATE AA=" , AA(Nvec) ; call flush(6)

END PROGRAM TEST_ASYNC
 
% pgf90 test_update.f90 -o test_update -V13.2 -ta=nvidia  ; test_update
 BEFORE UPDATE AA=    10.00000    
 AFTER  UPDATE AA=    10.00000    
% pgf90 test_update.f90 -o test_update -V13.3 -ta=nvidia  ; test_update
 BEFORE UPDATE AA=    10.00000    
 AFTER  UPDATE AA=    100.0000

Hello Mat .

Double Check … it doesn’t work correctly with pgi13.3

Has you, before submitting the bug, I have tested the wait directive with no success … Even with 2 or 3 wait after the update it doesn’t work …

I recheck the test, and trying your suggestion …
… the only way to make it work correctly/reproductively is to put a double udpate host before using the value in the host part ?!
( initialization on the host or print have no/marginal effect )

  !$acc update host(AA)
  !$acc update host(AA)
  AA(Nvec) =   AA(Nvec) * 10.0
  print*, "AFTER  UPDATE AA=" , AA(Nvec) ; call flush(6)

Compilation with this double update :

pgf90 --version -ta=nvidia,cc13,cc20,cuda5.0 -Minfo=ccff,all,intensity -Mprof=ccff test_update_mat.f90 -o test_update_mat_133_dble_update 2>&1 | egrep "pgf90|update"
     34, Generating update host(aa(:))
     35, Generating update host(aa(:))
pgf90 13.3-0 64-bit target on x86-64 Linux -tp nehalem

Run 10 times test_update_mat_133_dble_update

 for i in $( seq 10 ) ; do PGI_ACC_SYNCHRONOUS=0 test_update_mat_133_dble_update; done
 AFTER  UPDATE AA=    100.0000    
 AFTER  UPDATE AA=    100.0000    
 AFTER  UPDATE AA=    100.0000    
 AFTER  UPDATE AA=    100.0000    
 AFTER  UPDATE AA=    100.0000    
 AFTER  UPDATE AA=    100.0000    
 AFTER  UPDATE AA=    100.0000    
 AFTER  UPDATE AA=    100.0000    
 AFTER  UPDATE AA=    100.0000    
 AFTER  UPDATE AA=    100.0000

Compilation with one update

 pgf90 --version -ta=nvidia,cc13,cc20,cuda5.0 -Minfo=ccff,all,intensity -Mprof=ccff test_update_mat.f90 -o test_update_mat_133_one_update 2>&1 | egrep "pgf90|update"
     35, Generating update host(aa(:))
pgf90 13.3-0 64-bit target on x86-64 Linux -tp nehalem

Run 10 times test_update_mat_133_one_update

 for i in $( seq 10 ) ; do PGI_ACC_SYNCHRONOUS=0 test_update_mat_133_one_update; done
 AFTER  UPDATE AA=    10.00000    
 AFTER  UPDATE AA=    10.00000    
 AFTER  UPDATE AA=    10.00000    
 AFTER  UPDATE AA=    10.00000    
 AFTER  UPDATE AA=    10.00000    
 AFTER  UPDATE AA=    10.00000    
 AFTER  UPDATE AA=    10.00000    
 AFTER  UPDATE AA=    10.00000    
 AFTER  UPDATE AA=    10.00000    
 AFTER  UPDATE AA=    10.00000

Another point to show that the pgi13.3 is in fault, the same executable produce a cuMemcpyDtoHAsync error in a previous generation of GPU on a GXT280

 pgaccelinfo 
CUDA Driver Version:           5000
NVRM version: NVIDIA UNIX x86_64 Kernel Module  304.54  Sat Sep 29 00:05:49 PDT 2012

CUDA Device Number:            0
Device Name:                   GeForce GTX 280
Device Revision Number:        1.3
...
 for i in $( seq 10 ) ; do PGI_ACC_SYNCHRONOUS=0 test_update_mat_133_one_update; done
call to cuMemcpyDtoHAsync returned error 1: Invalid value
call to cuMemcpyDtoHAsync returned error 1: Invalid value
call to cuMemcpyDtoHAsync returned error 1: Invalid value
call to cuMemcpyDtoHAsync returned error 1: Invalid value
call to cuMemcpyDtoHAsync returned error 1: Invalid value
call to cuMemcpyDtoHAsync returned error 1: Invalid value
call to cuMemcpyDtoHAsync returned error 1: Invalid value
call to cuMemcpyDtoHAsync returned error 1: Invalid value
call to cuMemcpyDtoHAsync returned error 1: Invalid value
call to cuMemcpyDtoHAsync returned error 1: Invalid value

No problem with the double update on this platform to …

I post the simplified source code again to be sure we are doing the same test :

MODULE MODD_DATA
IMPLICIT NONE
  INTEGER, PARAMETER :: Nvec=2048
  REAL , ALLOCATABLE, DIMENSION(:) :: AA
  !$acc mirror(AA)
CONTAINS

  SUBROUTINE ALLOC_DATA_MODULE()
    IMPLICIT NONE
    ALLOCATE( AA(Nvec) )
  END SUBROUTINE ALLOC_DATA_MODULE

  SUBROUTINE INIT_DATA(XTAB,XVAL)
    IMPLICIT NONE
    REAL , DIMENSION(:)              :: XTAB   
    !$acc reflected (XTAB)
    REAL                             :: XVAL

    !$acc kernels
    XTAB = XVAL
    !$acc end kernels
 
  END SUBROUTINE INIT_DATA
END MODULE MODD_DATA

PROGRAM TEST_ASYNC
USE MODD_DATA
  IMPLICIT NONE

  CALL ALLOC_DATA_MODULE()

  CALL INIT_DATA(AA, 10.0 )

  !acc update host(AA)
  !$acc update host(AA)
  AA(Nvec) =   AA(Nvec) * 10.0
  print*, "AFTER  UPDATE AA=" , AA(Nvec) ; call flush(6)

END PROGRAM TEST_ASYNC

A+

Juan

Hi Juan,

I’m still not able to consistently reproduce this error leading me to believe that there is a UMR or synchronization issue going on. In 13.2 it fails only part of the time and in 13.3 it only fails for me when I compile to a generic CPU target (-tp=px-64). Though, that should be enough for our engineers to work with to determine the cause. I’ve added TPR#19269 to track this issue.

Another work around would be to move to using OpenACC data regions and the present clause:

$ cat test_update_mat2.f90 
MODULE MODD_DATA
IMPLICIT NONE
  INTEGER, PARAMETER :: Nvec=2048
  REAL , ALLOCATABLE, DIMENSION(:) :: AA
  !acc mirror(AA)
CONTAINS

  SUBROUTINE ALLOC_DATA_MODULE()
    IMPLICIT NONE
    ALLOCATE( AA(Nvec) )
  END SUBROUTINE ALLOC_DATA_MODULE

  SUBROUTINE INIT_DATA(XTAB,XVAL)
    IMPLICIT NONE
    REAL , DIMENSION(:)              :: XTAB   
    !acc reflected (XTAB)
    REAL                             :: XVAL

    !$acc kernels present(XTAB)
    XTAB = XVAL
    !$acc end kernels
 
  END SUBROUTINE INIT_DATA
END MODULE MODD_DATA

PROGRAM TEST_ASYNC
  USE MODD_DATA
  IMPLICIT NONE

  CALL ALLOC_DATA_MODULE()

!$acc data create(AA)

  CALL INIT_DATA(AA, 10.0 )

  !acc update host(AA)
  !$acc update host(AA)
  AA(Nvec) =   AA(Nvec) * 10.0
  print*, "AFTER  UPDATE AA=" , AA(Nvec) ; call flush(6)

!$acc end data

END PROGRAM TEST_ASYNC

Best Regards,
Mat

Hello Mat , thank you for the bug report …

;-) The create(AA, … ) directive is not really applicable in our code, it will become a directive spanning over 100 pages !

It’s definitively a ‘Warm-up/Foo’ problem …
With my previous version with mirror data ( ;-) waiting for device_resident … )
just adding an unused foo mirrored array a the beging of the main program solve the issue .

PROGRAM TEST_ASYNC
USE MODD_DATA
  IMPLICIT NONE

  INTEGER, ALLOCATABLE, DIMENSION(:) :: ifoo
  !$acc mirror(ifoo)
  ALLOCATE (ifoo(1))

...

;-) So now Next Level …

In my Original unit test the goal was to have a generic ALLOC_DATA_MODULE(), with an argument (XTAB) callable with data mirrored in data module=AA , something like this :

  SUBROUTINE ALLOC_DATA_MODULE(XTAB)
    IMPLICIT NONE
    REAL , ALLOCATABLE, DIMENSION(:) :: XTAB 
    !$acc declare mirror (XTAB)  

    ALLOCATE( XTAB(Nvec) )

  END SUBROUTINE ALLOC_DATA_MODULE

This doesn’t work , because the device part is deallocated at the exit of the alloc_data routine

 PGI_ACC_DEBUG=3 test_update_generic 2>&1 | egrep -n 'mirror.*xtab'
19:pgi_uacc_mirror_alloc(size=2048,elemsize=4,hosthandle=0x16172e0,lineno=13,name=xtab)
25:pgi_uacc_mirror_alloc(size=2048,elemsize=4,lineno=13,name=xtab) returns 0xb00300000
26:pgi_uacc_mirror_dealloc(ptr=0xb00300000,lineno=15,name=xtab)

Will this be possible with the futur OpenAcc directives like device_resident
( or later with enter/exit data in openacc 2.0 ? )

A+
Juan

Hi Juan,

( or later with enter/exit data in openacc 2.0 ? )

I think the non-structured data regions are the way to go.

I have a similar issue in a C benchmark where the data is malloc’d in an initialization routine, but I had to put an outer data region in the main program to get the device copy to span across the entire program. What I really wanted to do is couple the host malloc with the copy to the device, and with an “enter data” directive, I’ll be able to do this.

  • Mat

Ok , thank you Mat .

A+
Juan

Juan,

TPR 19269 is fixed in the current 13.5 release.

thanks for the report.

dave