Multiple GPUs with mirror and update clauses

Hi,

I am having issues when attempting to use the mirror and update device clauses with multiple GPUs. It seems as though only the first GPU is aware of the data that was reflected in a previous routine. This is also true of the initialisation of the second GPU.

More details:

I was recently getting the error “Fatal Usage Error: __pgi_cu_mirrordealloc called before __pgi_cu_init” at execution time. When I removed the deallocation (which isn’t strictly needed), I got a similar error: “Fatal Usage Error: __pgi_cu_mirroralloc called before __pgi_cu_init”. This is associated with the allocation of an array that is defined as mirrored in a separate module and then updated using the !$acc update(passed2) clause. The problem only occurs now that I am trying to run the code across two OpenMP threads.

Further tweaking showed that, despite !$acc_init being called in an OpenMP region within an earlier subroutine, the initialisation doesn’t seem to carry over to this routine. Adding an !$acc_init to this routine removed the error described above but replaced it with the following error at compile time:

PGF90-S-0155-UPDATE clause requires a visible device copy for symbol passed2 (intega.f: 27998)

This error actually seems to be related to the specification of the passed2 array as being private for the OpenMP region.

Thanks for taking a look,

Karl

Hi Karl,

I’ve used mirrored in an OpenMP program, but the variable needs to be private and only allocated after the program has entered a parallel region. A mirrored shared variable isn’t yet supported.

Though, I have never seen the specific errors you’re getting. Can you write a small reproducing example?

Thanks,
Mat

Hi Mat,

I haven’t been able to replicate the issue within a smaller piece of sample code I’m afraid.

I recently tried to bypass the issue by moving the code into a separate routine that is called from within the OpenMP region.

However, this results in some behaviour I would consider quite strange: My understanding is that the variables within a subroutine that is called from an OpenMP region are intrinsically private (unless specified otherwise). Unfortunately this does not seem to be the case as I am getting errors that can be corrected by specifying the relevant variables as private.

Am I missing something simple here?

Cheers,

Karl

Hi Karl,

“My understanding is that the variables within a subroutine that is called from an OpenMP region are intrinsically private”

Correct. Variables declared locally within a subroutine are implicitly private if the subroutine is called within an OpenMP parallel region. Hence, I suspect something else is going on, so I would need more details.
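One general case worth ruling out, offered as a standard OpenMP/Fortran rule and not as a diagnosis of Karl's code: a local variable with the SAVE attribute (including one that is initialised in its declaration, which implies SAVE) is shared between threads, as are module and COMMON variables unless they are declared threadprivate. The sketch below is a plain OpenMP illustration with no accelerator directives. The names privacy_check, probe, auto_var and saved_var are made up for this example.

```fortran
! Minimal sketch: ordinary locals are per-thread, SAVE'd locals are shared.
! Build with OpenMP enabled, e.g. "pgf90 -mp" or "gfortran -fopenmp".
module privacy_check
  use omp_lib
  implicit none
contains
  subroutine probe(auto_ok, saved_ok)
    logical, intent(out) :: auto_ok, saved_ok
    integer :: auto_var          ! ordinary local: one copy per thread
    integer, save :: saved_var   ! SAVE attribute: one copy for all threads
    integer :: me

    me = omp_get_thread_num()
    auto_var = me
    !$omp critical
    saved_var = me               ! every thread overwrites the same storage
    !$omp end critical
    !$omp barrier                ! orphaned barrier: all writes are complete
    auto_ok  = (auto_var == me)  ! true on every thread
    saved_ok = (saved_var == me) ! true only on the thread that wrote last
  end subroutine probe
end module privacy_check

program privacy_demo
  use omp_lib
  use privacy_check
  implicit none
  logical :: auto_ok, saved_ok
  integer :: n_auto_ok, n_saved_ok, nthreads

  n_auto_ok = 0; n_saved_ok = 0; nthreads = 1
  !$omp parallel private(auto_ok, saved_ok) reduction(+:n_auto_ok, n_saved_ok)
  !$omp single
  nthreads = omp_get_num_threads()
  !$omp end single
  call probe(auto_ok, saved_ok)
  if (auto_ok)  n_auto_ok  = n_auto_ok + 1
  if (saved_ok) n_saved_ok = n_saved_ok + 1
  !$omp end parallel

  print *, 'threads:', nthreads
  print *, 'threads whose ordinary local kept its own value:', n_auto_ok
  print *, 'threads whose SAVE local kept its own value:    ', n_saved_ok

  ! Thread ids are unique, so exactly one thread can match the shared copy.
  if (n_auto_ok /= nthreads) error stop 'ordinary local was not private'
  if (n_saved_ok /= 1)       error stop 'SAVE local was not shared'
end program privacy_demo
```

If a routine called from the parallel region only behaves once its variables are listed as private, checking those variables for SAVE, declaration-time initialisation, or module/COMMON scope may be a useful first step.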

Below is a small example program. Can you modify it so that it replicates the behavior you are seeing?

% cat mirror.f90 

program test
  use omp_lib
  implicit none
  integer i,thd,nthd
  
!$omp parallel do 
  do i=1,32
     call testme(i)
  enddo  

end program test

subroutine testme (i)
  use omp_lib
#ifdef _ACCEL
  use accel_lib
#endif
  implicit none
  integer :: i, ii
  integer :: thd
  real, dimension(:), allocatable :: arr
!$acc mirror(arr)
  thd = omp_get_thread_num()
#ifdef _ACCEL
  call acc_set_device_num(thd, ACC_DEVICE_NVIDIA)
#endif
  allocate(arr(32)) 
  arr=0
!$acc region
  do ii=1,32 
    arr(ii) = real(i) / (thd+ii)
  end do
!$acc end region
!$acc update host (arr)
  print *, thd, i, sum(arr)
end subroutine testme

% pgf90 -mp -Mpreprocess -Minfo=mp,accel mirror.f90 -ta=nvidia
test:
      7, Parallel region activated
      8, Parallel loop activated with static block schedule
     10, Parallel region terminated
testme:
     23, Generating local(arr(:))
     30, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     31, Loop is parallelizable
         Accelerator kernel generated
         31, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
             CC 1.0 : 11 registers; 48 shared, 40 constant, 0 local memory bytes; 33% occupancy
             CC 2.0 : 11 registers; 8 shared, 68 constant, 0 local memory bytes; 16% occupancy
     35, Generating !$acc update host(arr(:))
% setenv OMP_NUM_THREADS 4
% a.out
            3           25    57.83620    
            0            1    4.058496    
            2           17    44.50957    
            1            9    27.79918    
            3           26    60.14965    
            0            2    8.116991    
            2           18    47.12778    
            1           10    30.88798    
            3           27    62.46309    
            0            3    12.17549    
            2           19    49.74599    
            1           11    33.97678    
            3           28    64.77655    
            0            4    16.23398    
            2           20    52.36419    
            1           12    37.06558    
            3           29    67.09000    
            0            5    20.29248    
            2           21    54.98241    
            1           13    40.15438    
            3           30    69.40344    
            0            6    24.35097    
            2           22    57.60062    
            1           14    43.24318    
            3           31    71.71688    
            0            7    28.40947    
            2           23    60.21883    
            1           15    46.33197    
            3           32    74.03034    
            0            8    32.46796    
            2           24    62.83704    
            1           16    49.42077
- Mat