Problem with '!$acc update device' in omp+acc fortran code

Dear All,

Could you please help me find the problem in the following compilable code?

module storage
implicit none
real(kind=8),allocatable,dimension(:)::A
real(kind=8)::B
!$acc declare create(A,B)
end module storage

program main
use omp_lib
use openacc
use storage
implicit none
integer::ngpu,N,myid

N=10
allocate(A(1:N))
A(1:N)=1.d0
ngpu = ACC_GET_NUM_DEVICES(acc_device_default)

call omp_set_num_threads(ngpu)
!$omp parallel default(shared) private(myid,B)
  myid = OMP_GET_THREAD_NUM()
  call acc_set_device_num(myid,acc_device_default)

!$acc update device(A) 

!$acc serial copyin(N) 
    B=sum(A(1:N))
!$acc update host(B)
!$acc end serial

    print*,'myid=',myid,'B=',B

!$omp end parallel

end program main

I would expect the following output

myid= 0 B= 10.00000000000000
myid= 1 B= 10.00000000000000
myid= 2 B= 10.00000000000000
myid= 3 B= 10.00000000000000

when run on machine with 4 GPUs. However, I am getting

myid= 0 B= 10.00000000000000
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 6.0, threadid=2
host:0x605d60 device:0x2aae8a400000 size:96 presentcount:1+1 line:-1 name:_storage_16
host:0xc12300 device:0x2aae8a600000 size:80 presentcount:1+0 line:16 name:a
allocated block device:0x2aae8a600000 size:512 thread:1
deleted block device:0x2aae8a600200 size:512 thread 1
Present table dump for device[4]: NVIDIA Tesla GPU 3, compute capability 6.0, threadid=2
host:0x605d60 device:0x2aae93c00000 size:96 presentcount:0+1 line:-1 name:_storage_16
FATAL ERROR: data in update device clause was not found on device 4: name=a
file:/group/d35/ilkhom/GPU/simple.f90 main line:25

Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 6.0, threadid=4
host:0x605d60 device:0x2aae8a400000 size:96 presentcount:1+1 line:-1 name:_storage_16
host:0xc12300 device:0x2aae8a600000 size:80 presentcount:1+0 line:16 name:a
allocated block device:0x2aae8a600000 size:512 thread:1
deleted block device:0x2aae8a600200 size:512 thread 1
Present table dump for device[2]: NVIDIA Tesla GPU 1, compute capability 6.0, threadid=4
host:0x605d60 device:0x2aae78c00000 size:96 presentcount:0+1 line:-1 name:_storage_16
Present table dump for device[4]: NVIDIA Tesla GPU 3, compute capability 6.0, threadid=4
host:0x605d60 device:0x2aae93c00000 size:96 presentcount:0+1 line:-1 name:_storage_16
FATAL ERROR: data in update device clause was not found on device 2: name=a
file:/group/d35/ilkhom/GPU/simple.f90 main line:25

Failing in Thread:3
call to cuModuleGetGlobal returned error 4: Deinitialized

Hi Ilkhom,

Variables in “declare” directives get created and initialized upon load of the program, so they will be created on the default device. Hence, when you enter the serial compute region, the variables are not present on the other GPUs. Also, you’ve created the device copy of the global “B”, but each thread’s private “B” is a different variable. To fix this, you need to delay creating the variables on the device until after each thread has set its device.

Note that in experimenting with your code, it does appear that we have an issue (intermittent wrong values) when using OpenMP private variables in data regions, but it seems OK when the private variable (B) is copied as part of the compute region. I’ve issued a problem report (TPR#25965) for this error.

% cat test.F90
module storage
implicit none
real(kind=8),allocatable,dimension(:)::A
real(kind=8)::B
end module storage

program main
use omp_lib
use openacc
use storage
implicit none
integer::ngpu,N,myid

N=10
allocate(A(1:N))
A(1:N)=1.d0
#ifdef _OPENACC
ngpu = ACC_GET_NUM_DEVICES(acc_device_default)
#else
ngpu=4
#endif

call omp_set_num_threads(ngpu)
!$omp parallel default(shared) private(myid,B)
  myid = OMP_GET_THREAD_NUM()
#ifdef _OPENACC
  call acc_set_device_num(myid,acc_device_default)
#endif

!$acc enter data create(A)
!$acc update device(A)
#ifdef FAILS
!$acc enter data create(B)
#endif

!$acc serial present(A) copyout(B)
    B=sum(A(1:N))
!$acc end serial

!$acc exit data delete(A)
#ifdef FAILS
!$acc update self(B)
!$acc exit data delete(B)
#endif
    print*,'myid=',myid,'B=',B
!$omp end parallel

end program main
% pgf90 -mp test.F90 -Minfo -ta=tesla:cc70; a.out
main:
     24, Parallel region activated
     30, Generating enter data create(a(:))
     31, Generating update device(a(:))
     36, Generating copyout(b)
         Generating present(a(:))
         Accelerator serial kernel generated
         Generating Tesla code
         37, !$acc do seq
     37, sum reduction inlined
     40, Generating exit data delete(a(:))
     46, Parallel region terminated
 myid=            0 B=    10.00000000000000
 myid=            3 B=    10.00000000000000
 myid=            1 B=    10.00000000000000
 myid=            2 B=    10.00000000000000

Hope this helps,
Mat

Dear Mat,

thanks for your reply. It would be very desirable to be able to use !$acc declare create inside a module. In my main code (about 30K lines), several arrays do not change after being calculated but are heavily used by other subroutines later on. To avoid data movement, I am trying to compute these arrays once and keep them resident on all GPUs for the lifetime of the code. !$acc declare create sounded just right for this purpose.

The following code worked for me. The only difference from the original code is that I am now using call acc_set_device_num(myid*ngpu,acc_device_default) instead of call acc_set_device_num(myid,acc_device_default). I accidentally discovered that on our compute node not all 28 threads have access to the GPUs; only certain threads (0, 4, 8, 12, 16, 20, 24) do.

a090/group/d35/ilkhom/GPU> cat test.f90
module storage 
implicit none 
real(kind=8),allocatable,dimension(:)::A 
real(kind=8)::B 
!$acc declare create(A,B) 
end module storage 

program main 
use omp_lib 
use openacc 
use storage 
implicit none 
integer::ngpu,N,myid 

N=10 
allocate(A(1:N)) 
A(1:N)=1.d0 
ngpu = ACC_GET_NUM_DEVICES(acc_device_default) 

call omp_set_num_threads(ngpu) 
!$omp parallel default(shared) private(myid,B) 
  myid = OMP_GET_THREAD_NUM() 
  call acc_set_device_num(myid*ngpu,acc_device_default) 

!$acc update device(A) 

!$acc serial copyin(N) 
    B=sum(A(1:N)) 
!$acc update host(B) 
!$acc end serial 

    print*,'myid=',myid,'B=',B 

!$omp end parallel 

end program main
a090/group/d35/ilkhom/GPU> pgf90 -mp test.f90 -Minfo -ta=tesla:cc60 ; ./a.out
main:
     21, Parallel region activated
     25, Generating update device(a(:))
     27, Generating copyin(n)
         Accelerator serial kernel generated
         Generating Tesla code
         28, !$acc do seq
     28, sum reduction inlined
     34, Parallel region terminated
 myid=            0 B=    10.00000000000000     
 myid=            1 B=    10.00000000000000     
 myid=            3 B=    10.00000000000000     
 myid=            2 B=    10.00000000000000

Variables in “declare” directives get created and initialized upon load of the program so will be created on the default device.

Isn’t it possible to use “declare” directives for all available (not just the default) GPUs?

Hi Ilkhom,

The code “works” only because all the OpenMP threads are using the same device. If you pass “acc_set_device_num” a value greater than or equal to the number of available devices, the value is taken modulo the device count. Hence passing in “myid*ngpu” will actually use device 0 for all OpenMP threads.

If you used different devices, then the code would fail, since “A” is only present on device 0.
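To make the wrap-around concrete, here is a quick sketch (in Python, purely to illustrate the arithmetic; the actual wrapping happens inside the OpenACC runtime) of what each of the four threads requests versus what it gets on a four-GPU machine:

```python
ngpu = 4  # number of GPUs, as returned by acc_get_num_devices

for myid in range(ngpu):           # OpenMP thread ids 0..3
    requested = myid * ngpu        # 0, 4, 8, 12 -- the value the code passes in
    actual = requested % ngpu      # runtime wraps out-of-range device numbers
    print(f"thread {myid}: requested device {requested}, got device {actual}")
```

Every thread lands on device 0, which is why the program appears to work: all four threads share the one device on which “A” was created.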

Variables in “declare” directives get created and initialized upon load of the program so will be created on the default device.

I should be more precise: when an allocatable array is used in a “declare create” directive, it’s allocated on the device at the same time it’s allocated on the host. Since no device number has been set at the time “A” is allocated, the default (device 0) is used.

Isn’t it possible to use “declare” directives for all available (not just the default) GPUs?

No, sorry. It simply doesn’t work that way. If you are managing the device data, then you need to manage it separately for each device.

What you may want to try is using CUDA Unified Memory instead (-ta=tesla:managed). While the physical memory is still separate, the address space is unified so every GPU can access the same address.

For example:

% cat test1.F90
module storage
implicit none
real(kind=8),allocatable,dimension(:)::A
real(kind=8)::B
end module storage

program main
use omp_lib
use openacc
use storage
implicit none
integer::ngpu,N,myid

N=10
allocate(A(1:N))
A(1:N)=1.d0
ngpu = ACC_GET_NUM_DEVICES(acc_device_default)

call omp_set_num_threads(ngpu)
!$omp parallel default(shared) private(myid,B)
myid = OMP_GET_THREAD_NUM()
print *, myid, myid
call acc_set_device_num(myid,acc_get_device_type())

!$acc serial
B=sum(A(1:N))
!$acc end serial

print*,'myid=',myid,'B=',B

!$omp end parallel

end program main
% pgf90 -mp test1.F90 -Minfo -ta=tesla:cc70,managed ; ./a.out
main:
     20, Parallel region activated
     25, Generating copyin(n)
         Generating implicit copyin(a(1:10))
         Accelerator serial kernel generated
         Generating Tesla code
         26, !$acc do seq
     26, sum reduction inlined
     31, Parallel region terminated
0 0
2 2
3 3
1 1
myid= 0 B= 10.00000000000000
myid= 2 B= 10.00000000000000
myid= 3 B= 10.00000000000000
myid= 1 B= 10.00000000000000

On a side note, you can’t put update directives inside of compute regions, so the update of B is being ignored.

Also, while it doesn’t hurt, there’s no need to copyin N. Scalars are firstprivate by default, with the initial value being passed in as an argument to the kernel call.

-Mat

Hi Mat,

thanks for clearing things up. I have a different question, then: is it possible to save some data to multiple GPUs at the beginning of the code so that later on all GPUs can access these data in different parts of the code (i.e. inside functions called within !$acc parallel regions)?

I thought ‘!$acc declare create’ directives inside a module would allow me to achieve this goal, though it does not necessarily have to be done this way. If you have any other suggestions, please let me know.

Cheers,
Ilkhom

Is it possible to save some data to multiple GPUs at the beginning of the code so that later on all GPUs can access these data in different parts of the code (i.e. inside functions called within !$acc parallel regions)?

Yes, you can create data on different devices and then access this data later in different parts of the code.

Note that OpenMP threads will retain their device assignment between different parallel regions.

I thought ‘!$acc declare create’ directives inside a module would allow me to achieve this goal, though it does not necessarily have to be done this way. If you have any other suggestions, please let me know.

A “declare” directive is just a data region whose scope and lifetime match the program unit in which it’s used.

I’m curious where you got the impression that declare would implicitly allocate on multiple GPUs? Is there something in our or the OpenACC documentation that could be clarified?

Thanks,
Mat

Hi Mat,

thanks for your reply.

I’m curious where you got the impression that declare would implicitly allocate on multiple GPUs? Is there something in our or the OpenACC documentation that could be clarified?

I got confused. In our MPI+OMP hybrid code we do such tricks: module variables are global within a node with shared memory, and communication between nodes is achieved via MPI.

Anyway, as per your suggestion I am trying to create and update arrays on the GPUs from the host using the enter data directive. However, I haven’t cracked it yet.
The following code produces

FATAL ERROR: data in update device clause was not found on device 3: name=c
file:/group/d35/ilkhom/GPU/test2.f90 main line:34

I can’t figure out why.

module storage
implicit none
complex(kind=8),allocatable,dimension(:,:)::A,C
complex(kind=8)::B
end module storage

program main
use omp_lib
use openacc
use storage
implicit none
integer::ngpu,N,myid,i,j
complex(kind=8)::sumA
complex(kind=8)::solve

N=10
allocate(A(1:N,1:N))
allocate(C(1:N,1:N))
do i=1,N
  do j=1,N
    A(i,j)=dcmplx(1.d0, 1.d0)
    C(i,j)=dcmplx(1.d0, 1.d0)
  enddo
enddo
ngpu = ACC_GET_NUM_DEVICES(acc_device_nvidia)
call omp_set_num_threads(ngpu)
!$omp parallel default(shared) private(myid,B)
  myid = OMP_GET_THREAD_NUM()
  call acc_set_device_num(myid,acc_device_nvidia)

!$acc enter data create(A,C)
!$acc update device(A,C)

    B=sumA(N)

!$acc exit data delete(A,C)
    print*,'myid=',myid,'B=',B
!$omp end parallel

end program main

function sumA(N)
use openacc
use storage
implicit none
integer::N
complex(kind=8)::sumA

!$acc kernels present(A,C)
    sumA=sum(A(1:N,1:N))+sum(C(1:N,1:N))
!$acc end kernels

end function sumA

Hi Ilkhom,

In our MPI+OMP hybrid code we do such tricks. Module variables are global within a node with shared memory and communications between nodes is achieved via MPI.

No worries. I just wanted to make sure there wasn’t something in our documentation that needs to be clarified.

Note that I generally recommend using MPI+OpenACC for multi-GPU programming instead of OpenMP+OpenACC. It’s more straightforward, since there’s a one-to-one association between the rank and the GPU as opposed to the one-to-many with OpenMP. Trying to manage data across multiple GPUs using one host process with many threads is tricky. Plus, some MPIs, such as Open MPI, include GPUDirect support, where communication can be done directly between GPUs rather than having to bring data back to the host.

I wrote this article a while ago on using MPI+OpenACC in Fortran, but it may still be useful: PGI Documentation Archive for Versions Prior to 17.7

There’s also this course: https://developer.nvidia.com/openacc-advanced-course. The examples use C, but the info applies to Fortran as well. It also covers CUDA Aware MPI / GPUDirect.


As for this error, it looks like a bug in the interaction of OpenACC with our older OpenMP runtime when using more than 2 threads. I’ve submitted an issue report (TPR#25976) and sent it to our engineers for investigation.

The good news is that the example works with our newer LLVM-based OpenMP runtime. Which version of the compilers are you using? With 18.4, the LLVM compilers are co-installed, so you can either set your PATH to “$PGI/linux86-64-llvm/18.4/bin” or add the “-Mllvm” flag.

% pgf90 -ta=tesla:cc70 -mp test.2.F90 -Minfo=accel -Vdev -Mllvm; a.out
main:
     31, Generating enter data create(c(:,:),a(:,:))
     32, Generating update device(c(:,:),a(:,:))
     36, Generating exit data delete(c(:,:),a(:,:))
suma:
      0, Accelerator kernel generated
         Generating Tesla code
     49, Generating present(a(:,:),c(:,:))
     50, Loop is parallelizable
         Accelerator serial kernel generated
         Accelerator kernel generated
         Generating Tesla code
         50, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
             Generating implicit reduction(other:a$r)
 myid=            3 B=  (200.0000000000000,200.0000000000000)
 myid=            1 B=  (200.0000000000000,200.0000000000000)
 myid=            2 B=  (200.0000000000000,200.0000000000000)
 myid=            0 B=  (200.0000000000000,200.0000000000000)
% a.out
 myid=            2 B=  (200.0000000000000,200.0000000000000)
 myid=            3 B=  (200.0000000000000,200.0000000000000)
 myid=            0 B=  (200.0000000000000,200.0000000000000)
 myid=            1 B=  (200.0000000000000,200.0000000000000)
% a.out
 myid=            3 B=  (200.0000000000000,200.0000000000000)
 myid=            0 B=  (200.0000000000000,200.0000000000000)
 myid=            2 B=  (200.0000000000000,200.0000000000000)
 myid=            1 B=  (200.0000000000000,200.0000000000000)

Hope this helps,
Mat

As for this error, it looks like a bug in the interaction of OpenACC with our older OpenMP runtime when using more than 2 threads. I’ve submitted an issue report (TPR#25976)



it does appear that we have an issue (intermittent wrong values) when using OpenMP private variables in data regions but seems ok when the private variable (B) is copied as part of the compute region. I issued a problem report (TPR#25965) for this error.

Both of these should be fixed in release 18.7.