MPI mixing host and gpu devices with PGI accelerator

franzisko · December 1, 2011, 5:18pm

Hello,

I have an MPI program that runs in either a CPU or a GPU-accelerated version. I would like some processes to run on host CPUs and the others on GPUs. So I compile my program with -ta=nvidia,host and call acc_set_device with the argument 0 or 3, depending on the process rank.

But when I run the executable I get

Usage error: multiple calls to acc_set_device with different device types

Did I make a mistake, or is this not allowed? If it is not allowed, will it be supported in a future version, and is there a possible workaround that avoids duplicating all my subroutines?

thanks
Francesco

MatColgrove · December 2, 2011, 12:25am

Hi Francesco,

I think you mean to call “acc_set_device_num”, which sets the device number to use, rather than “acc_set_device”, which sets the device type.

Hope this helps,
Mat

franzisko · December 2, 2011, 9:09am

Hi Mat,

no, I would like to have:

n_1 MPI processes running on n_1 host cores
n_2 MPI processes running on n_2 devices

where n=n_1+n_2 is the total number of MPI processes.
Hence, I think I need to call acc_set_device. That way I can avoid duplicating subroutines for host and device compilation and rely on the dual-target compilation (-ta=nvidia,host) provided by the PGI compiler.

Is it possible?

thanks,
Francesco

MatColgrove · December 5, 2011, 7:12pm

Hi Francesco,

The error above suggests that you are calling acc_set_device multiple times using different device types. You can use acc_set_device, but it can only be called once per MPI process.

Can you post an example of your code?

Thanks,
Mat

franzisko · December 7, 2011, 2:22pm

Hi Mat,

here is a simplified test case that performs a vector addition:

main.F90

program vector_add_test

#ifdef _ACCEL
use accel_lib
#endif
use mpi
use storage
implicit none

call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world,n_rank,ierr)
call mpi_comm_size(mpi_comm_world,n_proc,ierr)

GPU = .false.

#ifdef _ACCEL
print*,'NEW n_rank,NCOREXNODE: ',n_rank,NCOREXNODE
if(mod(n_rank,NCOREXNODE) .lt. NGPUXNODE) then
  call acc_set_device(acc_device_nvidia)
  device_kind = acc_get_device()
  print*,'Selected device kind:  ',device_kind
  num_devices = acc_get_num_devices(acc_device_nvidia)
  print*,'Number of devices available: ',num_devices
  call acc_set_device_num(mod(n_rank,NCOREXNODE),acc_device_nvidia)
  print*,'n_rank: ',n_rank,' tries to set GPU: ',mod(n_rank,NCOREXNODE)
  my_device = acc_get_device_num(acc_device_nvidia)
  print*,'n_rank: ',n_rank,' is using device: ',my_device
  print*,'Set GPU to true for rank: ',n_rank
  GPU = .true.
else
  call acc_set_device(0)
endif
#endif
call MPI_BARRIER(MPI_COMM_WORLD,ierr)
print*,'ciao'
if(GPU) then ; n = n_long ; else ; n = n_short ; endif

call allocate_storage()
!allocate(a(n),b(n),c(n))

call random_number(a)  ;  call random_number(b)
if(GPU) then
print*,'updating device'
!$acc update device(a,b)
endif

call vector_add()
if(GPU) then
print*,'updating host'
!$acc update host(c)
endif
print*,'GPU?: ',GPU,' c(5): ',c(5)

call mpi_finalize(ierr)

end program vector_add_test

allocate_storage.f90

subroutine allocate_storage()

use storage
implicit none
integer :: i

allocate(a(n),b(n),c(n))

end subroutine allocate_storage

storage.f90

module storage

use accel_lib

integer, parameter :: myk = kind(1.d0)
real(myk), allocatable, dimension(:) :: a,b,c

integer, parameter :: NCOREXNODE=12
integer, parameter :: NGPUXNODE=2
integer :: num_devices
integer(acc_device_kind) :: my_device,device_kind
  
integer :: n_long=1000,n_short=150,n
integer :: ierr,n_rank,n_proc
  
logical :: GPU
  
!$acc mirror(a,b,c)
  
end module storage

vector_add.f90

subroutine vector_add()

use storage
implicit none
integer :: i

!$acc region
do i=1,n 
   c(i) = a(i) + b(i) 
enddo
!$acc end region
  
end subroutine vector_add

I compile using:

mpif90 -ta=nvidia,cc20,cuda4.0,host storage.f90 allocate_storage.f90 main.F90 vector_add.f90

Every MPI process should select its device type (host or nvidia) and, for GPU processes, the device number. The test is currently set up for a node with 12 cores and 2 GPUs: the first 2 processes use GPUs while the other 10 use CPUs.

The error at runtime is:

Usage error: multiple calls to acc_set_device with different device types

There is another curious error. If I put allocate(a(n),b(n),c(n)) directly in the main program instead of calling a subroutine, I cannot use the second GPU (even for GPU-only runs), because the first one seems to be initialized automatically and I do not know how to avoid that.

thanks a lot,
Francesco
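
The rank-to-device rule in main.F90 can be checked on its own, without MPI or a GPU. The following is a minimal sketch, assuming ranks are packed node by node so that mod(n_rank,NCOREXNODE) gives the rank's position within its node. The program name, the local_rank variable and the two-node loop are illustrative and not part of the posted code.

```fortran
program rank_mapping_sketch
  implicit none
  integer, parameter :: NCOREXNODE = 12   ! cores per node, as in storage.f90
  integer, parameter :: NGPUXNODE  = 2    ! GPUs per node, as in storage.f90
  integer :: n_rank, local_rank

  ! Stand-in for the ranks of a two-node run (assumption);
  ! a real run obtains n_rank from mpi_comm_rank.
  do n_rank = 0, 2*NCOREXNODE - 1
     ! Position of this rank within its node (assumes node-by-node placement).
     local_rank = mod(n_rank, NCOREXNODE)
     if (local_rank < NGPUXNODE) then
        ! This rank would select the nvidia device type and device local_rank.
        print '(a,i3,a,i2)', 'rank ', n_rank, ' -> nvidia device ', local_rank
     else
        ! This rank would select the host device type.
        print '(a,i3,a)', 'rank ', n_rank, ' -> host'
     end if
  end do
end program rank_mapping_sketch
```

Each rank falls into exactly one branch, so under this rule a process only ever asks for one device type.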

MatColgrove · December 7, 2011, 7:49pm

Hi Francesco,

This appears to be a problem with the “mirror” directive not being ignored in the host context. I have written a report (TPR#18352) and sent it to our engineers for further investigation.

If you remove the “mirror” and “update” directives, then the code works as expected. Hopefully you can use this as a workaround until we get this fixed.

Thank you for the report!
Mat
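
A minimal sketch of the vector addition with the “mirror” and “update” directives removed, as suggested above. It assumes the compiler generates the data transfers for the region on its own, and it uses a single program with local arrays rather than the storage module, purely for illustration. The program name is hypothetical. With a compiler that does not recognize the !$acc lines they are plain comments, so the loop simply runs on the host.

```fortran
program vector_add_no_mirror
  implicit none
  integer, parameter :: myk = kind(1.d0)
  integer, parameter :: n = 1000          ! same value as n_long in storage.f90
  real(myk), allocatable, dimension(:) :: a, b, c
  integer :: i

  allocate(a(n), b(n), c(n))
  call random_number(a)
  call random_number(b)

  ! No mirror or update directives: the region is the only accelerator construct.
  !$acc region
  do i = 1, n
     c(i) = a(i) + b(i)
  enddo
  !$acc end region

  ! Compare one element against the value computed directly on the host.
  print *, 'c(5): ', c(5), ' a(5)+b(5): ', a(5) + b(5)

  deallocate(a, b, c)
end program vector_add_no_mirror
```

The trade-off is that the arrays no longer stay resident on the device between regions, so every region entry and exit may transfer data.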
