Failure when using OpenACC after MPI_Init

I have a code that fails with the following error if OpenACC functions are used after MPI_Init:

Failing in Thread:0
call to cuInit returned error 304: Other

But if I make at least one OpenACC function call before MPI_Init, it works correctly.
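
In outline, the difference is just where MPI_Init sits relative to the first OpenACC runtime call; roughly this (declarations omitted):

   ! Fails on our system: first OpenACC runtime call made after MPI_Init
   call MPI_Init(ierror)
   ngpus = acc_get_num_devices(acc_device_nvidia)   ! cuInit error 304 reported here

   ! Works: at least one OpenACC runtime call made before MPI_Init
   ngpus = acc_get_num_devices(acc_device_nvidia)
   call MPI_Init(ierror)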

This is using nvfortran 21.2 and OpenMPI 3.1.5, and compiled with:
mpif90 -acc=verystrict -Minfo=accel -gpu=managed,cc70 -O2 -gopt -cpp -mcmodel=medium -Mlarge_arrays -Kieee -fast -tp=px
on a V100 with CUDA 10.2 installed.

Is there anything I should know about using OpenACC with an MPI code?

I can provide the source if required.

Hi Adrian,

I typically delay using any OpenACC constructs until after I call MPI_Init, so it’s unclear why this isn’t working correctly for you. I do, however, use the following boilerplate code to set the device number so that each rank uses a different device. Setting the device number is optional, but without it every rank would use the same default device.

I can provide the source if required.

That would be helpful in understanding the issue.

Here’s an example of what I typically do when using MPI+OpenACC. I’m using a system with 4 V100s.

% cat test_mpi_acc.f90
      PROGRAM test
      use mpi
      use openacc
      implicit none

      integer :: rank, world_size
      integer :: dev, devNum, local_rank, local_comm
      integer :: devtype, ierr
      integer, dimension(:), allocatable :: Arr
      integer :: asize, i

      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, world_size, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

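      ! Split the world communicator by node (shared-memory domain) so that
      ! local_rank is the rank index within the node; this index is then
      ! mapped round-robin onto the visible devices below.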
      call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
           MPI_INFO_NULL, local_comm,ierr)
      call MPI_Comm_rank(local_comm, local_rank,ierr)
      devtype = acc_get_device_type()
      devNum = acc_get_num_devices(devtype)
      dev = mod(local_rank,devNum)
      call acc_set_device_num(dev, devtype)
      dev = acc_get_device_num(devtype)
      print *, "Rank ", rank, " using Device ", dev, " out of ", devNum
      asize = 1024
      allocate(Arr(asize))
!$acc kernels loop copyout(Arr)
      do i=1,asize
         Arr(i) = i+rank
      enddo
      print *, "Rank ", rank, " A(10)=", Arr(10)
      deallocate(Arr)

      call MPI_FINALIZE(ierr)
      END PROGRAM

% mpif90 -V21.2 -acc -fast test_mpi_acc.f90                                                                                  
% mpirun -np 4 a.out
 Rank             3  using Device             3  out of             4
 Rank             0  using Device             0  out of             4
 Rank             2  using Device             2  out of             4
 Rank             1  using Device             1  out of             4
 Rank             0  A(10)=           10
 Rank             3  A(10)=           13
 Rank             2  A(10)=           12
 Rank             1  A(10)=           11
% mpirun -np 8 a.out
 Rank             0  using Device             0  out of             4
 Rank             1  using Device             1  out of             4
 Rank             3  using Device             3  out of             4
 Rank             6  using Device             2  out of             4
 Rank             5  using Device             1  out of             4
 Rank             7  using Device             3  out of             4
 Rank             2  using Device             2  out of             4
 Rank             4  using Device             0  out of             4
 Rank             0  A(10)=           10
 Rank             1  A(10)=           11
 Rank             3  A(10)=           13
 Rank             6  A(10)=           16
 Rank             5  A(10)=           15
 Rank             7  A(10)=           17
 Rank             4  A(10)=           14
 Rank             2  A(10)=           12

-Mat

Thanks, I’ll try your code as well.

This is the code I’ve been using (it isn’t my own code; it’s something I took from the internet and modified):

program matrix_multiply
!   use omp_lib
   use openacc
   use mpi
   implicit none
   integer :: i, j, k, myid, m, n, compiled_for, option
   integer, parameter :: fd = 11
   integer :: t1, t2, dt, count_rate, count_max
   real, allocatable, dimension(:,:) :: a, b, c
   real :: tmp, secs
   integer :: ngpus
   integer :: ierror, provided

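! Calling MPI_Init here, before the first OpenACC runtime call below,
! is the ordering that triggers the cuInit 304 failure on our system: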
!   call MPI_Init(ierror)

   open(fd,file='wallclocktime',form='formatted')

   option = compiled_for(fd) ! 1-serial, 2-OpenMP, 3-OpenACC, 4-both

   write(*,*) 'before acc_get_num_devices'
   ngpus = acc_get_num_devices( acc_device_nvidia )
   write(*,*) 'after acc_get_num_devices',ngpus

   call MPI_Init(ierror)

!$omp parallel
!$    myid = OMP_GET_THREAD_NUM()
!$    if (myid .eq. 0) then
!$      write(fd,"('Number of procs is ',i4)") OMP_GET_NUM_THREADS()
!$    endif
!$omp end parallel

   call system_clock(count_max=count_max, count_rate=count_rate)

   do m=1,4    ! compute for different size matrix multiplies

      call system_clock(t1)

      n = 1000*2**(m-1)    ! 1000, 2000, 4000, 8000
      allocate( a(n,n), b(n,n), c(n,n) )

      ! Initialize matrices
      do j=1,n
         do i=1,n
            a(i,j) = real(i + j)
            b(i,j) = real(i - j)
         enddo
      enddo

!$omp parallel do shared(a,b,c,n,tmp) reduction(+: tmp)
!$acc data copyin(a,b) copy(c)
!$acc kernels
      ! Compute matrix multiplication.
      do j=1,n
         do i=1,n
            tmp = 0.0  ! enables ACC parallelism for k-loop
            do k=1,n
               tmp = tmp + a(i,k) * b(k,j)
            enddo
            c(i,j) = tmp
         enddo
      enddo
!$acc end kernels
!$acc end data
!$omp end parallel do

      call system_clock(t2)
      dt = t2-t1
      secs = real(dt)/real(count_rate)
      write(fd,"('For n=',i4,', wall clock time is ',f12.2,' seconds')") &
           n, secs

      deallocate(a, b, c)

   enddo

   close(fd)

   call MPI_Finalize(ierror)

end program matrix_multiply

integer function compiled_for(fd)
implicit none
integer :: fd
#if defined _OPENMP && defined _OPENACC
  compiled_for = 4
  write(fd,"('This code is compiled with OpenMP & OpenACC')")
#elif defined _OPENACC
  compiled_for = 3
  write(fd,"('This code is compiled with OpenACC')")
#elif defined _OPENMP
  compiled_for = 2
  write(fd,"('This code is compiled with OpenMP')")
#else
  compiled_for = 1
  write(fd,"('This code is compiled for serial operations')")
#endif

end function compiled_for

As posted, with MPI_Init called after the acc_get_num_devices call, the code works; moving the MPI_Init up to the commented-out position before that call is what breaks it on the system I’m using.

I’ve tested your code and it fails in the same way for me on our system, with these kinds of errors:

bash-4.4$ srun -n 4 ./test_mpi_acc
Failing in Thread:0
call to cuInit returned error 304: Other

Failing in Thread:0
call to cuInit returned error 304: Other

Failing in Thread:0
call to cuInit returned error 304: Other

Failing in Thread:0
call to cuInit returned error 304: Other

srun: error: r2i7n1: tasks 0,2-3: Exited with exit code 1
srun: error: r2i7n1: task 1: Exited with exit code 1
bash-4.4$ mpirun -n 4 ./test_mpi_acc
Failing in Thread:0
call to cuInit returned error 304: Other

Failing in Thread:0
call to cuInit returned error 304: Other

Failing in Thread:0
call to cuInit returned error 304: Other

Failing in Thread:0
call to cuInit returned error 304: Other

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[33554,1],1]
  Exit code:    1
--------------------------------------------------------------------------

It could be we’ve got a strange setup or misconfiguration on the system.

I tested your code and it works fine for me, so yes, I think it’s something with your system config.

For the SLURM systems I’ve used, I’ve needed to add the “-G” option to select the number of GPUs to use. I don’t know how your system is set up, but could this be the issue?
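
For example, something along these lines (the exact flags depend on how your Slurm installation is configured):

% srun -n 4 -G 4 ./test_mpi_acc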

Thanks Mat. We’re using the Slurm GRES setting to specify the GPUs, which should negate the need for -G, but I’ll give it a try.
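
For reference, our job scripts request the GPUs roughly like this (the exact GRES string is specific to our cluster):

#SBATCH --gres=gpu:4
srun -n 4 ./test_mpi_acc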

Hi Mat,

The -G didn’t change anything.

Looking at what’s going on in strace, it looks to me like it’s failing at this point (although strace output isn’t straightforward to isolate issues from):

getpid()                                = 76525
stat("/proc/76525/ns/pid", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
stat("/proc/76525/ns/pid", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
socket(AF_UNIX, SOCK_SEQPACKET|SOCK_CLOEXEC, 0) = 63
unlink("")                              = -1 ENOENT (No such file or directory)
bind(63, {sa_family=AF_UNIX, sun_path=@"cuda-uvmfd-4026531836-76525\0"}, 31) = -1 EADDRINUSE (Address already in use)
close(63)                               = 0

Are there people at NVIDIA who would be able to comment on what’s going on here?

Thanks