Failure when using OpenACC after MPI_Init

I have a code that fails with the following error if OpenACC functions are used after MPI_Init:

Failing in Thread:0
call to cuInit returned error 304: Other

But if I make at least one OpenACC function call before MPI_Init, it works correctly.
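
In outline, the difference is just where MPI_Init sits relative to the first OpenACC runtime call; roughly this (declarations omitted):

   ! Fails on our system: first OpenACC runtime call made after MPI_Init
   call MPI_Init(ierror)
   ngpus = acc_get_num_devices(acc_device_nvidia)   ! cuInit error 304 reported here

   ! Works: at least one OpenACC runtime call made before MPI_Init
   ngpus = acc_get_num_devices(acc_device_nvidia)
   call MPI_Init(ierror)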

This is using nvfortran 21.2 and OpenMPI 3.1.5, and compiled with:
mpif90 -acc=verystrict -Minfo=accel -gpu=managed,cc70 -O2 -gopt -cpp -mcmodel=medium -Mlarge_arrays -Kieee -fast -tp=px
on a V100 with CUDA 10.2 installed.

Is there anything I should know about using OpenACC with an MPI code?

I can provide the source if required.

Hi Adrian,

I typically delay using any OpenACC constructs until after I call MPI_Init, so it’s unclear why this isn’t working correctly for you. I do, however, use the following boilerplate code to set the device number so that each rank uses a different device. Setting the device number is optional, but without it every rank would use the same default device.

I can provide the source if required.

That would be helpful in understanding the issue.

Here’s an example of what I typically do when using MPI+OpenACC. I’m using a system with 4 V100s.

% cat test_mpi_acc.f90
      PROGRAM test
      use mpi
      use openacc
      implicit none

      integer :: rank, world_size
      integer :: dev, devNum, local_rank, local_comm
      integer :: devtype, ierr
      integer, dimension(:), allocatable :: Arr
      integer :: asize, i

      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, world_size, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

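      ! Split the world communicator by node (shared-memory domain) so that
      ! local_rank is the rank index within the node; this index is then
      ! mapped round-robin onto the visible devices below.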
      call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
           MPI_INFO_NULL, local_comm,ierr)
      call MPI_Comm_rank(local_comm, local_rank,ierr)
      devtype = acc_get_device_type()
      devNum = acc_get_num_devices(devtype)
      dev = mod(local_rank,devNum)
      call acc_set_device_num(dev, devtype)
      dev = acc_get_device_num(devtype)
      print *, "Rank ", rank, " using Device ", dev, " out of ", devNum
      asize = 1024
      allocate(Arr(asize))
!$acc kernels loop copyout(Arr)
      do i=1,asize
         Arr(i) = i+rank
      enddo
      print *, "Rank ", rank, " A(10)=", Arr(10)
      deallocate(Arr)

      call MPI_FINALIZE(ierr)
      END PROGRAM

% mpif90 -V21.2 -acc -fast test_mpi_acc.f90                                                                                  
% mpirun -np 4 a.out
 Rank             3  using Device             3  out of             4
 Rank             0  using Device             0  out of             4
 Rank             2  using Device             2  out of             4
 Rank             1  using Device             1  out of             4
 Rank             0  A(10)=           10
 Rank             3  A(10)=           13
 Rank             2  A(10)=           12
 Rank             1  A(10)=           11
% mpirun -np 8 a.out
 Rank             0  using Device             0  out of             4
 Rank             1  using Device             1  out of             4
 Rank             3  using Device             3  out of             4
 Rank             6  using Device             2  out of             4
 Rank             5  using Device             1  out of             4
 Rank             7  using Device             3  out of             4
 Rank             2  using Device             2  out of             4
 Rank             4  using Device             0  out of             4
 Rank             0  A(10)=           10
 Rank             1  A(10)=           11
 Rank             3  A(10)=           13
 Rank             6  A(10)=           16
 Rank             5  A(10)=           15
 Rank             7  A(10)=           17
 Rank             4  A(10)=           14
 Rank             2  A(10)=           12

-Mat

Thanks, I’ll try your code as well.

This is the code I’ve been using (it isn’t my own code; it’s something I took from the internet and modified):

program matrix_multiply
!   use omp_lib
   use openacc
   use mpi
   implicit none
   integer :: i, j, k, myid, m, n, compiled_for, option
   integer, parameter :: fd = 11
   integer :: t1, t2, dt, count_rate, count_max
   real, allocatable, dimension(:,:) :: a, b, c
   real :: tmp, secs
   integer :: ngpus
   integer :: ierror, provided

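! Calling MPI_Init here, before the first OpenACC runtime call below,
! is the ordering that triggers the cuInit 304 failure on our system: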
!   call MPI_Init(ierror)

   open(fd,file='wallclocktime',form='formatted')

   option = compiled_for(fd) ! 1-serial, 2-OpenMP, 3-OpenACC, 4-both

   write(*,*) 'before acc_get_num_devices'
   ngpus = acc_get_num_devices( acc_device_nvidia )
   write(*,*) 'after acc_get_num_devices',ngpus

   call MPI_Init(ierror)

!$omp parallel
!$    myid = OMP_GET_THREAD_NUM()
!$    if (myid .eq. 0) then
!$      write(fd,"('Number of procs is ',i4)") OMP_GET_NUM_THREADS()
!$    endif
!$omp end parallel

   call system_clock(count_max=count_max, count_rate=count_rate)

   do m=1,4    ! compute for different size matrix multiplies

      call system_clock(t1)

      n = 1000*2**(m-1)    ! 1000, 2000, 4000, 8000
      allocate( a(n,n), b(n,n), c(n,n) )

      ! Initialize matrices
      do j=1,n
         do i=1,n
            a(i,j) = real(i + j)
            b(i,j) = real(i - j)
         enddo
      enddo

!$omp parallel do shared(a,b,c,n,tmp) reduction(+: tmp)
!$acc data copyin(a,b) copy(c)
!$acc kernels
      ! Compute matrix multiplication.
      do j=1,n
         do i=1,n
            tmp = 0.0  ! enables ACC parallelism for k-loop
            do k=1,n
               tmp = tmp + a(i,k) * b(k,j)
            enddo
            c(i,j) = tmp
         enddo
      enddo
!$acc end kernels
!$acc end data
!$omp end parallel do

      call system_clock(t2)
      dt = t2-t1
      secs = real(dt)/real(count_rate)
      write(fd,"('For n=',i4,', wall clock time is ',f12.2,' seconds')") &
           n, secs

      deallocate(a, b, c)

   enddo

   close(fd)

   call MPI_Finalize(ierror)

end program matrix_multiply

integer function compiled_for(fd)
implicit none
integer :: fd
#if defined _OPENMP && defined _OPENACC
  compiled_for = 4
  write(fd,"('This code is compiled with OpenMP & OpenACC')")
#elif defined _OPENACC
  compiled_for = 3
  write(fd,"('This code is compiled with OpenACC')")
#elif defined _OPENMP
  compiled_for = 2
  write(fd,"('This code is compiled with OpenMP')")
#else
  compiled_for = 1
  write(fd,"('This code is compiled for serial operations')")
#endif

end function compiled_for

As posted, with MPI_Init called after the acc_get_num_devices call, the code works; moving the MPI_Init up to the commented-out position before that call is what breaks it on the system I’m using.

I’ve tested your code and it fails in the same way for me on our system, with these kinds of errors:

bash-4.4$ srun -n 4 ./test_mpi_acc
Failing in Thread:0
call to cuInit returned error 304: Other

Failing in Thread:0
call to cuInit returned error 304: Other

Failing in Thread:0
call to cuInit returned error 304: Other

Failing in Thread:0
call to cuInit returned error 304: Other

srun: error: r2i7n1: tasks 0,2-3: Exited with exit code 1
srun: error: r2i7n1: task 1: Exited with exit code 1
bash-4.4$ mpirun -n 4 ./test_mpi_acc
Failing in Thread:0
call to cuInit returned error 304: Other

Failing in Thread:0
call to cuInit returned error 304: Other

Failing in Thread:0
call to cuInit returned error 304: Other

Failing in Thread:0
call to cuInit returned error 304: Other

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[33554,1],1]
  Exit code:    1
--------------------------------------------------------------------------

It could be we’ve got a strange setup or misconfiguration on the system.

I tested your code and it works fine for me, so yes, I think it’s something with your system config.

For the SLURM systems I’ve used, I’ve needed to add the “-G” option to select the number of GPUs to use. I don’t know how your system is set up, but could this be the issue?
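
For example, something along these lines (the exact flags depend on how your Slurm installation is configured):

% srun -n 4 -G 4 ./test_mpi_acc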

Thanks Mat. We’re using the Slurm GRES setting to specify the GPUs, which should negate the need for -G, but I’ll give it a try.
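
For reference, our job scripts request the GPUs roughly like this (the exact GRES string is specific to our cluster):

#SBATCH --gres=gpu:4
srun -n 4 ./test_mpi_acc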

Hi Mat,

The -G didn’t change anything.

Looking at what’s going on in strace, it looks to me like it’s failing at this point (although strace output isn’t straightforward to isolate issues from):

getpid()                                = 76525
stat("/proc/76525/ns/pid", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
stat("/proc/76525/ns/pid", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
socket(AF_UNIX, SOCK_SEQPACKET|SOCK_CLOEXEC, 0) = 63
unlink("")                              = -1 ENOENT (No such file or directory)
bind(63, {sa_family=AF_UNIX, sun_path=@"cuda-uvmfd-4026531836-76525\0"}, 31) = -1 EADDRINUSE (Address already in use)
close(63)                               = 0

Are there people at NVIDIA who would be able to comment on what’s going on here?

Thanks