MPI send + OpenACC + acc_malloc fail with NVFortran, but work with C

Dear all,

I am trying to use MPI communications (CUDA-aware MPI, NVFortran 24.7-0, on a dual-GPU node) together with OpenACC, more specifically with device buffers allocated by acc_malloc. Unfortunately, the test fails: the resulting error complains that the send/receive buffer is not mapped on the device.

Below I report the complete minimal test; it is simple:

  1. initialize the MPI and device environments;
  2. allocate buffer_dev on the devices using acc_malloc (through an iso_c_binding interface);
  3. populate buffer_dev with an OpenACC loop, using different values on each device;
  4. do the MPI send/receive;
  5. check the result of the communication.

Unfortunately, the send instruction raises an error: Caught signal 11 (Segmentation fault: address not mapped to object at address….

This test tries to mimic the C example posted in this old thread:

Here is my Fortran test:

program test_deviceptr_mpi
use :: iso_c_binding
use :: mpi
use :: openacc

implicit none

integer                   :: sizes(3)=[1,2,3]  ! arrays sizes
real, pointer             :: buffer_dev(:,:,:) ! device work array
real, allocatable, target :: buffer_hos(:,:,:) ! host work array
type(c_ptr)               :: cptr              ! c-pointer
integer(c_size_t)         :: bytes             ! number of bytes of arryas
integer                   :: ierr              ! error status
integer                   :: procs_number      ! MPI processes number
integer                   :: myrank            ! MPI current ID
character(:), allocatable :: myrankstr         ! MPI ID stringified
integer                   :: local_comm        ! MPI local communicator
integer                   :: local_rank        ! local MPI split ID
integer                   :: devices_number    ! devices number
integer                   :: mydev             ! device current ID
integer                   :: i, j, k           ! counters

interface
   function acc_malloc_f(total_byte_dim) bind(c, name="acc_malloc")
   use iso_c_binding, only : c_ptr, c_size_t
   implicit none
   type(c_ptr)                          :: acc_malloc_f
   integer(c_size_t), value, intent(in) :: total_byte_dim
   endfunction acc_malloc_f

   subroutine acc_memcpy_from_device_f(host_ptr, dev_ptr, total_byte_dim) bind(c, name="acc_memcpy_from_device")
   use iso_c_binding, only : c_ptr, c_size_t
   implicit none
   type(c_ptr),       value :: host_ptr
   type(c_ptr),       value :: dev_ptr
   integer(c_size_t), value :: total_byte_dim
   endsubroutine acc_memcpy_from_device_f
endinterface

! initialize MPI and devices env
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, procs_number, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
call MPI_COMM_SPLIT_TYPE(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, local_comm, ierr)
call MPI_COMM_RANK(local_comm, local_rank, ierr)
myrankstr = repeat(' ',5)
write(myrankstr, '(I5.5)') myrank
myrankstr = 'proc'//trim(adjustl(myrankstr))//':'
devices_number = acc_get_num_devices(acc_device_nvidia)
mydev = mod(local_rank, devices_number)
call acc_set_device_num(mydev, acc_device_nvidia)
call acc_init(acc_device_nvidia)

print '(A,2I2)', myrankstr//' devices number, mydev', devices_number, mydev
call MPI_BARRIER(MPI_COMM_WORLD, ierr)

! allocate work arrays on host and devices
bytes = int(storage_size(buffer_dev)/8, c_size_t) * int(product(sizes), c_size_t)
cptr = acc_malloc_f(bytes)
if (c_associated(cptr)) call c_f_pointer(cptr, buffer_dev, shape=sizes)
allocate(buffer_hos(sizes(1),sizes(2),sizes(3))) ; buffer_hos = -1.0

! prepare buffer_dev array
!$acc parallel loop collapse(3) deviceptr(buffer_dev)
do k=1, sizes(3)
   do j=1, sizes(2)
      do i=1, sizes(1)
         if (myrank == 0) then
            buffer_dev(i,j,k) = 0.0
         else
            buffer_dev(i,j,k) = 1.0
         endif
      enddo
   enddo
enddo
! check buffer status
call acc_memcpy_from_device_f(c_loc(buffer_hos), c_loc(buffer_dev), bytes)
print '(A)', myrankstr//' buffer_dev array'
do k=1, sizes(3)
   do j=1, sizes(2)
      do i=1, sizes(1)
         print '(A,3I3,F5.1)', myrankstr//' i j k a:', i,j,k,buffer_hos(i,j,k)
      enddo
   enddo
enddo
call MPI_BARRIER(MPI_COMM_WORLD, ierr)

! MPI send from dev 1 to dev 0
!!$acc data deviceptr(buffer_dev)
!!$acc host_data use_device(buffer_dev)
if (myrank == 1) call MPI_SEND(buffer_dev, 6, MPI_REAL8, 0, 101, MPI_COMM_WORLD, ierr)
if (myrank == 0) call MPI_RECV(buffer_dev, 6, MPI_REAL8, 1, 101, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
!!$acc end host_data
!!$acc end data
call MPI_BARRIER(MPI_COMM_WORLD, ierr)

if (myrank == 0) then
   print '(A)', myrankstr//' check communication result'
   call acc_memcpy_from_device_f(c_loc(buffer_hos), c_loc(buffer_dev), bytes)
   print '(A)', myrankstr//' buffer_dev array'
   do k=1, sizes(3)
      do j=1, sizes(2)
         do i=1, sizes(1)
            print '(A,3I3,F5.1)', myrankstr//' i j k a:', i,j,k,buffer_hos(i,j,k)
         enddo
      enddo
   enddo
   if (any(int(buffer_hos) /= 1)) then
      print '(A)', myrankstr//' communication failed'
   else
      print '(A)', myrankstr//' communication done'
   endif
endif

call MPI_FINALIZE(ierr)
endprogram test_deviceptr_mpi

Note that “decorating” the send/recv calls with OpenACC directives has no effect: I obtain the segfault both with and without the directives.

The complete output I got is the following.

proc00001: devices number, mydev 2 1
proc00000: devices number, mydev 2 0
proc00001: buffer_dev array
proc00001: i j k a:  1  1  1  1.0
proc00001: i j k a:  1  2  1  1.0
proc00001: i j k a:  1  1  2  1.0
proc00001: i j k a:  1  2  2  1.0
proc00001: i j k a:  1  1  3  1.0
proc00001: i j k a:  1  2  3  1.0
proc00000: buffer_dev array
proc00000: i j k a:  1  1  1  0.0
proc00000: i j k a:  1  2  1  0.0
proc00000: i j k a:  1  1  2  0.0
proc00000: i j k a:  1  2  2  0.0
proc00000: i j k a:  1  1  3  0.0
proc00000: i j k a:  1  2  3  0.0
[adam:120514:0:120514] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xec7358)
==== backtrace (tid: 120514) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000013528 ucc_event_manager_init()  /build-result/src/hpcx-v2.19-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucc-0b4a0780918900fa497b1e6a65485247fecec4a2/src/schedule/ucc_schedule.c:38
 2 0x0000000000013528 ucc_coll_task_init()  /build-result/src/hpcx-v2.19-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucc-0b4a0780918900fa497b1e6a65485247fecec4a2/src/schedule/ucc_schedule.c:126
 3 0x0000000000011592 ucc_tl_shm_get_task()  /build-result/src/hpcx-v2.19-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucc-0b4a0780918900fa497b1e6a65485247fecec4a2/src/components/tl/shm/barrier/../tl_shm_coll.h:62
 4 0x0000000000011592 ucc_tl_shm_barrier_init()  /build-result/src/hpcx-v2.19-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucc-0b4a0780918900fa497b1e6a65485247fecec4a2/src/components/tl/shm/barrier/barrier.c:107
 5 0x0000000000017ab0 ucc_coll_init()  /build-result/src/hpcx-v2.19-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucc-0b4a0780918900fa497b1e6a65485247fecec4a2/src/coll_score/ucc_coll_score_map.c:132
 6 0x0000000000010066 ucc_collective_init()  /build-result/src/hpcx-v2.19-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucc-0b4a0780918900fa497b1e6a65485247fecec4a2/src/core/ucc_coll.c:234
 7 0x0000000000004ce3 mca_coll_ucc_barrier_init()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mca/coll/ucc/../../../../../ompi/mca/coll/ucc/coll_ucc_barrier.c:19
 8 0x0000000000004ce3 mca_coll_ucc_barrier()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mca/coll/ucc/../../../../../ompi/mca/coll/ucc/coll_ucc_barrier.c:32
 9 0x00000000000618f8 PMPI_Barrier()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mpi/c/profile/pbarrier.c:74
10 0x0000000000044e73 ompi_barrier_f()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mpi/fortran/mpif-h/profile/pbarrier_f.c:76
11 0x0000000000403676 MAIN_()  /home/stefano/fortran/FUNDAL/compilers_proofs/oac/test_deviceptr_mpi.f90:95
12 0x00000000004024f1 main()  ???:0
13 0x0000000000029d90 __libc_init_first()  ???:0
14 0x0000000000029e40 __libc_start_main()  ???:0
15 0x00000000004023e5 _start()  ???:0
=================================
[adam:120514] *** Process received signal ***
[adam:120514] Signal: Segmentation fault (11)
[adam:120514] Signal code:  (-6)
[adam:120514] Failing at address: 0x3e80001d6c2
[adam:120514] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f99c1219520]
[adam:120514] [ 1] /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ucc/lib/libucc.so.1(ucc_coll_task_init+0xf8)[0x7f99a0c13528]
[adam:120514] [ 2] /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ucc/lib/ucc/libucc_tl_shm.so(ucc_tl_shm_barrier_init+0x92)[0x7f99ae411592]
[adam:120514] [ 3] /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ucc/lib/libucc.so.1(ucc_coll_init+0x110)[0x7f99a0c17ab0]
[adam:120514] [ 4] /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ucc/lib/libucc.so.1(ucc_collective_init+0x1c6)[0x7f99a0c10066]
[adam:120514] [ 5] /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ompi/lib/openmpi/mca_coll_ucc.so(mca_coll_ucc_barrier+0x73)[0x7f99a1004ce3]
[adam:120514] [ 6] /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ompi/lib/libmpi.so.40(MPI_Barrier+0x38)[0x7f99c48618f8]
[adam:120514] [ 7] /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ompi/lib/libmpi_mpifh.so.40(MPI_Barrier_f08+0x13)[0x7f99c4c44e73]
[adam:120514] [ 8] a.out[0x403676]
[adam:120514] [ 9] a.out[0x4024f1]
[adam:120514] [10] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f99c1200d90]
[adam:120514] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f99c1200e40]
[adam:120514] [12] a.out[0x4023e5]
[adam:120514] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node adam exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I am wondering why I cannot mimic the C example in Fortran: is my test wrong, or are there differences between the C and Fortran implementations of OpenACC in the compiler?

Any suggestions are more than welcome,
Kind regards,
Stefano

Hi Stefano,

I wouldn’t use “acc_malloc” with Fortran code. It can “work” but is tricky to get right.

The problem here is that while buffer_dev’s data has a device address, the array itself is a host object. Hence, when it is passed to the MPI routines, it’s handled as if the data were on the host.

Instead, I’d recommend using a CUDA Fortran “device” array, so the data is passed in correctly. It’s also much easier to work with, since you don’t need the device memcpy calls or the c_f_pointer.

For example:

program test_deviceptr_mpi
use :: iso_c_binding
use :: mpi
use :: openacc
use :: cudafor

implicit none

integer                   :: sizes(3)=[1,2,3]  ! arrays sizes
real, device, pointer     :: buffer_dev(:,:,:) ! device work array
real, allocatable, target :: buffer_hos(:,:,:) ! host work array
integer                   :: ierr              ! error status
integer                   :: procs_number      ! MPI processes number
integer                   :: myrank            ! MPI current ID
character(:), allocatable :: myrankstr         ! MPI ID stringified
integer                   :: local_comm        ! MPI local communicator
integer                   :: local_rank        ! local MPI split ID
integer                   :: devices_number    ! devices number
integer                   :: mydev             ! device current ID
integer                   :: i, j, k           ! counters

! initialize MPI and devices env
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, procs_number, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
call MPI_COMM_SPLIT_TYPE(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, local_comm, ierr)
call MPI_COMM_RANK(local_comm, local_rank, ierr)
myrankstr = repeat(' ',5)
write(myrankstr, '(I5.5)') myrank
myrankstr = 'proc'//trim(adjustl(myrankstr))//':'
devices_number = acc_get_num_devices(acc_device_nvidia)
mydev = mod(local_rank, devices_number)
call acc_set_device_num(mydev, acc_device_nvidia)
call acc_init(acc_device_nvidia)

print '(A,2I2)', myrankstr//' devices number, mydev', devices_number, mydev
call MPI_BARRIER(MPI_COMM_WORLD, ierr)

! allocate work arrays on host and devices
allocate(buffer_hos(sizes(1),sizes(2),sizes(3))) ; buffer_hos = -1.0
allocate(buffer_dev(sizes(1),sizes(2),sizes(3)))

! prepare buffer_dev array
!$acc parallel loop collapse(3) deviceptr(buffer_dev)
do k=1, sizes(3)
   do j=1, sizes(2)
      do i=1, sizes(1)
         if (myrank == 0) then
            buffer_dev(i,j,k) = 0.0
         else
            buffer_dev(i,j,k) = 1.0
         endif
      enddo
   enddo
enddo
! check buffer status
buffer_hos=buffer_dev
print '(A)', myrankstr//' buffer_dev array'
do k=1, sizes(3)
   do j=1, sizes(2)
      do i=1, sizes(1)
         print '(A,3I3,F5.1)', myrankstr//' i j k a:', i,j,k,buffer_hos(i,j,k)
      enddo
   enddo
enddo
call MPI_BARRIER(MPI_COMM_WORLD, ierr)

! MPI send from dev 1 to dev 0
!!$acc data deviceptr(buffer_dev)
!!$acc host_data use_device(buffer_dev)
if (myrank == 1) call MPI_SEND(buffer_dev, 6, MPI_REAL8, 0, 101, MPI_COMM_WORLD, ierr)
if (myrank == 0) call MPI_RECV(buffer_dev, 6, MPI_REAL8, 1, 101, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
!!$acc end host_data
!!$acc end data
call MPI_BARRIER(MPI_COMM_WORLD, ierr)

if (myrank == 0) then
   print '(A)', myrankstr//' check communication result'
   buffer_hos=buffer_dev
   print '(A)', myrankstr//' buffer_dev array'
   do k=1, sizes(3)
      do j=1, sizes(2)
         do i=1, sizes(1)
            print '(A,3I3,F5.1)', myrankstr//' i j k a:', i,j,k,buffer_hos(i,j,k)
         enddo
      enddo
   enddo
   if (any(int(buffer_hos) /= 1)) then
      print '(A)', myrankstr//' communication failed'
   else
      print '(A)', myrankstr//' communication done'
   endif
endif

call MPI_FINALIZE(ierr)
endprogram test_deviceptr_mpi
% mpif90 test_deviceptr_mpi.F90 -acc -cuda
% mpirun -np 2 a.out
proc00000: devices number, mydev 2 0
proc00001: devices number, mydev 2 1
proc00000: buffer_dev array
proc00000: i j k a:  1  1  1  0.0
proc00000: i j k a:  1  2  1  0.0
proc00000: i j k a:  1  1  2  0.0
proc00000: i j k a:  1  2  2  0.0
proc00000: i j k a:  1  1  3  0.0
proc00000: i j k a:  1  2  3  0.0
proc00001: buffer_dev array
proc00001: i j k a:  1  1  1  1.0
proc00001: i j k a:  1  2  1  1.0
proc00001: i j k a:  1  1  2  1.0
proc00001: i j k a:  1  2  2  1.0
proc00001: i j k a:  1  1  3  1.0
proc00001: i j k a:  1  2  3  1.0
proc00000: check communication result
proc00000: buffer_dev array
proc00000: i j k a:  1  1  1  1.0
proc00000: i j k a:  1  2  1  1.0
proc00000: i j k a:  1  1  2  1.0
proc00000: i j k a:  1  2  2  1.0
proc00000: i j k a:  1  1  3  1.0
proc00000: i j k a:  1  2  3  1.0
proc00000: communication done

Hope this helps,
Mat

It may be a problem with UCC (which is now enabled in 24.7 and is quite buggy).
Your code works on my system with 24.7, but try setting this variable:
export OMPI_MCA_coll_ucc_enable=0
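If you prefer, the same thing can be done per run on the mpirun command line; Open MPI maps OMPI_MCA_<name> environment variables to the corresponding --mca option, so the following should be equivalent to the export above:

mpirun --mca coll_ucc_enable 0 -np 2 ./a.out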

mpirun --mca coll ^hcoll -np 2 ./a.out
proc00001: devices number, mydev 4 1
proc00000: devices number, mydev 4 0
proc00001: buffer_dev array
proc00000: buffer_dev array
proc00000: i j k a: 1 1 1 0.0
proc00000: i j k a: 1 2 1 0.0
proc00000: i j k a: 1 1 2 0.0
proc00000: i j k a: 1 2 2 0.0
proc00000: i j k a: 1 1 3 0.0
proc00000: i j k a: 1 2 3 0.0
proc00001: i j k a: 1 1 1 1.0
proc00001: i j k a: 1 2 1 1.0
proc00001: i j k a: 1 1 2 1.0
proc00001: i j k a: 1 2 2 1.0
proc00001: i j k a: 1 1 3 1.0
proc00001: i j k a: 1 2 3 1.0
proc00000: check communication result
proc00000: buffer_dev array
proc00000: i j k a: 1 1 1 1.0
proc00000: i j k a: 1 2 1 1.0
proc00000: i j k a: 1 1 2 1.0
proc00000: i j k a: 1 2 2 1.0
proc00000: i j k a: 1 1 3 1.0
proc00000: i j k a: 1 2 3 1.0
proc00000: communication done

Hi Mat,

thank you very much for the lightning-fast response; it is appreciated.

I wouldn’t use “acc_malloc” with Fortran code. It can “work” but is tricky to get right.

It works very well, indeed. The only issue we have found is in doing MPI communication directly with device memory.

Instead, I’d recommend using a CUDA Fortran “device” array, so the data is passed in correctly. It’s also much easier to work with, since you don’t need the device memcpy calls or the c_f_pointer.

You are right, CUDA Fortran is much easier than mixing C/Fortran with OpenACC directives, and we do use it. However, our goal here is to adhere as much as possible to OpenACC in order to write code that is more portable than CUDA Fortran, the device attribute being a vendor extension of Fortran.
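For reference, the fully-OpenACC pattern we are aiming for looks roughly like the sketch below: a plain allocatable host array managed with data directives, and host_data use_device exposing its device copy to the (CUDA-aware) MPI calls. This is only an illustration of the idiom, not the acc_malloc test above; names and sizes are arbitrary.

program test_hostdata_mpi
use :: mpi
use :: openacc

implicit none

integer           :: sizes(3)=[1,2,3]  ! arrays sizes
real, allocatable :: buffer(:,:,:)     ! work array (host copy + device copy via data directive)
integer           :: ierr              ! error status
integer           :: procs_number      ! MPI processes number
integer           :: myrank            ! MPI current ID
integer           :: local_comm        ! MPI local communicator
integer           :: local_rank        ! local MPI split ID
integer           :: devices_number    ! devices number
integer           :: mydev             ! device current ID
integer           :: i, j, k           ! counters

! initialize MPI and devices env
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, procs_number, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
call MPI_COMM_SPLIT_TYPE(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, local_comm, ierr)
call MPI_COMM_RANK(local_comm, local_rank, ierr)
devices_number = acc_get_num_devices(acc_device_nvidia)
mydev = mod(local_rank, devices_number)
call acc_set_device_num(mydev, acc_device_nvidia)

! allocate on the host and create the device copy with a standard data directive
allocate(buffer(sizes(1),sizes(2),sizes(3)))
!$acc enter data create(buffer)

! populate the device copy
!$acc parallel loop collapse(3) present(buffer)
do k=1, sizes(3)
   do j=1, sizes(2)
      do i=1, sizes(1)
         buffer(i,j,k) = real(myrank)
      enddo
   enddo
enddo

! host_data exposes the device address of buffer to the CUDA-aware MPI calls
!$acc host_data use_device(buffer)
if (myrank == 1) call MPI_SEND(buffer, product(sizes), MPI_REAL, 0, 101, MPI_COMM_WORLD, ierr)
if (myrank == 0) call MPI_RECV(buffer, product(sizes), MPI_REAL, 1, 101, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
!$acc end host_data

! bring the device copy back to the host and check it
!$acc update self(buffer)
if (myrank == 0) print '(A,6F5.1)', 'rank 0 received:', buffer

!$acc exit data delete(buffer)
call MPI_FINALIZE(ierr)
endprogram test_hostdata_mpi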

Thank you again, Mat.

Stefano

Hi Massimiliano,

thank you very much for your support.

I have just tried using your export, but I get the same error. I think something in my “env” is different from yours; I report below the main information about my testing environment.

Thu Sep  5 19:22:06 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.01              Driver Version: 555.99         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070        On  |   00000000:01:00.0  On |                  N/A |
|  0%   42C    P8             11W /  200W |    1205MiB /  12282MiB |     12%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4070        On  |   00000000:02:00.0 Off |                  N/A |
|  0%   36C    P8              8W /  200W |       0MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        29      G   /Xwayland                                   N/A      |
|    1   N/A  N/A        29      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

CUDA Driver Version:           12050

Device Number:                 0
Device Name:                   NVIDIA GeForce RTX 4070
Device Revision Number:        8.9
Global Memory Size:            12878086144
Number of Multiprocessors:     46
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    2475 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             10501 MHz
Memory Bus Width:              192 bits
L2 Cache Size:                 37748736 bytes
Max Threads Per SMP:           1536
Async Engines:                 1
Unified Addressing:            Yes
Managed Memory:                Yes
Concurrent Managed Memory:     No
Preemption Supported:          Yes
Cooperative Launch:            Yes
Unified Memory:                No
Memory Models Flags:           -gpu=mem:separate
Default Target:                cc89

I am testing on my workstation; I should probably also test this on the Cineca cluster.

Could you post the output of ldd on your binary?

Could you post the output of ldd on your binary?

Sure, do you think I have messed up my environment?

┌╼ stefano@adam(07:23 PM Thu Sep 05) on main [+!?] desk {nvidia-24 - nvidia SDK 24.7 environment}
├───╼ ~/fortran/FUNDAL 14 files, 188Kb
└──────╼ ldd a.out
        linux-vdso.so.1 (0x00007ffd663ca000)
        libmpi_usempif08.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ompi/lib/libmpi_usempif08.so.40 (0x00007fe539000000)
        libmpi_usempi_ignore_tkr.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ompi/lib/libmpi_usempi_ignore_tkr.so.40 (0x00007fe538c00000)
        libmpi_mpifh.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ompi/lib/libmpi_mpifh.so.40 (0x00007fe538800000)
        libmpi.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ompi/lib/libmpi.so.40 (0x00007fe538400000)
        libacchost.so => /opt/nvidia/hpc_sdk/Linux_x86_64/2024/compilers/lib/libacchost.so (0x00007fe538000000)
        libaccdevaux.so => /opt/nvidia/hpc_sdk/Linux_x86_64/2024/compilers/lib/libaccdevaux.so (0x00007fe537c00000)
        libaccdevice.so => /opt/nvidia/hpc_sdk/Linux_x86_64/2024/compilers/lib/libaccdevice.so (0x00007fe537800000)
        libcudadevice.so => /opt/nvidia/hpc_sdk/Linux_x86_64/2024/compilers/lib/libcudadevice.so (0x00007fe537400000)
        libnvf.so => /opt/nvidia/hpc_sdk/Linux_x86_64/2024/compilers/lib/libnvf.so (0x00007fe536c00000)
        libnvomp.so => /opt/nvidia/hpc_sdk/Linux_x86_64/2024/compilers/lib/libnvomp.so (0x00007fe535a00000)
        libnvcpumath.so => /opt/nvidia/hpc_sdk/Linux_x86_64/2024/compilers/lib/libnvcpumath.so (0x00007fe535400000)
        libnvc.so => /opt/nvidia/hpc_sdk/Linux_x86_64/2024/compilers/lib/libnvc.so (0x00007fe535000000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe534dd7000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fe5393c6000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fe5392df000)
        libopen-rte.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ompi/lib/libopen-rte.so.40 (0x00007fe534a00000)
        libopen-pal.so.40 => /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ompi/lib/libopen-pal.so.40 (0x00007fe534600000)
        libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fe5392d8000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fe5392bc000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fe5392b7000)
        libnvhpcatm.so => /opt/nvidia/hpc_sdk/Linux_x86_64/2024/compilers/lib/libnvhpcatm.so (0x00007fe534200000)
        libatomic.so.1 => /lib/x86_64-linux-gnu/libatomic.so.1 (0x00007fe5392ab000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fe5392a6000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe5392a1000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fe5393f5000)

The output looks good.
It should not matter, but could you try to:

source /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/hpcx-init-ompi.sh

Done. The output is now slightly different:

┌╼ stefano@adam(07:54 PM Thu Sep 05) on main [+!?] desk {nvidia-24 - nvidia SDK 24.7 environment}
├───╼ ~/fortran/FUNDAL 14 files, 188Kb
└──────╼ mpirun -n 2 a.out
proc00000: devices number, mydev 2 0
proc00001: devices number, mydev 2 1
proc00001: buffer_dev array
proc00001: i j k a:  1  1  1  1.0
proc00001: i j k a:  1  2  1  1.0
proc00001: i j k a:  1  1  2  1.0
proc00001: i j k a:  1  2  2  1.0
proc00001: i j k a:  1  1  3  1.0
proc00001: i j k a:  1  2  3  1.0
proc00000: buffer_dev array
proc00000: i j k a:  1  1  1  0.0
proc00000: i j k a:  1  2  1  0.0
proc00000: i j k a:  1  1  2  0.0
proc00000: i j k a:  1  2  2  0.0
proc00000: i j k a:  1  1  3  0.0
proc00000: i j k a:  1  2  3  0.0
[adam:147102:0:147102] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0xb092fa000)
==== backtrace (tid: 147102) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000001a07cd __nss_database_lookup()  ???:0
 2 0x000000000001bb9c uct_am_short_fill_data()  /build-result/src/hpcx-v2.19-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-7bb2722ff2187a0cad557ae4a6afa090569f83fb/src/uct/base/uct_iface.h:995
 3 0x000000000001bb9c uct_mm_ep_am_short()  /build-result/src/hpcx-v2.19-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-7bb2722ff2187a0cad557ae4a6afa090569f83fb/src/uct/sm/mm/base/mm_ep.c:405
 4 0x00000000000965cc uct_ep_am_short()  /build-result/src/hpcx-v2.19-gcc-mlnx_ofed-redhat7-cuda12-x86_64/ucx-7bb2722ff2187a0cad557ae4a6afa090569f83fb/src/uct/api/uct.h:2989
 5 0x0000000000005ac6 mca_pml_ucx_send_nbr()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:909
 6 0x0000000000005ac6 mca_pml_ucx_send()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:949
 7 0x000000000007295d PMPI_Send()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mpi/c/profile/psend.c:81
 8 0x000000000004c5a3 ompi_send_f()  /var/jenkins/workspace/rel_nv_lib_hpcx_cuda12_x86_64/work/rebuild_ompi/ompi/build/ompi/mpi/fortran/mpif-h/profile/psend_f.c:78
 9 0x00000000004036e6 MAIN_()  /home/stefano/fortran/FUNDAL/compilers_proofs/oac/test_deviceptr_mpi.f90:91
10 0x00000000004024f1 main()  ???:0
11 0x0000000000029d90 __libc_init_first()  ???:0
12 0x0000000000029e40 __libc_start_main()  ???:0
13 0x00000000004023e5 _start()  ???:0
=================================
[adam:147102] *** Process received signal ***
[adam:147102] Signal: Segmentation fault (11)
[adam:147102] Signal code:  (-6)
[adam:147102] Failing at address: 0x3e800023e9e
[adam:147102] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f3fa5619520]
[adam:147102] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a07cd)[0x7f3fa57777cd]
[adam:147102] [ 2] /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ucx/mt/lib/libuct.so.0(uct_mm_ep_am_short+0x8c)[0x7f3fa96e2b9c]
[adam:147102] [ 3] /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ucx/mt/lib/libucp.so.0(ucp_tag_send_nbx+0x9c)[0x7f3fa97995cc]
[adam:147102] [ 4] /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf6)[0x7f3f8f405ac6]
[adam:147102] [ 5] /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ompi/lib/libmpi.so.40(PMPI_Send+0x2d)[0x7f3fa8c7295d]
[adam:147102] [ 6] /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/comm_libs/12.5/hpcx/hpcx-2.19/ompi/lib/libmpi_mpifh.so.40(mpi_send+0x53)[0x7f3fa904c5a3]
[adam:147102] [ 7] a.out[0x4036e6]
[adam:147102] [ 8] a.out[0x4024f1]
[adam:147102] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f3fa5600d90]
[adam:147102] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f3fa5600e40]
[adam:147102] [11] a.out[0x4023e5]
[adam:147102] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node adam exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Dear Massimiliano and Mat (@mfatica and @MatColgrove), after a colleague verified that my test works on other clusters with different GPUs, I searched for a possible issue with my own GPUs. I have found that a suggestion from Mat in another thread solves my issue; however, I do not fully understand why.

In that old thread of Mat’s there are some hints that the GeForce RTX 40xy series differs from the more compute-oriented devices; in particular, it seems that the 40xy series does not have GPUDirect RDMA (I am not sure what this implies). Following one of Mat’s suggestions, I tried setting the following environment variable:

export UCX_MEMTYPE_CACHE=n
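Alternatively, instead of exporting it globally, the variable should be forwardable to just the MPI ranks with Open MPI’s -x option, e.g.:

mpirun -x UCX_MEMTYPE_CACHE=n -np 2 a.out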

With this set, the test works fine:

┌╼ stefano@adam(01:08 PM Fri Sep 06) on main [+!?] desk {nvidia-24 - nvidia SDK 24.7 environment}
├───╼ ~/fortran/FUNDAL 12 files, 172Kb
└──────╼ mpirun -np 2 a.out
proc00001: devices number, mydev 2 1
proc00000: devices number, mydev 2 0
proc00001: buffer_dev array
proc00001: i j k a:  1  1  1  1.0
proc00001: i j k a:  1  2  1  1.0
proc00001: i j k a:  1  1  2  1.0
proc00001: i j k a:  1  2  2  1.0
proc00001: i j k a:  1  1  3  1.0
proc00001: i j k a:  1  2  3  1.0
proc00000: buffer_dev array
proc00000: i j k a:  1  1  1  0.0
proc00000: i j k a:  1  2  1  0.0
proc00000: i j k a:  1  1  2  0.0
proc00000: i j k a:  1  2  2  0.0
proc00000: i j k a:  1  1  3  0.0
proc00000: i j k a:  1  2  3  0.0
proc00000: check communication result
proc00000: buffer_dev array
proc00000: i j k a:  1  1  1  1.0
proc00000: i j k a:  1  2  1  1.0
proc00000: i j k a:  1  1  2  1.0
proc00000: i j k a:  1  2  2  1.0
proc00000: i j k a:  1  1  3  1.0
proc00000: i j k a:  1  2  3  1.0
proc00000: communication done

This could be enough to use my workstation for development purposes. However, I would like to understand what is happening. I have a few questions:

  • what does setting UCX_MEMTYPE_CACHE=n actually do?
  • am I still performing direct multi-GPU communication and only disabling a “cache optimization” (see this), or am I doing something “deeper” to the memory handling (e.g. disabling the direct communication of device memory)?
  • the test works without the OpenACC directives decorating the MPI send/receive: I do not understand whether this is a “lucky coincidence” and the directives are in fact required by the OpenACC specification, or whether they are simply not needed here.

Thank you very much for your kind support, it is appreciated.

Stefano

I’m running UCX with GPU memory and getting a segfault, why?

Most likely UCX does not detect that the pointer is GPU memory and tries to access it from the CPU. It can happen if UCX is not compiled with GPU support, or fails to load the CUDA or ROCm modules due to missing library paths or a version mismatch. Please run ucx_info -d | grep cuda or ucx_info -d | grep rocm to check for UCX GPU support.

In some cases, the internal memory type cache can misdetect GPU memory as host memory, also leading to invalid memory access. This cache can be disabled by setting UCX_MEMTYPE_CACHE=n.
