[CUF] cuda-aware mpi send/recv segfault with cuda-memcheck

I’m trying to exchange device data between two processes on the same node.

Here is my code:

program main
  use mpi
  implicit none
  integer :: rank, ierr, tmp2
  integer, dimension(MPI_STATUS_SIZE) :: status
  integer, device, allocatable :: tmp(:)

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  allocate(tmp(1))
  if (rank .eq. 0) then
     tmp(1) = 56
     call mpi_send(tmp, 1, MPI_INTEGER, 1, 42, MPI_COMM_WORLD, ierr)
  end if
  if (rank .eq. 1) then
     call mpi_recv(tmp, 1, MPI_INTEGER, 0, 42, MPI_COMM_WORLD, status, ierr)
     tmp2 = tmp(1)
     print *, tmp2
  end if
  deallocate(tmp)
  call mpi_finalize(ierr)
end program main

Compiled with:

mpif90 -Mcuda -g -O3 bug.f90 -o bug

$ ompi_info --version
Open MPI v2.1.2
$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
$ mpif90 --version
pgfortran 18.4-0 64-bit target on x86-64 Linux -tp haswell
$ cuda-memcheck --version
CUDA-MEMCHECK version 10.0.130 ID:(46)

Here is what happens when I run the code:

$ mpirun -np 2 ./bug

$ mpirun -np 1 cuda-memcheck ./bug : -np 1 ./bug
[:23497] *** Process received signal ***
[:23497] Signal: Segmentation fault (11)
[:23497] Signal code: Address not mapped (1)
[:23497] Failing at address: 0x198
========= Error: process didn’t terminate successfully
========= The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under cuda-gdb or Nsight Eclipse Edition to catch host side errors.
========= No CUDA-MEMCHECK results found

Primary job terminated normally, but 1 process returned
a non-zero exit code… Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[32240,1],0]
Exit code: 1

I know there is no reason to call cuda-memcheck here, but my real code is obviously composed of CUDA kernels. When running under cuda-gdb with ‘set cuda memcheck on’, no issue arises. Am I missing something obvious here? Thanks.

Hi dindon,

I think the problem here is that your program requires 2 ranks and will error with 1 rank.

% mpirun -np 1 bug
[sky4:199439] *** An error occurred in MPI_Send
[sky4:199439] *** reported by process [140730681065473,47304769798144]
[sky4:199439] *** on communicator MPI_COMM_WORLD
[sky4:199439] *** MPI_ERR_RANK: invalid rank
[sky4:199439] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sky4:199439] ***    and potentially your MPI job)

% mpirun -np 2 cuda-memcheck bug
========= ERROR SUMMARY: 0 errors
========= ERROR SUMMARY: 0 errors

Hope this helps,

Hi Mat,

In the command I use to run my program, one process calls ‘./bug’ while the other calls ‘cuda-memcheck ./bug’ (two processes in total):

$ mpirun -np 1 cuda-memcheck ./bug : -np 1 ./bug

For the same effect, I could have run:

$ mpirun -np 2 cuda-memcheck ./bug

Apologies that I missed that. The program still works for me, though.

% mpirun -np 1 cuda-memcheck ./bug : -np 1 ./bug
========= ERROR SUMMARY: 0 errors

I’m using OpenMPI on a Linux system with a V100.

Which MPI are you using? What is the output of the command ‘pgaccelinfo’?


I’m using OpenMPI 2.1.2 (shipped with PGI 18.4)

$ pgaccelinfo

CUDA Driver Version: 10000
NVRM version: NVIDIA UNIX x86_64 Kernel Module 410.57 Tue Sep 18 23:25:09 CDT 2018

Device Number: 0
Device Name: GeForce GTX 1070
Device Revision Number: 6.1
Global Memory Size: 8513978368
Number of Multiprocessors: 15
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1784 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 4004 MHz
Memory Bus Width: 256 bits
L2 Cache Size: 2097152 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Multi-Device: Yes
PGI Compiler Option: -ta=tesla:cc60

Just gave the code a try on a GTX 1070 here using PGI 18.4 and the OpenMPI that ships with the 18.4 compilers. For good or bad, it still works for me. Though, I only have a CUDA 8.0 driver on this system.

I’m guessing it’s something on your system, such as a mismatch between the cuda-memcheck version and your CUDA driver version.

Looks like you’re using a CUDA 10.0 driver. Which version of cuda-memcheck are you using? Can you try getting the CUDA 10.0 SDK and seeing if that version fixes the problem?
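A quick way to cross-check the versions involved is with a few standard Linux/CUDA commands (a diagnostic sketch; exact output formats vary by driver and toolkit version):

```shell
# CUDA driver version reported by the loaded kernel module.
cat /proc/driver/nvidia/version

# Version of the cuda-memcheck binary actually on your PATH.
cuda-memcheck --version

# Which CUDA runtime library the compiled binary resolves to at run time.
ldd ./bug | grep -i libcudart
```

If the driver, the cuda-memcheck binary, and the runtime library come from different toolkit generations, that mismatch is a likely suspect.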


You’re right. I used cuda-memcheck from the CUDA 10 toolkit while the code was running against the CUDA 8 libraries.

My PGI install doesn’t ship with CUDA 10, and I couldn’t make it use my own installation of CUDA 10, but after getting cuda-memcheck from the CUDA 8 toolkit, the code works.

Thanks Mat.

Hi dindon,

Starting with 18.4, you can set the environment variable “CUDA_HOME” to point at your CUDA installation and the compiler will use that version. Just don’t pass a CUDA version option in the “-ta” or “-Mcuda” compiler flags, since that would override CUDA_HOME.
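As a sketch (the install path below is hypothetical; adjust it to wherever your CUDA 10.0 toolkit lives):

```shell
# Point the PGI compilers at a separately installed CUDA toolkit.
export CUDA_HOME=/opt/cuda-10.0

# Note: plain -Mcuda, with no version suffix such as -Mcuda=cuda9.2,
# because a version suffix would override CUDA_HOME.
mpif90 -Mcuda -g -O3 bug.f90 -o bug
```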

Granted, CUDA 10.0 came out six months after PGI 18.4, so there may be other issues, but you can give this a try as well.