Segfault with MPI_Send + acc_malloc

Hi,

I’m on a node with two GPUs attached, running the CUDA-aware MPI that ships with the PGI toolkit, and I bumped into a weird segfault that I managed to pin down to the following minimal reproducer:

#include <mpi.h>
#include "openacc.h"
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

#include "mpi-ext.h" /* Needed for CUDA-aware check */

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  if (1 == MPIX_Query_cuda_support()) {
      printf("This MPI library has CUDA-aware support.\n");
  } else {
      printf("This MPI library does not have CUDA-aware support.\n");
  }

  int rank = -1; 
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  //printf("rank=%d\n", rank);
  
  int ngpus = acc_get_num_devices(acc_device_nvidia);
  int devicenum = (rank)%(ngpus);
  //printf("devicenum=%d\n", devicenum);
 
  acc_set_device_num(devicenum,acc_device_nvidia);
  acc_init(acc_device_nvidia);
  
  //int buffer[10];
  int *buffer = acc_malloc((size_t)10*sizeof(int));
  for (int i=0; i<10; i++) buffer[i] = i;
  #pragma acc enter data copyin(buffer[:10])
  
  if (rank == 0) {   
    MPI_Send(acc_deviceptr(buffer), 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
  }
  else {
    MPI_Recv(acc_deviceptr(buffer), 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    #pragma acc update host(buffer[:10])
    printf("rank=1, %d\n", buffer[2]);
  }

  //acc_free(buffer);
  #pragma acc exit data delete(buffer)
  
  MPI_Finalize();
}

If I comment out

  int *buffer = acc_malloc((size_t)10*sizeof(int));

and replace it with

  int buffer[10];

then it works with no segfault.

Here’s the error trace:

[pgi-nc12-openacc:43092] *** Process received signal ***
[pgi-nc12-openacc:43092] Signal: Segmentation fault (11)
[pgi-nc12-openacc:43092] Signal code: Invalid permissions (2)
[pgi-nc12-openacc:43092] Failing at address: 0x1f03a5a000
[pgi-nc12-openacc:43093] *** Process received signal ***
[pgi-nc12-openacc:43093] Signal: Segmentation fault (11)
[pgi-nc12-openacc:43093] Signal code: Invalid permissions (2)
[pgi-nc12-openacc:43093] Failing at address: 0x1f03a5a000
[pgi-nc12-openacc:43092] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f1056fb0890]
[pgi-nc12-openacc:43092] [ 1] ./mfe[0x401381]
[pgi-nc12-openacc:43092] [ 2] [pgi-nc12-openacc:43093] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f752da52890]
[pgi-nc12-openacc:43093] [ 1] ./mfe[0x401381]
[pgi-nc12-openacc:43093] [ 2] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f752cb64b97]
[pgi-nc12-openacc:43093] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f10560c2b97]
[pgi-nc12-openacc:43092] [ 3] ./mfe[0x4011ea]
[pgi-nc12-openacc:43092] *** End of error message ***
[ 3] ./mfe[0x4011ea]
[pgi-nc12-openacc:43093] *** End of error message ***

In particular, note the “Invalid permissions” signal code.

Everything is open source and running on a VM, so if you’re at a loss (I am :-) ), I’m willing to provide SSH access to the machine along with the commands to reproduce.

BTW, I’m using pgcc 19.10.

Thanks a lot!

Hi fablup,

“acc_malloc” returns a device pointer, so it can’t be dereferenced on the host; that’s why you’re getting the segv. Instead, only use “buffer” on the device. For example:

% cat test_acc_malloc.c

#include <stdio.h>
#include <mpi.h>
#include "openacc.h"
#include "mpi-ext.h" /* Needed for CUDA-aware check */

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  if (1 == MPIX_Query_cuda_support()) {
      printf("This MPI library has CUDA-aware support.\n");
  } else {
      printf("This MPI library does not have CUDA-aware support.\n");
  }

  int rank = -1;
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  //printf("rank=%d\n", rank);

  int ngpus = acc_get_num_devices(acc_device_nvidia);
  int devicenum = (rank)%(ngpus);
  //printf("devicenum=%d\n", devicenum);

  acc_set_device_num(devicenum,acc_device_nvidia);
  acc_init(acc_device_nvidia);

  //int buffer[10];
  int *buffer = (int *) acc_malloc((size_t)10*sizeof(int));
  if (rank == 0) {
    /* "deviceptr" tells the compiler that buffer is already a device address */
    #pragma acc parallel loop deviceptr(buffer)
    for (int i=0; i<10; i++) buffer[i] = i;
  }

  if (rank == 0) {
    MPI_Send(buffer, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
  }
  else {
    MPI_Recv(buffer, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#pragma acc serial deviceptr(buffer)
{
    printf("rank=1, %d\n", buffer[2]);
}
  }
  acc_free(buffer);

  MPI_Finalize();
}

% mpicc -ta=tesla -Minfo=accel test_acc_malloc.c
main:
     30, Generating Tesla code
         31, #pragma acc loop gang, vector(10) /* blockIdx.x threadIdx.x */
     40, Accelerator serial kernel generated
         Generating Tesla code
% mpirun -np 2 a.out
This MPI library has CUDA-aware support.
This MPI library has CUDA-aware support.
rank=1, 2
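
As an aside, if you’d rather fill the acc_malloc’d buffer from host code instead of from a compute region, the OpenACC API also provides acc_memcpy_to_device and acc_memcpy_from_device for explicit host/device copies. A minimal sketch of that variant (just illustrating the idea; I haven’t run this exact program):

#include <stdio.h>
#include <mpi.h>
#include "openacc.h"

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  int rank = -1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int ngpus = acc_get_num_devices(acc_device_nvidia);
  acc_set_device_num(rank % ngpus, acc_device_nvidia);
  acc_init(acc_device_nvidia);

  /* Host staging array; "buffer" itself stays device-only. */
  int tmp[10];
  int *buffer = (int *) acc_malloc((size_t)10*sizeof(int));

  if (rank == 0) {
    for (int i = 0; i < 10; i++) tmp[i] = i;
    /* Explicit host -> device copy; no present table involved. */
    acc_memcpy_to_device(buffer, tmp, (size_t)10*sizeof(int));
    MPI_Send(buffer, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
  }
  else {
    MPI_Recv(buffer, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Explicit device -> host copy so the host can print the result. */
    acc_memcpy_from_device(tmp, buffer, (size_t)10*sizeof(int));
    printf("rank=1, %d\n", tmp[2]);
  }
  acc_free(buffer);

  MPI_Finalize();
}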

Though the better solution is to use the “host_data” directive to pass the device pointer to the MPI calls. For example:

% cat test_host_data.c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include "openacc.h"
#include "mpi-ext.h" /* Needed for CUDA-aware check */

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  if (1 == MPIX_Query_cuda_support()) {
      printf("This MPI library has CUDA-aware support.\n");
  } else {
      printf("This MPI library does not have CUDA-aware support.\n");
  }

  int rank = -1;
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  //printf("rank=%d\n", rank);

  int ngpus = acc_get_num_devices(acc_device_nvidia);
  int devicenum = (rank)%(ngpus);
  //printf("devicenum=%d\n", devicenum);

  acc_set_device_num(devicenum,acc_device_nvidia);
  acc_init(acc_device_nvidia);

  int *buffer = (int *) malloc((size_t)10*sizeof(int));
  if (rank == 0) {
     for (int i=0; i<10; i++) buffer[i] = i;
  }
#pragma acc enter data copyin(buffer[:10])

  if (rank == 0) {
#pragma acc host_data use_device(buffer)
{
    MPI_Send(buffer, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
}
  }
  else {
#pragma acc host_data use_device(buffer)
{
    MPI_Recv(buffer, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
#pragma acc update self(buffer[:10])
    printf("rank=1, %d\n", buffer[2]);
  }
#pragma acc exit data delete(buffer)
  free(buffer);

  MPI_Finalize();
}

% mpicc -ta=tesla -Minfo=accel test_host_data.c
main:
     33, Generating enter data copyin(buffer[:10])
     45, Generating update self(buffer[:10])
     49, Generating exit data delete(buffer[:1])
% mpirun -np 2 a.out
This MPI library has CUDA-aware support.
This MPI library has CUDA-aware support.
rank=1, 2

Hope this helps,
Mat

Sorry, there was a copy-paste leftover in the original question. I forgot:

  1. to comment out the acc_malloc call and un-comment the line above it (int buffer[10]), and
  2. to place a parallel loop over the for loop initializing buffer.

Sorry about that!

That said, I managed to fix my bug following your guidelines; for me the key was using the deviceptr(buffer) clause. Thanks!

I’m fairly surprised that OpenACC doesn’t behave correctly without that clause. I’d have expected that, on entering a parallel region, OpenACC would check each pointer and:

  • it first looks up an internal map to see whether it’s a host pointer, and if so, returns the corresponding device pointer
  • then, if there’s no entry in the map, it checks whether what was provided is already a device pointer (all device pointers returned by acc_malloc and its siblings should be known, right?), and in that case just uses it as-is

I’m fairly confused by this behavior. Could you shed some light on it, or point me to a relevant document?

  • it first looks up an internal map to see whether it’s a host pointer, and if so, returns the corresponding device pointer
  • then, if there’s no entry in the map, it checks whether what was provided is already a device pointer (all device pointers returned by acc_malloc and its siblings should be known, right?), and in that case just uses it as-is

For compiler-runtime-managed data, i.e. data managed via the data directives or the acc_copyin/acc_create style API calls, the runtime maintains a “present” table which maps each host copy to the device’s mirrored copy.

Device pointers returned by acc_malloc are user-managed and have no associated host copy, so no entry is made in the present table. Hence when using “acc_malloc” (or cudaMalloc when targeting NVIDIA GPUs), it’s the programmer’s responsibility to manage the data, including informing the compiler that the variable is a device pointer via the “deviceptr” clause.
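
If it helps to see the present table in action, here’s a minimal (untested) standalone sketch using the acc_is_present and acc_deviceptr API routines:

#include <stdio.h>
#include <stdlib.h>
#include "openacc.h"

int main(void)
{
  int *mapped = (int *) malloc((size_t)10*sizeof(int));
  int *raw    = (int *) acc_malloc((size_t)10*sizeof(int));

  #pragma acc enter data create(mapped[:10])

  /* "mapped" got a present-table entry from the data directive,
     so the runtime can translate its host address. */
  printf("mapped present? %d, device addr %p\n",
         acc_is_present(mapped, (size_t)10*sizeof(int)),
         acc_deviceptr(mapped));

  /* The acc_malloc result has no host counterpart, so there is
     nothing for the present table to find. */
  printf("raw present?    %d\n",
         acc_is_present(raw, (size_t)10*sizeof(int)));

  #pragma acc exit data delete(mapped[:10])
  acc_free(raw);
  free(mapped);
  return 0;
}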

If you do want the runtime data management, please use the second example I provided.

-Mat