CUDA-aware MPI, derived data type issue

Hello, I have some questions about CUDA-aware MPI and am not sure how/why it works or does not work.
Thanks.
I illustrate my issue with a simplified example. I define a struct f with
two double pointers, f.a and f.b, each allocated on the host as an array of 6 doubles.
I use an MPI communicator of size 2, with the first 3 entries of f.a and f.b of rank 0
sent to the last 3 entries of f.a and f.b of rank 1, and vice versa.
First I wrote a host-only version of the program, with no device variables involved:
I constructed an MPI derived data type and did the MPI_Isend/MPI_Recv/MPI_Wait communication.
I got the expected output:

I am rank 1, of size 2
Before comm, rank 1, f.a[0]=2.000000, f.b[0]=20.000000
Before comm, rank 1, f.a[1]=2.000000, f.b[1]=20.000000
Before comm, rank 1, f.a[2]=2.000000, f.b[2]=20.000000
Before comm, rank 1, f.a[3]=0.000000, f.b[3]=0.000000
Before comm, rank 1, f.a[4]=0.000000, f.b[4]=0.000000
Before comm, rank 1, f.a[5]=0.000000, f.b[5]=0.000000
I am rank 1, still ok.
After comm, rank 1, f.a[0]=2.000000, f.b[0]=20.000000
After comm, rank 1, f.a[1]=2.000000, f.b[1]=20.000000
After comm, rank 1, f.a[2]=2.000000, f.b[2]=20.000000
After comm, rank 1, f.a[3]=1.000000, f.b[3]=10.000000
After comm, rank 1, f.a[4]=1.000000, f.b[4]=10.000000
After comm, rank 1, f.a[5]=1.000000, f.b[5]=10.000000
I am rank 0, of size 2
Before comm, rank 0, f.a[0]=1.000000, f.b[0]=10.000000
Before comm, rank 0, f.a[1]=1.000000, f.b[1]=10.000000
Before comm, rank 0, f.a[2]=1.000000, f.b[2]=10.000000
Before comm, rank 0, f.a[3]=0.000000, f.b[3]=0.000000
Before comm, rank 0, f.a[4]=0.000000, f.b[4]=0.000000
Before comm, rank 0, f.a[5]=0.000000, f.b[5]=0.000000
I am rank 0, still ok.
After comm, rank 0, f.a[0]=1.000000, f.b[0]=10.000000
After comm, rank 0, f.a[1]=1.000000, f.b[1]=10.000000
After comm, rank 0, f.a[2]=1.000000, f.b[2]=10.000000
After comm, rank 0, f.a[3]=2.000000, f.b[3]=20.000000
After comm, rank 0, f.a[4]=2.000000, f.b[4]=20.000000
After comm, rank 0, f.a[5]=2.000000, f.b[5]=20.000000
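For reference, the type construction and communication in that host-only version are essentially the same as in the full device code further below, only built on the host struct f; roughly:

MPI_Get_address(&f, &base);
MPI_Get_address(f.a, &array_of_displacements[0]);
MPI_Get_address(f.b, &array_of_displacements[1]);
array_of_displacements[0]=MPI_Aint_diff(array_of_displacements[0], base);
array_of_displacements[1]=MPI_Aint_diff(array_of_displacements[1], base);
MPI_Type_create_struct(ntype, array_of_blocklengths, array_of_displacements, array_of_types, &stype);
MPI_Type_commit(&stype);
// rtype is built the same way, from f.a+n and f.b+n

MPI_Isend(&f, 1, stype, dest, tag1, comm1, &request);
MPI_Recv( &f, 1, rtype, dest, tag1, comm1, &status);
MPI_Wait(&request, &status);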

Next, I modified the program to declare a device variable f_d (with cudaMalloc for f_d.a and f_d.b),
construct the MPI datatypes for f_d, copy f.a and f.b to f_d.a and f_d.b respectively,
do the MPI communication on the device buffers,
and copy back from device to host, hoping to get the same output.
However, it fails right before the MPI_Isend.
The error I got is
Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f8a66400000)
followed by a backtrace pointing into libraries of the NVIDIA OpenMPI installation.
Here is my code:
Header:

struct teststruc {
double *a;
double *b;
};

#ifndef EXTERN
#define EXTERN extern
#endif

EXTERN int n;
EXTERN struct teststruc f, f_d;

Main function:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <cuda_runtime.h>
#define EXTERN
#include "teststructptr.h"

int main(int argc, char *argv[]) {

MPI_Comm comm1=MPI_COMM_WORLD;
int nproc, rank;
MPI_Request request;
MPI_Status status;

MPI_Init(&argc, &argv);
MPI_Comm_size(comm1, &nproc);
MPI_Comm_rank(comm1, &rank);
int iroot=0;
printf("I am rank %d, of size %d\n", rank, nproc);
  // pick a device based on the local (per-node) rank
  int dev_id = -1;
  {
    MPI_Comm local_comm;
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, info, &local_comm);

    MPI_Comm_rank(local_comm, &dev_id);

    MPI_Comm_free(&local_comm);
    MPI_Info_free(&info);
  }

  {
    int num_devs = 0;
    cudaGetDeviceCount(&num_devs);
    dev_id = dev_id % num_devs;
  }
  cudaSetDevice(dev_id);


//  demonstrate sending f.a/b[0..2] of rank 0 to f.a/b[3..5] of rank 1
//  and vice versa.
//  allocate memory and initialize
  n=3;

  f.a = (double*) malloc(2*n*sizeof(double));
  f.b = (double*) malloc(2*n*sizeof(double));
  memset(f.a,0,2*n*sizeof(double));
  memset(f.b,0,2*n*sizeof(double));
  for (int i=0; i<n; ++i) {
          f.a[i]=rank+1;
          f.b[i]=(rank+1)*10;
  }

  cudaMalloc(&f_d.a, 2*n*sizeof(double));
  cudaMalloc(&f_d.b, 2*n*sizeof(double));
//copy h2d
  cudaMemcpy( f_d.a, f.a, 2*n*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy( f_d.b, f.b, 2*n*sizeof(double), cudaMemcpyHostToDevice);

  for (int i=0; i<2*n; ++i) {
      printf("Before comm, rank %d, f.a[%d]=%lf, f.b[%d]=%lf\n", rank, i,f.a[i],i, f.b[i]);
  }
  int sour=rank;
  int dest=(rank+1)%nproc;

  // construct derived data types: stype covers the first n entries of f_d.a and f_d.b,
  // rtype the last n entries, with displacements taken relative to the address of f_d
  int const ntype=2;
  int array_of_blocklengths[ntype];
  MPI_Datatype array_of_types[ntype], stype, rtype;
  MPI_Aint base, array_of_displacements[ntype];

  MPI_Get_address(&f_d, &base);

  array_of_blocklengths[0]=n;   array_of_blocklengths[1]=n;
  array_of_types[0]=MPI_DOUBLE; array_of_types[1]=MPI_DOUBLE;

  MPI_Get_address(f_d.a, &array_of_displacements[0]);
  MPI_Get_address(f_d.b, &array_of_displacements[1]);
  array_of_displacements[0]=MPI_Aint_diff(array_of_displacements[0], base);
  array_of_displacements[1]=MPI_Aint_diff(array_of_displacements[1], base);
  MPI_Type_create_struct(ntype, array_of_blocklengths, array_of_displacements, array_of_types, &stype);
  MPI_Type_commit(&stype);

  MPI_Get_address(f_d.a+n, &array_of_displacements[0]);
  MPI_Get_address(f_d.b+n, &array_of_displacements[1]);
  array_of_displacements[0]=MPI_Aint_diff(array_of_displacements[0], base);
  array_of_displacements[1]=MPI_Aint_diff(array_of_displacements[1], base);
  MPI_Type_create_struct(ntype, array_of_blocklengths, array_of_displacements, array_of_types, &rtype);
  MPI_Type_commit(&rtype);

  int tag1=1;
  printf("I am rank %d, still ok.\n", rank);

  // communication (this is where the failure occurs)
  MPI_Isend(&f_d, 1, stype, dest, tag1, comm1, &request);
  MPI_Recv( &f_d, 1, rtype, dest, tag1, comm1, &status);
  MPI_Wait(&request, &status);

//copy d2h
  cudaMemcpy( f.a, f_d.a, 2*n*sizeof(double), cudaMemcpyDeviceToHost);
  cudaMemcpy( f.b, f_d.b, 2*n*sizeof(double), cudaMemcpyDeviceToHost);

//output
  for (int i=0; i<2*n; ++i) {
      printf("After comm, rank %d, f.a[%d]=%lf, f.b[%d]=%lf\n", rank, i,f.a[i],i, f.b[i]);
  }

  MPI_Type_free(&stype);
  MPI_Type_free(&rtype);

  cudaFree(f_d.a);
  cudaFree(f_d.b);
  free(f.a);
  free(f.b);

  MPI_Finalize();
  return 0;
}

I found that if I do not construct the derived data type and instead use two MPI send/recv pairs, one on f_d.a and one on f_d.b,
it works as expected, but the number of communications doubles.
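Concretely, the working variant without the derived type looks roughly like this (tag2 is just a second tag I introduce here for the b array; the device pointers are passed directly to the CUDA-aware MPI calls):

// send the first n entries of f_d.a and f_d.b, receive into the last n entries
MPI_Request requests[2];
int tag2=tag1+1;
MPI_Isend(f_d.a,   n, MPI_DOUBLE, dest, tag1, comm1, &requests[0]);
MPI_Isend(f_d.b,   n, MPI_DOUBLE, dest, tag2, comm1, &requests[1]);
MPI_Recv( f_d.a+n, n, MPI_DOUBLE, dest, tag1, comm1, &status);
MPI_Recv( f_d.b+n, n, MPI_DOUBLE, dest, tag2, comm1, &status);
MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);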

Anything I have missed?

And then, I tried declaring a single array g (and g_d) large enough to store both f.a and f.b,
and doing the MPI send and recv on g_d with a datatype constructed by MPI_Type_vector with 2 blocks.
This works. However, when I profiled the test I found a pair of device-to-host and a pair of host-to-device
copies, each of 24 bytes x 2 corresponding to the 2 blocks, whereas I believe the communication
should stay on the device and should not be staged through the host.
When I do not use a derived datatype and send/recv f_d.a and f_d.b separately, I do not see such host-device copies.
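For reference, the packed-array variant I mean is roughly the following (assuming a layout where the a part occupies the first 2*n doubles of g_d and the b part the next 2*n; my actual code may differ slightly):

// one device array holding both components: [ a(0..2n-1) | b(0..2n-1) ]
double *g_d;
cudaMalloc(&g_d, 4*n*sizeof(double));

// 2 blocks of n doubles with a stride of 2*n doubles:
// starting at g_d this covers the first n entries of each part,
// starting at g_d+n it covers the last n entries of each part
MPI_Datatype vtype;
MPI_Type_vector(2, n, 2*n, MPI_DOUBLE, &vtype);
MPI_Type_commit(&vtype);

MPI_Isend(g_d,   1, vtype, dest, tag1, comm1, &request);
MPI_Recv( g_d+n, 1, vtype, dest, tag1, comm1, &status);
MPI_Wait(&request, &status);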
I have read some discussions about this copying issue with CUDA and MPI, but I am not sure they cover exactly the issue
I encountered; excuse me if this part has already been asked and answered.
I attached this code as well.
I also tried similar things in CUDA Fortran and noticed the same issue.

Thanks in advance for help.
sendrecv_struct2.txt (2.6 KB)

f_d is a structure in host memory. It is not a device variable.


Thanks Robert. I see your point.
Then, if we want to communicate between 2 devices
with CUDA-aware MPI on 2 or more components of a struct that live on the device, like f_d.a and f_d.b, is it possible to do that with a single send/recv, if this MPI derived data type construction approach is not the way (or is there a way to make it work after all)?

I also tried declaring f_d as a pointer to teststruc and then cudaMalloc'ing f_d itself,
hoping that f_d would then point to memory on the device.
It still compiled, but I got a crash right before allocating a and b.

  cudaMalloc(&f_d, sizeof(teststruc));      // f_d is now declared as: struct teststruc *f_d;
  cudaMalloc(&f_d->a, 2*n*sizeof(double));
  cudaMalloc(&f_d->b, 2*n*sizeof(double));

I got

 (Segmentation fault: invalid permissions for mapped object at address 0x7f5d3a400000)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)

Any step or concept I am still missing? Thanks.
Excuse me, I don't have much experience with CUDA.

That is illegal in CUDA. You have previously done a cudaMalloc operation on f_d, a location in host memory, assigning it a pointer that points to device memory. Now, in your next cudaMalloc operation, you are attempting to tell cudaMalloc to store its allocated pointer value in a location in device memory. That is not how cudaMalloc works. It requires a location in host memory to store the allocated pointer value in.

This:

&f_d

is a location in host memory.

This:

&f_d->a

is a location in device memory.

This topic has been covered numerous times, in numerous forum posts. Here is one example but there are many others. It falls into a general area of behavior that I would refer to as deep-copy methodology, which would normally be required for allocating, in device memory, a structure that has embedded pointers to dynamically allocated locations in device memory.
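For your teststruc, the deep-copy pattern would look something like this sketch (h_f is just a hypothetical host-side staging name):

struct teststruc h_f;                        // host-side copy of the struct
cudaMalloc(&h_f.a, 2*n*sizeof(double));      // device arrays; the pointer values live on the host
cudaMalloc(&h_f.b, 2*n*sizeof(double));

struct teststruc *f_d;                       // the struct itself, in device memory
cudaMalloc(&f_d, sizeof(struct teststruc));
cudaMemcpy(f_d, &h_f, sizeof(struct teststruc), cudaMemcpyHostToDevice);  // copy the two pointer values over

// host code keeps using h_f.a / h_f.b for cudaMemcpy and MPI calls;
// device kernels can dereference f_d->a / f_d->b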

I probably won't be able to answer the question for you: "how do I create a derived type in CUDA-aware MPI that refers to two separate device-side allocations but only requires one send/receive op to transfer both of them?"

When I have used cuda-aware MPI in the past, people took pains to organize problems (even things like halo exchange, where you could imagine multiple separate regions to be transferred) such that this approach never came up.

In CUDA, it's not possible to transfer two separate dynamically allocated regions with a single cudaMemcpy operation; it would require 2. It's entirely possible that the implementation of cuda-aware MPI might seek to work around that limitation by doing some kind of arbitrary, work-extensible copy-kernel approach, but I kind of doubt it, and even if it did, I certainly don't know how to tap into it.

So there might be a way to transfer multiple independent regions with a single send/recv op, but I don’t know how to do it, and I suspect that even if you figured out a way, there are going to be two transfers under the hood.

Instead, what I have seen in the past is that when people have multiple independent chunks of data to transfer via cuda-aware MPI, they create scatter/gather steps that they insert into the processing pipeline. Something like this pseudocode:

for i in iteration_steps:

  • copy from transfer buffer to actual halo or wherever the independent chunks of data need to go
  • perform my iteration step work on my chunk of data
  • copy the new independent chunk data into the transfer buffer
  • do the send/recv op on the transfer buffer

The two copy steps I list above are device-to-device copies, so they are pretty quick.
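As a rough sketch of one exchange step for your two arrays (buf_d and recvbuf_d are hypothetical contiguous device buffers of 2*n doubles each):

// gather: pack the pieces to be sent into one contiguous device buffer
cudaMemcpy(buf_d,   f_d.a, n*sizeof(double), cudaMemcpyDeviceToDevice);
cudaMemcpy(buf_d+n, f_d.b, n*sizeof(double), cudaMemcpyDeviceToDevice);

// one send/recv pair on the contiguous buffer (device pointers, CUDA-aware MPI)
MPI_Sendrecv(buf_d,     2*n, MPI_DOUBLE, dest, tag1,
             recvbuf_d, 2*n, MPI_DOUBLE, dest, tag1,
             comm1, MPI_STATUS_IGNORE);

// scatter: unpack the received data into the halo locations
cudaMemcpy(f_d.a+n, recvbuf_d,   n*sizeof(double), cudaMemcpyDeviceToDevice);
cudaMemcpy(f_d.b+n, recvbuf_d+n, n*sizeof(double), cudaMemcpyDeviceToDevice);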

I’m fairly certain that if you look through this body of work, you can get some ideas.


Thanks for your detailed reply, and excuse me for bringing up a topic that has been
discussed elsewhere. I am looking at this to find a possibly better way than what I
am currently doing.
And yes, the motivation for my question is indeed halo communication. I have also tried what you suggested: allocating an extra buffer to hold all the data I need to communicate, doing a single send/recv on it, and copying on the device; I just wondered whether I could do it with an MPI datatype constructor instead.
Once again, thanks for the suggestion and the insight into the issue; the GitHub link is very informative for what I should know.