multigpu portable memory problem

szaboka · August 28, 2009, 12:57pm

I have some problems with setting up multigpu code. The setup is the following:

two threads (communicating with MPI), each driving one GPU (runtime API). I am

trying to allocate a mapped and portable memory on each thread, so that both

threads can work with both buffers. The allocation works, and I can send the

address of the allocated buffer via MPI to the other thread. When I am trying

to get the buffer address mapped to the device, I get an error in case of the

buffer of the other thread.

Anyone has idea, if this setup can work at all, and if yes, how to fix it?

Most probably I am not correctly interpreting the statement "portable memory

can be used by all threads". What is the proper way to ‘broadcast’ a portable

memory address among threads? I havent found any example for that.

I attach a simple test case:

#include <mpi.h>

#include <cuda_runtime.h>

#define C_CALL(call) \

	do {\

		cudaError err = call;\

		if( cudaSuccess != err)\

		{\

			fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",\

					__FILE__, __LINE__, cudaGetErrorString(err) );\

			exit(EXIT_FAILURE);\

		}\

	} while (0)

//compile: nvcc -lmpi cudacomm.cu

//run: mpirun -np 2 ./a.out

int main(int argc, char *argv[])

{

	//set up MPI

	int node;

	MPI_Status status;

	MPI_Init(&argc, &argv);

	MPI_Comm_rank(MPI_COMM_WORLD, &node);

	int othernode= 1-node;

	//set up CUDA

	C_CALL(cudaSetDevice(node));

	C_CALL(cudaSetDeviceFlags(cudaDeviceMapHost));

	//alloc sendbuf = portable + mapped memory on host

	float2 *sendbuf;

	C_CALL(cudaHostAlloc((void **) &sendbuf, 1000*sizeof(float2),

				cudaHostAllocPortable|cudaHostAllocMapped));

	//recvbuf= other thread's sendbuf

	float2 *recvbuf;

	MPI_Sendrecv(&sendbuf, sizeof(float2 *), MPI_CHAR, othernode, 0,

			&recvbuf, sizeof(float2 *), MPI_CHAR, othernode, 0,

			MPI_COMM_WORLD, &status);

	printf("node= %d send= %Ld recv= %Ld\n", node, (long)sendbuf, (long)recvbuf);

	//get buffer addresses mapped to device mem

	float2 *d_sendbuf, *d_recvbuf;

	C_CALL(cudaHostGetDevicePointer((void **) &d_sendbuf, sendbuf, 0));

	//recv FAILS with "invalid argument"

	C_CALL(cudaHostGetDevicePointer((void **) &d_recvbuf, recvbuf, 0));

	MPI_Finalize();

	return 0;

}

szaboka · August 30, 2009, 11:28am

I guess the problem is in the difference between threads and processes. Portable memory is portable between threads, while a parallelization via MPI creates processes instead of threads. It would be better to try it with pthread/OpenMP like parallelization, where one has threads instead of processes. Maybe.

Topic		Replies	Views
CUDA/MPI interoperability problem CUDA Programming and Performance	3	2055	December 20, 2013
how to use portable pinned memory for multiple gpu CUDA Programming and Performance	1	3053	September 7, 2009
MultiGPUs newbie question Data transformation problem CUDA Programming and Performance	12	5152	March 18, 2008
MultiGPU example in the CUDA SDK some stack problems CUDA Programming and Performance	5	3125	March 11, 2018
Mapped memory across multiple GPUs CUDA Programming and Performance	3	8738	October 28, 2010
CUDA + CPU threads CUDA Programming and Performance	5	11656	August 20, 2008
MPI + Peer2Peer combine MPI and Peer2Peer CUDA Programming and Performance	5	1814	February 8, 2012
CPU threads and CUDA CUDA Programming and Performance	8	7215	January 15, 2018
Launch multiple kernels while using milti-process service (MPS) CUDA Programming and Performance	6	1299	December 12, 2014
Multithreading and CUDA CUDA Programming and Performance	6	9083	April 14, 2010

multigpu portable memory problem

Related topics