Using managed memory with MPI

I’ve been trying to use the CUDA managed memory API to simplify my multi-GPU code. My test machine currently has two GPUs, and the following simple code fails with the error “Fatal UVM CPU fault due to invalid operation”:

#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <cuda_runtime.h>
#include "kernel.h"

int main(int argc, char *argv[])
{
	int myrank;
	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
	int n = 256;
	double* d_data_send, *d_data_recv;
	cudaMallocManaged((void**)&d_data_send, sizeof(double) * n);
	cudaMallocManaged((void**)&d_data_recv, sizeof(double) * n);
	for(int i = 0; i < n; i++) d_data_send[i] = myrank;
	MPI_Request requests[2];
	MPI_Isend(d_data_send, n, MPI_DOUBLE, (myrank + 1) % 2, 0, MPI_COMM_WORLD, &requests[0]);
	MPI_Irecv(d_data_recv, n, MPI_DOUBLE, (myrank + 1) % 2, 0, MPI_COMM_WORLD, &requests[1]);
	double* d_data_processing;
	cudaMalloc((void**)&d_data_processing, sizeof(double) * n);
	set_seq(n, d_data_processing);
	MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
	printf("The deed is done\n");
	MPI_Finalize();
	return 0;
}

where set_seq is the following simple kernel:

__global__ void set_seq_kernel(int n, double* data)
{
	int i = blockIdx.x*blockDim.x + threadIdx.x;
	if (i < n) data[i] = i;
}

void set_seq(int n, double* data)
{
	int block_size = 256;
	set_seq_kernel<<<(n + block_size - 1) / block_size, block_size>>>(n, data);
}

I’m basically trying to do some computations while data is being exchanged between the two GPUs. If I don’t execute the kernel there are no issues, and if I skip the exchange and only run the kernel, everything works as well. When I try to do both, it fails with the above error under cuda-memcheck. Am I doing something I’m not supposed to?

What sort of GPU are you running on? What is the OS and CUDA version?

I’m using two K40 GPUs with CUDA 10 on Linux.

Add a cudaDeviceSynchronize(); after the kernel call:

void set_seq(int n, double* data)
{
	int block_size = 256;
	set_seq_kernel<<<(n + block_size - 1) / block_size, block_size>>>(n, data);
	cudaDeviceSynchronize();
}

and this may not completely fix the issue. On pre-Pascal GPUs it is illegal for host code to touch a managed allocation after a kernel has been launched but before a cudaDeviceSynchronize() has been issued.

You are violating that rule.
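
To illustrate, here is a minimal, self-contained sketch of that rule. The touch kernel and the buffer m below are placeholders for this example only, not part of your code:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to illustrate the pre-Pascal rule.
__global__ void touch(double *data) { data[threadIdx.x] += 1.0; }

int main()
{
	double *m;
	cudaMallocManaged((void**)&m, sizeof(double) * 256);
	for (int i = 0; i < 256; i++) m[i] = 0.0;  // host access before any launch is fine

	touch<<<1, 256>>>(m);
	// m[0] = 1.0;            // ILLEGAL on pre-Pascal: host touches a managed
	//                        // allocation after a launch, before the sync
	cudaDeviceSynchronize();  // kernel done; managed data is visible to the host again
	printf("%f\n", m[0]);     // legal now
	cudaFree(m);
	return 0;
}

The restriction applies to managed allocations in general once a kernel has been launched, which is why the MPI library touching your managed buffers during the kernel is a problem.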

Thanks for the quick replies. If I add the sync it seems to work. Is the managed memory access you’re referring to the non-blocking MPI communication?
If I change the communication to blocking send and recv, the issue still pops up if I don’t use the synchronize.
Am I forced to synchronize after each kernel call when managed memory is touched on pre-pascal GPUs?

Edit: nvm, I had added in some code that printed out the received data after the kernel executed. Removing that access fixes it. Thanks for your help!

Yes, the non-blocking MPI communication is an issue here, and adding the cudaDeviceSynchronize() does not completely solve it, since that communication can occur at any time between the Isend/Irecv calls and the completion of the wait operation.

Yes, in a pre-Pascal environment, after launching a kernel, host code must not touch any managed allocation until a cudaDeviceSynchronize() is issued. If you think through the ramifications of this coupled with your non-blocking MPI operations that can fire “at any time”, I think you will see the hazard.
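
If you want to keep the compute/communication overlap on pre-Pascal hardware, one possible restructuring (a sketch, not the only option) is to take managed memory out of the picture entirely: stage the exchange through pinned host buffers and keep the kernel’s scratch space in ordinary cudaMalloc memory, so nothing the MPI library touches is managed. The buffer names here are illustrative; if your MPI is CUDA-aware you could instead pass cudaMalloc’d device pointers to MPI directly.

#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>
#include "kernel.h"

int main(int argc, char *argv[])
{
	int myrank;
	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
	int n = 256;

	// Exchange buffers in pinned host memory: the MPI library can touch
	// these at any time without hitting the pre-Pascal managed-memory rule.
	double *h_send, *h_recv;
	cudaMallocHost((void**)&h_send, sizeof(double) * n);
	cudaMallocHost((void**)&h_recv, sizeof(double) * n);
	for (int i = 0; i < n; i++) h_send[i] = myrank;

	MPI_Request requests[2];
	MPI_Isend(h_send, n, MPI_DOUBLE, (myrank + 1) % 2, 0, MPI_COMM_WORLD, &requests[0]);
	MPI_Irecv(h_recv, n, MPI_DOUBLE, (myrank + 1) % 2, 0, MPI_COMM_WORLD, &requests[1]);

	// Independent work buffer in ordinary device memory; the kernel can
	// overlap with the communication because no managed allocation is involved.
	double *d_data_processing;
	cudaMalloc((void**)&d_data_processing, sizeof(double) * n);
	set_seq(n, d_data_processing);

	MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
	cudaDeviceSynchronize();

	// If the received data is needed on the GPU, copy it over explicitly.
	double *d_recv;
	cudaMalloc((void**)&d_recv, sizeof(double) * n);
	cudaMemcpy(d_recv, h_recv, sizeof(double) * n, cudaMemcpyHostToDevice);

	printf("The deed is done\n");
	MPI_Finalize();
	return 0;
}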