Jetpack with cuda-aware OMPI could be default

@AastaLLL, I compiled and installed OpenMPI/UCX on my Jetsons as you described, then I wrote a small program to test the cuda-awareness by copying contents from the device in mpi_rank 0 to the device in mpi_rank 1. It doesn't work: MPI complains about a bad address, which goes away when I restrict the transfer to host memory on both ends. Please see below; I hope it is useful for other people trying this on their Tegra clusters:

#include <cstdio>
#include <mpi.h>

__global__ void print_val(float *data, const int LEN);

int main(int argc, char **argv)
	{
	const int	LENGTH		= 32;
	int			mpi_rank	= 0,
				mpi_size	= 0;
	float		host_data[LENGTH],
				*dev_data;

	MPI_Init(&argc, &argv);

	MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
	MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

	if(mpi_rank == 0)
		for(int i = 0; i < LENGTH; i++)
			host_data[i] = (float) i * 0.5f;

	cudaMalloc((void **) &dev_data, LENGTH * sizeof(float));
	cudaMemset(dev_data, 0, LENGTH * sizeof(float));	// zero the device buffer on every rank

	if(mpi_rank == 0)
		{
		cudaMemcpy(dev_data, host_data, LENGTH * sizeof(float), cudaMemcpyHostToDevice);
		MPI_Send(dev_data, LENGTH, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
		}

	if(mpi_rank == 1)
		{
		MPI_Recv(dev_data, LENGTH, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
		//cudaMemcpy(dev_data, host_data, LENGTH * sizeof(float), cudaMemcpyHostToDevice); // uncomment (and receive into host_data above) if staging through host memory
		print_val <<< 1, 1 >>> (dev_data, LENGTH);
		cudaDeviceSynchronize(); // make sure the kernel's printf output is flushed before finalizing
		}

	cudaFree(dev_data);

	MPI_Finalize();
	
	return 0;
	}

__global__ void print_val(float *data, const int LEN)
	{
	printf("%.5f\n", data[LEN - 1]);
	}

I compiled it with the following lines:

nvcc -Xcompiler -Wall -c -o cuda_aware.o cuda_aware_test.cu -I/usr/lib/aarch64-linux-gnu/openmpi/include
mpic++.openmpi -o cuda_aware cuda_aware.o -I/usr/local/cuda-10.2/include -L/usr/local/cuda-10.2/lib64 -lcudart

Then I run it with:

mpiexec.openmpi --hostfile ~/MPI_Nodes.txt --map-by ppr:1:node --mca btl_tcp_if_include 192.168.1.0/24 ./cuda_aware

MPI_Nodes.txt is my MPI hostfile and it lists the nodes of the cluster (a placeholder sketch is below), and my SSH environment is already configured, so the processes launch on the remote nodes without issues.
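For reference, such a hostfile is just one node per line. The hostnames below are hypothetical placeholders, not my actual machines:

# MPI_Nodes.txt -- hypothetical hostnames, one cluster node per line
jetson-01 slots=1
jetson-02 slots=1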

Notice that every rank has an array of floats, with rank 0 initializing it to some values, and all ranks allocate space on their device. Rank 0 copies the initialized array to its device memory and tries to send it to rank 1's device memory. If you want to receive into host memory instead, change MPI_Recv to receive into host_data and uncomment the host-to-device copy in rank 1 (but that won't work either, because the bad address comes from sending out of device memory in rank 0). In the end, rank 1 should print the last element from its device memory.

If you have a couple of Jetsons ready to use with MPI, try whatever combinations you want; it only works when copying from host memory to host memory (that is, no cuda-awareness).
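As a side check, it may be worth asking Open MPI itself whether it was built CUDA-aware before running the test. Below is a minimal sketch, assuming your Open MPI installs the mpi-ext.h extensions header (recent releases do); compile and link it with the same mpic++.openmpi wrapper as above:

#include <cstdio>
#include <mpi.h>
#include <mpi-ext.h>	// Open MPI extensions; defines MPIX_CUDA_AWARE_SUPPORT if built with CUDA

int main(int argc, char **argv)
	{
	MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
	printf("Compile time: this Open MPI was built with CUDA-aware support\n");
#else
	printf("Compile time: CUDA-aware support is absent or unknown\n");
#endif

#if defined(MPIX_CUDA_AWARE_SUPPORT)
	// Runtime check: returns 1 when CUDA-aware support is actually available
	printf("Run time: CUDA-aware support is %s\n", MPIX_Query_cuda_support() ? "enabled" : "disabled");
#endif

	MPI_Finalize();

	return 0;
	}

If this reports no CUDA-aware support, the bad address is to be expected; if it reports support and the device-to-device send still fails, that would point at the UCX/CUDA transport configuration rather than at the test program.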

Another comment I want to make, this time for the JetPack maintainers, is that on 4.6 you cannot run any CUDA program unless it is launched from Docker. cuda-memcheck reports that all devices are busy or unavailable, and I could only fix this after reading this NV forums thread. I agree it should be fixed in the next JetPack releases, just as it was in previous releases.

Let me know what you think.