JetPack with CUDA-aware OMPI could be the default

I am using JetPack 4.6 on a few Xavier NX modules that form a small cluster, and noticed that the OpenMPI currently installed is not CUDA-aware, so it needs to be recompiled.
Maybe this could be considered for future JetPack releases?

======== UPDATE ========

I was trying to compile OpenMPI with CUDA-aware support by following the documentation (OpenMPI Build CUDA); however, GDRCopy is not meant to be used with Tegra, according to this post: Github GDRCopy.

Mat Colgrove also mentions that UCX is not necessary for CUDA-aware OMPI to work; see this SO post.

You can see in the OMPI docs that, in order to build UCX, the configure step already points to GDRCopy (./configure --prefix=/path/to/ucx-cuda-install --with-cuda=/usr/local/cuda --with-gdrcopy=/usr). Since GDRCopy will not compile on Tegra, I assume it can be omitted.
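
A minimal sketch of the corresponding UCX configure call without GDRCopy (the prefix path is just an example) would then be:

# Same configure call as in the OMPI docs, with the GDRCopy flag dropped for Tegra
$ ./configure --prefix=/path/to/ucx-cuda-install --with-cuda=/usr/local/cuda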

When recompiling OpenMPI, should I stick to the default 2.1 version (flagged as retired on the OpenMPI page) that comes with Ubuntu 18.04 in JetPack 4.6, or is it OK to go to 4.1? If you have any suggestions, feel free to comment.

======= UPDATE 2 ========

I managed to compile UCX version 1.11 (1.6, as suggested by the above link, is a no-go) and then OpenMPI 2.1.1 from the tarballs, both with CUDA support.
When I compile and run the compile-time and run-time checker program from the CUDA-aware support page, it reports that the library is CUDA-aware at compile time, but not at run time.

Checking the mpi-ext.h header that the program needs (it was installed in another directory by the OMPI compilation, so I had to fix some symlinks for mpicc to find it), the compile-time check looks at the macro MPIX_CUDA_AWARE_SUPPORT, defined with value 1 in mpiext_cuda_c.h (the JetPack 4.6 factory version has value 0). However, the function MPIX_Query_cuda_support() does not return 1, so the run-time CUDA-awareness check fails (which I believe is what actually matters).
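
As a cross-check, ompi_info can also report whether the library was built with CUDA support (a hedged check; the parameter name assumes a reasonably recent OpenMPI):

# Should print a line ending in mpi_built_with_cuda_support:value:true for a CUDA-enabled build
$ ompi_info --parsable --all | grep mpi_built_with_cuda_support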

If anyone had luck with cuda-awareness with Tegra, let me know.

Hi,

Just checked OpenMPI's documentation.
It seems most of it is written for dGPUs.

Would you mind double-checking with the OpenMPI team to see if they support the integrated GPU first?

Thanks.

I have just posted the question on the OMPI GitHub page and will update here as soon as they reply.
It could very well be the case, just as with GDRCopy.

======= QUICK UPDATE (11/01/2022) =======

One of OpenMPI's contributors included Tommy Janjusic in the conversation; he seems to be an NVIDIA programmer working on the library, so I am just waiting for him to step in and provide some insight. Here.

Hi,

Thanks for checking this with the OpenMPI team.

We are going to compile the library on Jetson to see if there is any quick fix for the issue.
Will share more information with you later.

Thanks.

Hi,

We can build OpenMPI+CUDA on JetPack 4.6 without issues.
Below are our build steps for your reference:

1. Set environment

$ export CUDA_HOME="/usr/local/cuda"
$ export UCX_HOME="/usr/local/ucx"
$ export OMPI_HOME="/usr/local/ompi"
$ export PATH="${CUDA_HOME}/bin:$PATH}"
$ export PATH="{UCX_HOME}/bin:$PATH}"
$ export PATH="{OMPI_HOME}/bin:$PATH}"
$ export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:$LD_LIBRARY_PATH}"
$ export LD_LIBRARY_PATH="${UCX_HOME}/lib64:$LD_LIBRARY_PATH}"
$ export LD_LIBRARY_PATH="${OMPI_HOME}/lib64:$LD_LIBRARY_PATH}"

2. Install UCX

$ git clone https://github.com/openucx/ucx
$ cd ucx/
$ git clean -xfd
$ ./autogen.sh
$ mkdir build
$ cd build
$ ../configure --prefix=$UCX_HOME --enable-debug --with-cuda=$CUDA_HOME --enable-mt --disable-cma
$ make
$ sudo make install
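
Before moving on, the UCX build can be sanity-checked (a hedged extra step, not part of the original instructions; ucx_info ships with UCX):

# Show the UCX version and the flags it was configured with
$ ${UCX_HOME}/bin/ucx_info -v
# List available transports; CUDA-related entries should appear if CUDA support was built in
$ ${UCX_HOME}/bin/ucx_info -d | grep -i cuda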

3. Install MPI

$ git clone https://github.com/open-mpi/ompi.git
$ cd ompi/
$ git submodule update --init --recursive
$ sudo apt-get install -y pandoc
$ ./autogen.pl
$ mkdir build
$ cd build
$ ../configure --with-cuda=$CUDA_HOME --with-ucx=$UCX_HOME
$ make
$ sudo make install

4. Verify

$ ompi_info -a | grep "\-with\-cuda"
Configure command line: '--with-cuda=/usr/local/cuda' '--with-ucx=/usr/local/ucx'

Thanks.


AastaLLL, first of all, thanks for providing this step-by-step.
I did try it and, with some patience, everything compiled and installed on a Xavier NX. It does, however, require the explicit use of mpic++.openmpi and mpiexec.openmpi to compile/run; otherwise the plain mpic++/mpiexec will not find the libraries and complain about unresolved symbols.
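
To see which installation a given wrapper actually resolves to, the standard OpenMPI wrapper flags can help (a hedged check):

# Print the underlying compile/link command, including the include and library paths the wrapper uses
$ mpic++.openmpi --showme
# Confirm which wrapper binaries come first in PATH
$ which mpic++ mpiexec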

When you compile and run the test program below, what does it say for you?

#include <stdio.h>
#include "mpi.h"
#include "mpi-ext.h" /* Needed for CUDA-aware check */
 
int main(int argc, char *argv[])
{
    printf("Compile time check:\n");
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("This MPI library has CUDA-aware support.\n");
#elif defined(MPIX_CUDA_AWARE_SUPPORT) && !MPIX_CUDA_AWARE_SUPPORT
    printf("This MPI library does not have CUDA-aware support.\n");
#else
    printf("This MPI library cannot determine if there is CUDA-aware support.\n");
#endif /* MPIX_CUDA_AWARE_SUPPORT */
 
    printf("Run time check:\n");
#if defined(MPIX_CUDA_AWARE_SUPPORT)
    if (1 == MPIX_Query_cuda_support()) {
        printf("This MPI library has CUDA-aware support.\n");
    } else {
        printf("This MPI library does not have CUDA-aware support.\n");
    }
#else /* !defined(MPIX_CUDA_AWARE_SUPPORT) */
    printf("This MPI library cannot determine if there is CUDA-aware support.\n");
#endif /* MPIX_CUDA_AWARE_SUPPORT */
 
    return 0;
}

If it says that the library is not CUDA-aware at compile time or run time, then it may be that the "mpi.h" and "mpi-ext.h" being picked up are the wrong ones.
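
One way to check what the picked-up headers actually define (the install prefix below is only an assumption):

# Look for the macro definition in the headers of the rebuilt install (prefix assumed)
$ grep -rn MPIX_CUDA_AWARE_SUPPORT /usr/local/include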

Hi,

We can get the compile-time CUDA support, but somehow MPIX_Query_cuda_support() returns false.
Let us check this further. Will share more information with you later.

$ mpic++ test.cpp -o test
$ ./test
Compile time check:
This MPI library has CUDA-aware support.
Run time check:
This MPI library does not have CUDA-aware support.

Thanks.

@AastaLLL, thanks for your time and patience looking into all of this.
From the OMPI discussions, it seems that this function only queries at run time whether OMPI was built with CUDA; it isn't really testing the functionality. For your own reference, see this thread.

I am integrating both dGPUs and Tegras in my OMPI project, hoping to use the same CUDA-aware code for host-device-host data copies. Let me do this so I can accept your answer and we can close this: I will write a minimal program that MPI-sends some data from one Tegra device to another, and see whether it actually works despite what MPIX_Query_cuda_support() says.
I will update as soon as I have tested it.

@AastaLLL, I compiled and installed OpenMPI/UCX on my Jetsons as you described, then wrote a small program to test CUDA-awareness by copying contents from the device on mpi_rank 0 to the device on mpi_rank 1. It doesn't work: MPI complains about a bad address, which goes away only when I restrict the copy to host memory on both sides. Please see below; I hope it serves other people who want to try it on their Tegra clusters:

#include <cstdio>
#include <mpi.h>

__global__ void print_val(float *data, const int LEN);

int main(int argc, char **argv)
	{
	const int	LENGTH		= 32;
	int			mpi_rank	= 0,
				mpi_size	= 0;
	float		host_data[LENGTH],
				*dev_data;

	MPI_Init(&argc, &argv);

	MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
	MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

	if(mpi_rank == 0)
		for(int i = 0; i < LENGTH; i++)
			host_data[i] = (float) i * 0.5f;

	cudaMalloc((void **) &dev_data, LENGTH * sizeof(float));
	cudaMemset(dev_data, 0, LENGTH * sizeof(float));

	if(mpi_rank == 0)
		{
		cudaMemcpy(dev_data, host_data, LENGTH * sizeof(float), cudaMemcpyHostToDevice);
		MPI_Send(dev_data, LENGTH, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
		}

	if(mpi_rank == 1)
		{
		MPI_Recv(dev_data, LENGTH, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
		//cudaMemcpy(dev_data, host_data, LENGTH * sizeof(float), cudaMemcpyHostToDevice); // uncomment if MPI_Recv receives into host_data instead of dev_data
		print_val <<< 1, 1 >>> (dev_data, LENGTH);
		}

	cudaFree(dev_data);

	MPI_Finalize();
	
	return 0;
	}

__global__ void print_val(float *data, const int LEN)
	{
	printf("%.5f\n", data[LEN - 1]);
	}

I compiled it with the following lines:

nvcc -Xcompiler -Wall -c -o cuda_aware.o cuda_aware_test.cu -I/usr/lib/aarch64-linux-gnu/openmpi/include
mpic++.openmpi -o cuda_aware cuda_aware.o -I/usr/local/cuda-10.2/include -L/usr/local/cuda-10.2/lib64 -lcudart
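
A hedged single-step alternative is to let nvcc use the rebuilt MPI wrapper as its host compiler (assuming the rebuilt OpenMPI lives under /usr/local/ompi; this is essentially the compile line used later in this thread):

# Build directly with nvcc, using the rebuilt mpic++ as host compiler (prefix assumed)
$ /usr/local/cuda-10.2/bin/nvcc --compiler-bindir /usr/local/ompi/bin/mpic++ cuda_aware_test.cu -o cuda_aware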

Then I run with:

mpiexec.openmpi --hostfile ~/MPI_Nodes.txt --map-by ppr:1:node --mca btl_tcp_if_include 192.168.1.0/24 ./cuda_aware

MPI_Nodes.txt is my MPI hostfile listing the nodes of the cluster, and my SSH environment is already configured so the processes start on the remote nodes without issues.
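
For reference, a minimal hostfile of this kind could look like the following (host names and slot counts are just examples):

# Hypothetical ~/MPI_Nodes.txt: one line per node
xavier-nx-01 slots=1
xavier-nx-02 slots=1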

Notice that each node has an array of floats; rank 0 initializes it to some values, and every node allocates space on its device. Rank 0 copies the initialized array to its device memory and tries to send it to rank 1's device memory. If you want to receive into host memory, uncomment the copy from host to device in rank 1 (but that won't work either, because the bad address occurs when sending from device memory on rank 0). In the end, rank 1 should print the last element from its device memory.

If you have a couple of Jetsons ready to use with MPI, try whichever combinations you want; it will only work when copying from host memory to host memory (that is, no CUDA-awareness).

Another comment I want to make, this time for the JetPack maintainers: on 4.6 you won't be able to run any CUDA program unless it is done from Docker; cuda-memcheck will say that all devices are busy or unavailable, and I could only fix this after reading this NV forums thread. I think it should be fixed in the next JetPack releases, just as it was in previous releases.

Let me know what you think.

Hi,

Thanks for sharing this information.
We are going to set up another Jetson to see the result from our side.

For the cuda-memcheck issue, this is a limitation in GPU profiling due to some internal security issues.
So I don’t think there will be a fix in the upcoming release.
Please run cuda-memcheck with root authority to get the output.
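
For example (assuming JetPack 4.6's default CUDA 10.2 path):

# Run the memory checker as root so it can access the GPU
$ sudo /usr/local/cuda-10.2/bin/cuda-memcheck ./cuda_aware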

Thanks.

@AastaLLL , thanks for your time assisting on this.
I will mark your previous reply as the solution, though it looks like CUDA-aware data movement is not supported with OMPI, at least judging from these tests and unfortunately without official confirmation (or denial) from the OMPI forum.
At the same time, I will post there and point to this topic here in case it is of interest to someone else.

Hi,

Sorry for the late update.
We have confirmed that passing a CUDA buffer works with the rebuilt OpenMPI.

Below are the details of the experiment:

$ /usr/local/cuda-10.2/bin/nvcc --compiler-bindir /usr/local/ompi/bin/mpic++ test2.cu -o test
$ /usr/local/ompi/bin/mpiexec --np 2 --mca btl_tcp_if_include root@[IP] ./test
--------------------------------------------------------------------------
WARNING: An invalid value was given for btl_tcp_if_include.  This
value will be ignored.

  Local host: nvidia-desktop
  Value:      root@10.19.107.110
  Message:    Unknown interface name
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: An invalid value was given for btl_tcp_if_include.  This
value will be ignored.

  Local host: nvidia-desktop
  Value:      root@10.19.107.110
  Message:    Unknown interface name
--------------------------------------------------------------------------
rank 0: host->device and send
rank 1: receive
15.50000

test2.cu (1.2 KB)
test2.cu is basically your sample with some CUDA headers and output logging added.
And we set up password-less SSH following the instructions here.
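
For completeness, a minimal password-less SSH setup between two nodes usually boils down to something like this (user and host names are just examples):

# Generate a key pair on the launching node (accept the defaults, empty passphrase)
$ ssh-keygen -t rsa
# Copy the public key to the remote node so mpiexec can log in without a password
$ ssh-copy-id nvidia@xavier-nx-02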

Thanks.


@AastaLLL , thanks for this great update.
What should I change in the OMPI compilation process you previously showed?

Hi,

We set up the device in the same way as mentioned in the previous comment.

Here is the script for your reference:
setup_OMPI.sh (1.4 KB)

$ ./setup_OMPI.sh
...
** Verify OpenMPI
  Configure command line: '--prefix=/usr/local/ompi' '--with-cuda=/usr/local/cuda' '--with-ucx=/usr/local/ucx' '--with-ucx-libdir=/usr/local/ucx/lib'

Thanks.


I have changed the compilation and execution to my own paths (I don't have /usr/local/ompi/bin as you do, so I adjusted to /usr/local/bin) and it indeed runs. 15.5 is the right value to be printed by the second node, though I get one failure (even though the value is printed):

rank 1: receive
--------------------------------------------------------------------------
The call to cuMemHostRegister(0x7f8590ae28, 4, 0) failed.
  Host:  EVH5150
  cuMemHostRegister return value:  801
  Registration cache:  checkmem
--------------------------------------------------------------------------
rank 0: host->device and send
15.50000

I will ask at the OMPI forum what exactly causes this message to show, even though the operation actually happens and the right value is printed.
I will update as soon as I get more info. Thanks a lot, @AastaLLL.

According to the information from the OMPI discussion previously referenced, the message appears because one or more nodes do not have I/O coherency; they pointed me to this Tegra documentation.

In my particular case it happens because of a Nano in the cluster, since the feature only exists from Xavier onwards. It doesn't prevent the CUDA-aware copy from succeeding; as I mentioned, the correct value was printed by the node receiving data into its GPU allocation.

@AastaLLL, I think this was a very fruitful discussion thanks to your assistance, and I maintain my wish to see CUDA-aware OMPI in future JetPacks. :)
