Using NVSHMEM to Build a PyTorch Operator

Hi, All

I want to build a PyTorch operator using NVSHMEM.
Is there a way to do that? When we build a standalone NVSHMEM application written in pure C++ and CUDA C, we launch it with nvshmrun -n 2. In PyTorch, how could we achieve the same goal?

Thanks!

Daniel, NVSHMEM can be initialized using MPI as well. It can use the same bootstrap mechanism that you are already using to run the MPI backend. How to initialize NVSHMEM with MPI is shown here: NVIDIA OpenSHMEM Library (NVSHMEM) Documentation — NVSHMEM 2.6.0 documentation
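For reference, the MPI-bootstrapped initialization shown in that documentation follows roughly this pattern (a minimal sketch, assuming one GPU per PE; the NVSHMEMX_TEAM_NODE device selection is illustrative and may need to match your launch configuration). A program initialized this way is launched with mpirun rather than nvshmrun:

#include <mpi.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    // Hand the MPI communicator to NVSHMEM as the bootstrap mechanism.
    nvshmemx_init_attr_t attr;
    MPI_Comm comm = MPI_COMM_WORLD;
    attr.mpi_comm = &comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

    // Select one GPU per PE based on the PE's rank within the node.
    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);

    // ... allocate symmetric memory and launch NVSHMEM kernels here ...

    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}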

If you are writing an NVSHMEM backend, you can use the above example code to initialize NVSHMEM.

We are also curious which communication primitives you are looking at (alltoall, allreduce, etc.), and what use case or project you are trying to use NVSHMEM for.

Thanks for your reply!

I am currently evaluating the potential of NVSHMEM for graph processing, which is essentially an irregular-memory-access workload. I want to build a PyTorch operator for it, much like cuGraph does.
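Roughly, what I have in mind is a CUDA extension along these lines (just a sketch with made-up names like gather_remote; it assumes NVSHMEM has already been initialized and that the feature buffer lives in NVSHMEM symmetric memory, e.g. a tensor wrapped around nvshmem_malloc'd storage):

#include <torch/extension.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Hypothetical kernel: each thread fetches one remote feature row with a
// one-sided get from the PE that owns it.
__global__ void neighbor_gather_kernel(float *out, const float *sym_features,
                                       const int *owner_pe, const int64_t *row,
                                       int64_t n, int64_t dim) {
    int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
    if (i < n) {
        nvshmem_float_get(out + i * dim, sym_features + row[i] * dim, dim, owner_pe[i]);
    }
}

torch::Tensor gather_remote(torch::Tensor sym_features, torch::Tensor owner_pe,
                            torch::Tensor row) {
    int64_t n = row.size(0);
    int64_t dim = sym_features.size(1);
    auto out = torch::empty({n, dim}, sym_features.options());
    int threads = 256;
    int blocks = (int)((n + threads - 1) / threads);
    neighbor_gather_kernel<<<blocks, threads>>>(
        out.data_ptr<float>(), sym_features.data_ptr<float>(),
        owner_pe.data_ptr<int>(), row.data_ptr<int64_t>(), n, dim);
    return out;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("gather_remote", &gather_remote, "NVSHMEM-based neighbor gather (sketch)");
}

The idea is that each PE would own a partition of the graph and its features, and the operator would pull remote rows on demand.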

I also ran into another problem when compiling my program on an HPC system (a DGX with 4 Tesla V100 32GB GPUs). Because I don't have root access on those servers, I installed OpenMPI under my home directory (/home/user/openmpi) and used the compile command from the official website:

nvcc -rdc=true -ccbin g++ -arch=$NVCC_GENCODE \
    -I$NVSHMEM_HOME/include \
    -Iinclude \
    src/app.cu \
    -o app \
    -L$NVSHMEM_HOME/lib \
    -lnvshmem \
    -lcuda \
    -Xcompiler -pthread \
    -L$MPI_HOME/lib \
    -lmpi_cxx \
    -lmpi

However, it always fails to link against mpi_cxx (linking against mpi succeeds), and the error looks like:

/usr/bin/ld: cannot find -lmpi_cxx

Could you please help me with this?
Thanks a lot!

It’s possible that the OpenMPI build you are using does not come with the C++ bindings. Does the lib directory of your OpenMPI installation contain libmpi_cxx (e.g., check with ls $MPI_HOME/lib)?

I am using OpenMPI (openmpi-4.1.0.tar.gz) downloaded from Open MPI: Version 4.1. I am not sure whether this version has the C++ library to link against, or whether I should download one of the other two files, openmpi-4.1.0-1.src.rpm or openmpi-4.1.0.tar.bz2.

Do you need the C++ bindings? If not, you can just remove -lmpi_cxx. I think recent OpenMPI versions do not build them by default (and may not even support them).

When I remove the -lmpi_cxx flag, I get link errors like:

undefined reference to `ompi_mpi_cxx_op_intercept'
undefined reference to `MPI::Comm::Comm()'

I think MPI_HOME is set to the right path.

I am not sure why this error appears if OpenMPI was not built with the C++ bindings. Can you try rebuilding OpenMPI with the C++ bindings enabled, using the --enable-mpi-cxx option at configure time?