Failure installing NVSHMEM 2.10

I want to install NVSHMEM 2.10 on Ubuntu 16.04 with 8× P100 GPUs, and the machine does not have InfiniBand.
I followed the guide at https://docs.nvidia.com/nvshmem/release-notes-install-guide/install-guide/nvshmem-install-proc.html, and this is my Makefile:

> #
> # Copyright (c) 2016-2021, NVIDIA CORPORATION. All rights reserved.
> #
> # See COPYRIGHT for license information
> #
> 
> # Define this variable for the Include Variable in common.mk
> mkfile_path := $(abspath $(firstword $(MAKEFILE_LIST)))
> mkfile_dir := $(dir $(mkfile_path))
> 
> # External dependencies
> include common.mk
> include version.mk
> 
> # Build/install location
> # location where the build will be installed
> NVSHMEM_PREFIX ?= /home/xxx/nvshmem/
> # build location
> NVSHMEM_BUILDDIR ?= $(abspath build)
> # MPI/SHMEM Support
> 
> NVSHMEM_IBRC_SUPPORT=0
> NVSHMEM_IBGDA_SUPPORT=0
> NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=0
> # GDRCopy install/headers location
> GDRCOPY_HOME ?= /home/xxx/mpi/gdrcopy/gdrcopy-2.3
> # NCCL install/headers location
> NCCL_HOME ?= /usr/local/nccl
> NVSHMEM_USE_NCCL ?= 0
> # Whether to build with MPI support. If yes, MPI_HOME should be set
> NVSHMEM_USE_GDRCOPY ?= 1
> # Include support for PMIx as the process manager interface
> PMIX_HOME ?= /usr
> # One of the below can be set to 1 to override the default PMI
> NVSHMEM_DEFAULT_PMIX ?= 0
> NVSHMEM_DEFAULT_PMI2 ?= 0
> 
> # This can be set to override the default remote transport.
> NVSHMEM_DEFAULT_UCX ?= 0
> 
> # NVSHMEM internal features
> NVSHMEM_TRACE ?= 0
> NVSHMEM_NVTX ?= 1
> 
> NVSHMEM_DISABLE_COLL_POLL ?= 1
> NVSHMEM_GPU_COLL_USE_LDST ?= 0
> # Timeout if stuck for long in wait loops in device
> NVSHMEM_TIMEOUT_DEVICE_POLLING ?= 0
> # Use dlmalloc (instead of custom_malloc) as heap
> # allocator (will work only if not using CUDA VMM)
> NVSHMEM_USE_DLMALLOC ?= 0
> 
> CXXFLAGS+=-std=c++11
> NVSHMEM_UCX_SUPPORT=1
> UCX_HOME=/home/xxx/ucx

When I then run make, it fails with:

> In file included from src/host/mem/mem.cpp:39:0:
> src/include/common/nvshmem_common_ibgda.h:12:31: fatal error: infiniband/mlx5dv.h: No such file or directory
> compilation terminated.
> Makefile:676: recipe for target '/home/xxx/nvshmem_src_2.10.1-3/build/obj_nvshmem/host/mem/mem.o' failed
> make: *** [/home/xxx/nvshmem_src_2.10.1-3/build/obj_nvshmem/host/mem/mem.o] Error 1

What should I do?

Hello, this is a bug in NVSHMEM. Thanks for reporting it.

The header src/include/common/nvshmem_common_ibgda.h should be guarded by a macro, but it isn’t.
You can work around this in your existing repo by editing src/host/mem/mem.cpp, changing

> #include "common/nvshmem_common_ibgda.h"

to

> #ifdef NVSHMEM_IBGDA_SUPPORT
> #include "common/nvshmem_common_ibgda.h"
> #endif

I am submitting an MR for this fix, which will be merged into our next release (not 2.11, which is already baked, but 2.11+).

Thanks again for the report.

As a sidenote - I saw that you are building for Pascal.

As of 2.7, we no longer officially support Pascal. You are welcome to give it a go. There is no specific reason it shouldn’t work right now - in fact I occasionally use Pascal for dev work still.

But in order to build for that arch, you will want to set NVCC_GENCODE appropriately.
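
For example, a sketch of what that could look like (P100 is SM 6.0, i.e. compute_60/sm_60; the exact -gencode string may need adjusting for your toolkit):

> # hypothetical invocation targeting Pascal
> make -j NVCC_GENCODE="-gencode=arch=compute_60,code=sm_60" install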

Also, heads up - in 2.11+ we will be removing the Makefiles from the repo. Makefile support was deprecated in 2.9.

Thanks a lot. I have another question: I want to run my program on a remote server with an aarch64 CPU, and it does not support NVSHMEM on CUDA 11.8. I want to overlap communication (allreduce) with computation, but NCCL allreduce interferes with the computation, and MPI allreduce is blocking. What other tools would you recommend?

You are correct that NVSHMEM is not supported on CUDA < 12. We don’t build against or test on the older CUDA versions.

You are welcome to try compiling against the older CUDA on aarch64 for evaluation (I have never tried it, and am not sure it works).

If you have a request for this support - please send an e-mail to nvshmem@nvidia.com with your use case and we can discuss offline.
