Errors raised when linking NVSHMEM into my application

I want to use NVSHMEM in my application to transfer data between an A100 and an A800 over PCIe.
I am using the newest NVIDIA HPC container, created from nvcr.io/nvidia/nvhpc:23.11-devel-cuda_multi-ubuntu20.04. My NVIDIA driver version is 515.105.01.
However, it raises errors when I link NVSHMEM into my application. I use the NVSHMEM that ships in the container as-is; I only added cuDNN first, since it is not included.
My CMake just uses
target_link_libraries(pipeGnn nvshmem cuda MPI::MPI_CXX cublas cudnn nvidia-ml)
When I run make, it raises:

nvlink error : Undefined reference to ‘_Z21nvshmemi_transfer_rmaIL13threadgroup_t0EL13nvshmemi_op_t4EEvPvS2_mi’ in ‘CMakeFiles/pipeGnn.dir/src/mgg_full_cache.cu.o’
nvlink error : Undefined reference to ‘_Z21nvshmemi_transfer_rmaIL13threadgroup_t1EL13nvshmemi_op_t4EEvPvS2_mi’ in ‘CMakeFiles/pipeGnn.dir/src/mgg_full_cache.cu.o’
nvlink error : Undefined reference to ‘_Z25nvshmemi_transfer_rma_nbiIL13threadgroup_t1EL13nvshmemi_op_t4EEvPvS2_mi’ in ‘CMakeFiles/pipeGnn.dir/src/mgg_full_cache.cu.o’
nvlink error : Undefined reference to ‘_Z21nvshmemi_transfer_rmaIL13threadgroup_t2EL13nvshmemi_op_t4EEvPvS2_mi’ in ‘CMakeFiles/pipeGnn.dir/src/mgg_full_cache.cu.o’
nvlink error : Undefined reference to ‘nvshmemi_device_state_d’ in ‘CMakeFiles/pipeGnn.dir/src/mgg_full_cache.cu.o’
make[2]: *** [CMakeFiles/pipeGnn.dir/build.make:85: CMakeFiles/pipeGnn.dir/cmake_device_link.o] Error 255
make[1]: *** [CMakeFiles/Makefile2:76: CMakeFiles/pipeGnn.dir/all] Error 2
make: *** [Makefile:84: all] Error 2

My application uses MPI to bootstrap NVSHMEM, and the host machine does not support InfiniBand.
How can I fix this? I am sure my code works, because I can run it in another container with NVSHMEM 2.0.3 and CUDA 11.3.
By the way, I compiled it with the older HPC SDK 23.9 and it compiled correctly, but when I started the application it also raised an error and terminated.

Hi 728882065,

I've not encountered this error myself, so I'm not sure what's wrong. The message implies that NVSHMEM isn't getting linked, there's a CUDA version mismatch, or the C++ name mangling is different from what's referenced in the library.

What compiler are you using? nvcc or nvc++?
What is the full link line being used? (you may need to add VERBOSE=1 to your make to see the output).

If you're using nvc++, we do have a convenience flag "-cudalib=cublas,nvshmem" which I recommend using on both the compile and link lines instead of adding the cublas and nvshmem include paths and libraries directly. The compiler will implicitly add these, which ensures there's no CUDA version mismatch.

-Mat

I think it is caused by a CUDA version mismatch. My NVIDIA driver only supports up to CUDA 11.8, but the default CUDA version is 12.3. I changed the symlinks for cuda/nccl/nvshmem in /opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs, /opt/nvidia/hpc_sdk/Linux_x86_64/23.11/cuda and /opt/nvidia/hpc_sdk/Linux_x86_64/23.11/math_libs.
I also changed the default CUDA version in /opt/nvidia/hpc_sdk/Linux_x86_64/23.11/cmake/NVHPCConfig.cmake.
However, the errors still exist. I tried adding -cudalib, but CMake uses nvcc to compile the CUDA code and I don't know how to change that.
I set CMAKE_VERBOSE_MAKEFILE to ON and recompiled.
It shows:

-- The CXX compiler identification is PGI 23.11.0
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Check for working CXX compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/23.11/compilers/bin/nvc++
-- Check for working CXX compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/23.11/compilers/bin/nvc++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/23.11/compilers/bin/nvcc
-- Check for working CUDA compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/23.11/compilers/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Found OpenMP_CXX: -mp  
-- Found OpenMP: TRUE   
-- Found MPI_CXX: /opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/11.8/hpcx/hpcx-2.14/ompi/lib/libmpi.so (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- NVHPC_CUDA_VERSION not specified.
-- Default CUDA version selected: 11.8
-- Configuring done
-- Generating done
-- Build files have been written to: /pipegnn/build1
Scanning dependencies of target pipeGnn
make[2]: Leaving directory '/pipegnn/build1'
make -f CMakeFiles/pipeGnn.dir/build.make CMakeFiles/pipeGnn.dir/build
make[2]: Entering directory '/pipegnn/build1'
[ 33%] Building CUDA object CMakeFiles/pipeGnn.dir/src/mgg_full_cache.cu.o
/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/compilers/bin/nvcc   -I/pipegnn/include -I/pipegnn/cache/include -I/MGG/local/cudnn-v8.2/include -isystem=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/11.8/hpcx/hpcx-2.14/ompi/include -isystem=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/11.8/hpcx/hpcx-2.14/ompi/include/openmpi -isystem=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/11.8/hpcx/hpcx-2.14/ompi/include/openmpi/opal/mca/hwloc/hwloc201/hwloc/include -isystem=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/11.8/hpcx/hpcx-2.14/ompi/include/openmpi/opal/mca/event/libevent2022/libevent -isystem=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/11.8/hpcx/hpcx-2.14/ompi/include/openmpi/opal/mca/event/libevent2022/libevent/include  -Xcompiler -pthread -rdc=true -ccbin g++ -lineinfo -Xcompiler -pthread -x cu -dc /pipegnn/src/mgg_full_cache.cu -o CMakeFiles/pipeGnn.dir/src/mgg_full_cache.cu.o
[ 66%] Linking CUDA device code CMakeFiles/pipeGnn.dir/cmake_device_link.o
/usr/bin/cmake -E cmake_link_script CMakeFiles/pipeGnn.dir/dlink.txt --verbose=1
/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/compilers/bin/nvcc   -Xcompiler=-fPIC -Wno-deprecated-gpu-targets -shared -dlink CMakeFiles/pipeGnn.dir/src/mgg_full_cache.cu.o -o CMakeFiles/pipeGnn.dir/cmake_device_link.o   -L/MGG/local/cudnn-v8.2/lib64  -lnvshmem -lcuda -lcublas -lcudnn -lgomp -lnvidia-ml  
nvlink error   : Undefined reference to '_Z21nvshmemi_transfer_rmaIL13threadgroup_t0EL13nvshmemi_op_t4EEvPvS2_mi' in 'CMakeFiles/pipeGnn.dir/src/mgg_full_cache.cu.o'
nvlink error   : Undefined reference to '_Z21nvshmemi_transfer_rmaIL13threadgroup_t1EL13nvshmemi_op_t4EEvPvS2_mi' in 'CMakeFiles/pipeGnn.dir/src/mgg_full_cache.cu.o'
nvlink error   : Undefined reference to '_Z25nvshmemi_transfer_rma_nbiIL13threadgroup_t1EL13nvshmemi_op_t4EEvPvS2_mi' in 'CMakeFiles/pipeGnn.dir/src/mgg_full_cache.cu.o'
nvlink error   : Undefined reference to '_Z21nvshmemi_transfer_rmaIL13threadgroup_t2EL13nvshmemi_op_t4EEvPvS2_mi' in 'CMakeFiles/pipeGnn.dir/src/mgg_full_cache.cu.o'
nvlink error   : Undefined reference to 'nvshmemi_device_state_d' in 'CMakeFiles/pipeGnn.dir/src/mgg_full_cache.cu.o'
make[2]: *** [CMakeFiles/pipeGnn.dir/build.make:88: CMakeFiles/pipeGnn.dir/cmake_device_link.o] Error 255
make[2]: Leaving directory '/pipegnn/build1'
make[1]: *** [CMakeFiles/Makefile2:79: CMakeFiles/pipeGnn.dir/all] Error 2
make[1]: Leaving directory '/pipegnn/build1'
make: *** [Makefile:87: all] Error 2

Is it possible that this error is caused by the CMake version? I used CMake 3.27 in the NVHPC 23.9 container and it compiled correctly there. But when I create a new container from this image (instead of from the image I saved from my modified container), it no longer works. I'm not sure what modifications I made to that container in the past.
When I use find_library, it finds nvshmem correctly, but linking still fails.
In my program, I include:

#include <nvshmem.h>
#include <nvshmemx.h>

I searched for nvshmemi_device_state_d and found it in /opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/11.8/nvshmem/include/device/nvshmemi_common_device_defines.cuh, which is included by nvshmem.h:

#if defined(__CUDACC_RDC__)
#define EXTERN_CONSTANT extern __constant__
#else
#define EXTERN_CONSTANT static __constant__
#endif
EXTERN_CONSTANT nvshmemi_device_state_t nvshmemi_device_state_d;
#undef EXTERN_CONSTANT
#endif

Looking forward to your reply!

The error disappeared after I updated CMake to 3.27.9. I do not understand what happened.

When I run the application, it raises:

[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[5770da4ecf14:00487] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[5770da4ecf14:00486] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
d_buff_1: 0.280 GB
nodesPerPE: 116483, dim: 602
Preproc (ms): 3053.317
d_buff_1: 0.280 GB
nodesPerPE: 116483, dim: 602
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:215: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:215: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:register_state_ptr:93: Redundant common pointer registered, ignoring.

/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:register_state_ptr:93: Redundant common pointer registered, ignoring.

PE-1, Total (ms): 592.306
PE-0, Total (ms): 987.820
MPI time (ms) 494.359
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:1051: non-zero status: 1 Invalid context pointer passed to nvshmemx_host_finalize.

/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:nvshmemx_host_finalize:1128: aborting due to error in nvshmem_finalize 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:1051: non-zero status: 1 Invalid context pointer passed to nvshmemx_host_finalize.

/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:nvshmemx_host_finalize:1128: aborting due to error in nvshmem_finalize 

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[30047,1],1]
  Exit code:    255
--------------------------------------------------------------------------

and then it exits.
That is why the application terminates.

Hello. I enabled InfiniBand on my machine and that error disappeared,
but now it raises another error, which seems to come from cuBLAS.
I use cublasSgemm to multiply two matrices. When the dim is too large, it raises error 13.
What do I need to set so that cuBLAS can compute big tensors?

HPC-X enables HCOLL by default, and HCOLL requires an InfiniBand adapter, which is why it failed before. FYI, you can disable HCOLL by adding "-mca coll_hcoll_enable 0" to the mpirun command. Example: "mpirun -np 4 -mca coll_hcoll_enable 0 ./a.out".

I use cublasSgemm to multiply two matrices. When the dim is too large, it raises error 13.
What do I need to set so that cuBLAS can compute big tensors?

What is “too large”?

Perhaps you need to switch to using the 64-bit interface?

Does your GPU have enough memory to hold the matrices?

Are you running under WSL2? If so, then the Windows CUDA driver could be timing out.
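On the 64-bit interface question: as far as I know, the cuBLAS that ships with CUDA 12.x exposes "_64" variants of the GEMM routines that take 64-bit dimensions, so the following is only a rough sketch under that assumption (it may not apply to a CUDA 11.8 toolkit; check your cublas_api.h):

#include <cstdint>
#include <cublas_v2.h>

// Rough sketch of the 64-bit integer cuBLAS interface (cublasSgemm_64).
// C = A * B with column-major matrices and no transposes; the dimensions
// and leading dimensions are int64_t instead of int.
void sgemm_64bit(cublasHandle_t handle, int64_t m, int64_t n, int64_t k,
                 const float *A, const float *B, float *C) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    cublasSgemm_64(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                   m, n, k,
                   &alpha,
                   A, m,   // lda = m (rows of A)
                   B, k,   // ldb = k (rows of B)
                   &beta,
                   C, m);  // ldc = m (rows of C)
}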

I just run it in a Docker container on Linux.
I wrote a demo to test:


#include <iostream>
#include <stdio.h>
#include <ctime>
#include <algorithm>

#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cublas_v2.h>
#include <cublas_api.h>
#include <cudaProfiler.h>
#include "cublas_utils.h"
using nidType = int;


using namespace std;

int main(int argc, char* argv[]){
    cudaStream_t stream;
    nvshmemx_init_attr_t attr;
    int rank, nranks;
    MPI_Comm mpi_comm = MPI_COMM_WORLD;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    attr.mpi_comm = &mpi_comm;

    // Set up NVSHMEM device.
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);
    cudaStreamCreate(&stream);

    int hidden_dim = 16;
    int dim = 602;
    int num_nodes = 10000000;
    int ldx, ldw, ldout;
    float *d_W, *d_out;
    float alpha, beta;
    cublasOperation_t transa, transb;
    cublasHandle_t cublasH;

    alpha = 1.0f;
    beta = 0.0;

    transa = CUBLAS_OP_N;
    transb = CUBLAS_OP_N;
    cublasH = NULL;

    CUBLAS_CHECK(cublasCreate(&cublasH));

    int max_dim = max(dim, hidden_dim);
    //d_W = (float *) nvshmem_malloc (k * m  * sizeof(float));
    //dW:hidden_dim*dim
    //d_out as d_B:dim*num_nodes
    //d_out: hidden_dim * num_nodes
    CUDA_CHECK(cudaMalloc((void **)&d_W, dim * hidden_dim * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_W, 0, dim * hidden_dim * sizeof(float)));
    CUDA_CHECK(cudaMalloc((void **)&d_out, num_nodes * max_dim * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_out, 0, num_nodes * max_dim * sizeof(float)));
    ldx = dim, ldw = hidden_dim, ldout = hidden_dim;
    MPI_Barrier(MPI_COMM_WORLD);
    CUBLAS_CHECK(cublasSgemm(cublasH, transa, transb, hidden_dim, num_nodes, dim,
                             &alpha, d_W, ldw, d_out, ldx, &beta,
                             d_out, ldout));

    cudaFree(d_W);
    cudaFree(d_out);

    nvshmem_finalize();
    MPI_Finalize();

    return 0;
}

It is a demo that implements d_out = d_W * d_out, like a forward pass in deep learning.
I change num_nodes to change the size of d_out.
When I set num_nodes to 2500, it runs correctly.
However, when I change num_nodes to 10000000, it raises:

cublas error 13 at /pipegnn/src/mgg_test.cu:65
terminate called after throwing an instance of 'std::runtime_error'
  what():  cublas error
[powerleader:04808] *** Process received signal ***
[powerleader:04808] Signal: Aborted (6)
[powerleader:04808] Signal code:  (-6)
[powerleader:04808] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fdda43a1420]
[powerleader:04808] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fdda3e8a00b]
[powerleader:04808] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fdda3e69859]
[powerleader:04808] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e8d1)[0x7fdda42438d1]
[powerleader:04808] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c)[0x7fdda424f37c]
[powerleader:04808] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7)[0x7fdda424f3e7]
[powerleader:04808] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa699)[0x7fdda424f699]
[powerleader:04808] [ 7] build1/mggTest(+0x1913d)[0x56186765a13d]
[powerleader:04808] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fdda3e6b083]
[powerleader:04808] [ 9] build1/mggTest(+0x1834e)[0x56186765934e]
[powerleader:04808] *** End of error message ***
cublas error 13 at /pipegnn/src/mgg_test.cu:65
terminate called after throwing an instance of 'std::runtime_error'
  what():  cublas error
[powerleader:04807] *** Process received signal ***
[powerleader:04807] Signal: Aborted (6)
[powerleader:04807] Signal code:  (-6)
[powerleader:04807] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f286ae05420]
[powerleader:04807] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f286a8ee00b]
[powerleader:04807] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f286a8cd859]
[powerleader:04807] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e8d1)[0x7f286aca78d1]
[powerleader:04807] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c)[0x7f286acb337c]
[powerleader:04807] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7)[0x7f286acb33e7]
[powerleader:04807] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa699)[0x7f286acb3699]
[powerleader:04807] [ 7] build1/mggTest(+0x1913d)[0x55881b62e13d]
[powerleader:04807] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f286a8cf083]
[powerleader:04807] [ 9] build1/mggTest(+0x1834e)[0x55881b62d34e]
[powerleader:04807] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node powerleader exited on signal 6 (Aborted)

I run it on an A800 80GB and an A100 80GB, and there are no other programs on the GPUs. I watched nvidia-smi while it was running and noticed that the maximum GPU memory used was only 7449 MB.
What is the reason?

This looks to be an integer overflow, since the value of "num_nodes * max_dim" is greater than INT_MAX. To fix it, change the declaration of "num_nodes" to a 64-bit integer, e.g. "long", "int64_t" or "size_t". This promotes the size computation to 64 bits and allows for sizes above 2 GB.
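For example, here is a minimal stand-alone sketch of that change (hypothetical code, not your application; it only exercises the size computation and the allocation, and assumes the GPU has room for the roughly 24 GB buffer):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Declaring the count as a 64-bit type means the byte-size expression
    // below is evaluated in 64-bit arithmetic instead of wrapping around
    // in 32-bit "int" math.
    size_t num_nodes = 10000000;   // was: int num_nodes = 10000000;
    size_t max_dim   = 602;        // max(dim, hidden_dim) in the demo

    size_t bytes = num_nodes * max_dim * sizeof(float);   // ~24 GB, as intended
    printf("requesting %zu bytes (%.1f GB)\n", bytes, bytes / 1e9);

    float *d_out = nullptr;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_out, 0, bytes);
    cudaFree(d_out);
    return 0;
}

Note that cublasSgemm itself still takes 32-bit ints for m/n/k, which is fine here since 10,000,000 fits in an int; only the byte-size arithmetic needed widening.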

Hope this helps,
Mat


Thanks!
I didn't expect it to be for this reason; I am not familiar enough with C++.
Your reply has been of great help to me. Thank you very much.

My NVSHMEM suddenly fails to finalize.
I used the Attribute-Based Initialization Example from Examples — NVSHMEM 2.10.1 documentation to test it.
It raises:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:register_state_ptr:93: Redundant common pointer registered, ignoring.

/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:register_state_ptr:93: Redundant common pointer registered, ignoring.

1: received message 0
0: received message 1
/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:1051: non-zero status: 1 Invalid context pointer passed to nvshmemx_host_finalize.

/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:nvshmemx_host_finalize:1128: /dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:1051: non-zero status: 1 Invalid context pointer passed to nvshmemx_host_finalize.

/dvs/p4/build/sw/rel/gpgpu/toolkit/r11.8/main_nvshmem/src/host/init/init.cu:nvshmemx_host_finalize:1128: aborting due to error in nvshmem_finalize 

aborting due to error in nvshmem_finalize 

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[2802,1],1]
  Exit code:    255
--------------------------------------------------------------------------

While I try to do my best, I'm not an expert in using NVSHMEM, so I'd suggest you ask this question over on the GPU-Accelerated Libraries forum: GPU-Accelerated Libraries - NVIDIA Developer Forums

OK. Thank you for your help.
