I am running this in a Docker container on Linux. I wrote a demo to test:
#include <iostream>
#include <stdio.h>
#include <ctime>
#include <algorithm>
#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cublas_v2.h>
#include <cublas_api.h>
#include <cudaProfiler.h>
#include "cublas_utils.h"

using nidType = int;
using namespace std;

int main(int argc, char* argv[]) {
    cudaStream_t stream;
    nvshmemx_init_attr_t attr;
    int rank, nranks;
    MPI_Comm mpi_comm = MPI_COMM_WORLD;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    attr.mpi_comm = &mpi_comm;

    // Set up the NVSHMEM device: one PE per GPU on the node.
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);
    cudaStreamCreate(&stream);

    int hidden_dim = 16;
    int dim = 602;
    int num_nodes = 10000000;
    int ldx, ldw, ldout;
    float *d_W, *d_out;
    float alpha, beta;
    cublasOperation_t transa, transb;
    cublasHandle_t cublasH;

    alpha = 1.0f;
    beta = 0.0f;
    transa = CUBLAS_OP_N;
    transb = CUBLAS_OP_N;
    cublasH = NULL;
    CUBLAS_CHECK(cublasCreate(&cublasH));

    int max_dim = max(dim, hidden_dim);
    // d_W = (float *) nvshmem_malloc(k * m * sizeof(float));
    // d_W:        hidden_dim x dim
    // d_out as B: dim x num_nodes
    // d_out as C: hidden_dim x num_nodes
    CUDA_CHECK(cudaMalloc((void **)&d_W, dim * hidden_dim * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_W, 0, dim * hidden_dim * sizeof(float)));
    CUDA_CHECK(cudaMalloc((void **)&d_out, num_nodes * max_dim * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_out, 0, num_nodes * max_dim * sizeof(float)));

    ldx = dim, ldw = hidden_dim, ldout = hidden_dim;
    MPI_Barrier(MPI_COMM_WORLD);
    CUBLAS_CHECK(cublasSgemm(cublasH, transa, transb, hidden_dim, num_nodes, dim,
                             &alpha, d_W, ldw, d_out, ldx, &beta,
                             d_out, ldout));

    cudaFree(d_W);
    cudaFree(d_out);
    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}
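The source is /pipegnn/src/mgg_test.cu (the path in the error message below), compiled with nvcc and linked against cuBLAS, MPI, and NVSHMEM; the exact flags depend on the image. I launch it with two ranks, matching the two PIDs in the log:

mpirun -np 2 build1/mggTest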
It is a demo that implements d_out = d_W * d_out, like a forward pass in deep learning; the cublasSgemm call computes C = alpha * op(A) * op(B) + beta * C with m = hidden_dim, n = num_nodes, k = dim. I change num_nodes to change the size of d_out.
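For scale, this is the d_out footprint the code is meant to request at each setting (a separate back-of-envelope check, computing the byte count in 64 bits; d_out_bytes is my own name and is not in the demo):

// Intended d_out size: num_nodes * max_dim floats, max_dim = max(602, 16) = 602.
// num_nodes = 2500:     2500ULL * 602 * 4 bytes     ~  6.0 MB
// num_nodes = 10000000: 10000000ULL * 602 * 4 bytes ~ 24.1 GB
size_t d_out_bytes = (size_t)num_nodes * (size_t)max_dim * sizeof(float);
printf("requesting %zu bytes for d_out\n", d_out_bytes);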
When I set num_nodes to 2500, it runs correctly. However, when I change num_nodes to 10000000, it raises cublas error 13 (CUBLAS_STATUS_EXECUTION_FAILED):
cublas error 13 at /pipegnn/src/mgg_test.cu:65
terminate called after throwing an instance of 'std::runtime_error'
what(): cublas error
[powerleader:04808] *** Process received signal ***
[powerleader:04808] Signal: Aborted (6)
[powerleader:04808] Signal code: (-6)
[powerleader:04808] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fdda43a1420]
[powerleader:04808] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fdda3e8a00b]
[powerleader:04808] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fdda3e69859]
[powerleader:04808] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e8d1)[0x7fdda42438d1]
[powerleader:04808] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c)[0x7fdda424f37c]
[powerleader:04808] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7)[0x7fdda424f3e7]
[powerleader:04808] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa699)[0x7fdda424f699]
[powerleader:04808] [ 7] build1/mggTest(+0x1913d)[0x56186765a13d]
[powerleader:04808] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fdda3e6b083]
[powerleader:04808] [ 9] build1/mggTest(+0x1834e)[0x56186765934e]
[powerleader:04808] *** End of error message ***
cublas error 13 at /pipegnn/src/mgg_test.cu:65
terminate called after throwing an instance of 'std::runtime_error'
what(): cublas error
[powerleader:04807] *** Process received signal ***
[powerleader:04807] Signal: Aborted (6)
[powerleader:04807] Signal code: (-6)
[powerleader:04807] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f286ae05420]
[powerleader:04807] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f286a8ee00b]
[powerleader:04807] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f286a8cd859]
[powerleader:04807] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e8d1)[0x7f286aca78d1]
[powerleader:04807] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c)[0x7f286acb337c]
[powerleader:04807] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7)[0x7f286acb33e7]
[powerleader:04807] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa699)[0x7f286acb3699]
[powerleader:04807] [ 7] build1/mggTest(+0x1913d)[0x55881b62e13d]
[powerleader:04807] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f286a8cf083]
[powerleader:04807] [ 9] build1/mggTest(+0x1834e)[0x55881b62d34e]
[powerleader:04807] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node powerleader exited on signal 6 (Aborted)
I ran it on an A800 80GB and an A100 80GB, with no other programs using the GPUs, and I watched nvidia-smi while it ran. I noticed that the maximum GPU memory used was only 7449 MB.
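To cross-check what nvidia-smi shows, the two cudaMalloc calls can be bracketed with cudaMemGetInfo (a sketch I used as a diagnostic; the variable names are mine):

// Sketch: measure how much device memory the two allocations actually consume.
size_t free_before, free_after, total;
CUDA_CHECK(cudaMemGetInfo(&free_before, &total));
// ... the two cudaMalloc calls from the demo ...
CUDA_CHECK(cudaMemGetInfo(&free_after, &total));
printf("rank %d: allocations consumed %.1f MiB\n", rank,
       (free_before - free_after) / (1024.0 * 1024.0));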
What is the reason?