NVSHMEM runtime error when running the cuFFTMp example code?

I am testing cuFFTMp on WSL, where I have installed the newest version of the HPC SDK, and I encountered the following error when testing the example code from NVIDIA/CUDALibrarySamples: CUDA Library Samples (github.com):

Hello from rank 0/2 using GPU 0
Hello from rank 1/2 using GPU 1
src/init/init.cu:766: non-zero status: 7 nvshmemi_common_init failed ...src/init/init_device.cu:nvshmemi_check_state_and_init:55: nvshmem initialization failed, exiting
src/util/cs.cpp:21: non-zero status: 16: Bad file descriptor, exiting... mutex destroy failed
src/init/init.cu:766: non-zero status: 7 nvshmemi_common_init failed ...src/init/init_device.cu:nvshmemi_check_state_and_init:55: nvshmem initialization failed, exiting
src/util/cs.cpp:21: non-zero status: 16: Bad file descriptor, exiting... mutex destroy failed
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[52101,1],0]
  Exit code:    255
--------------------------------------------------------------------------
make: *** [Makefile:18: run] Error 255

This is from the reshape sample, and a similar error happens in all the other examples in the cuFFTMp/samples folder. I run these examples with the default make run, and I have two 4090 devices in my computer.
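
(For reference, the make run target launches the sample through mpirun, as the abort message above shows. Assuming the built binary is called reshape, the equivalent direct invocation should be something like

NVSHMEM_DEBUG=INFO mpirun -n 2 ./reshape

where NVSHMEM_DEBUG=INFO turns on NVSHMEM's debug logging and may show more detail about where initialization fails.)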

Hi batman216,

I don’t have access to a WSL2 system with multiple GPUs, so I can’t recreate this error. However, I did test on a native Linux system and it works fine.

Have you tried running other example OpenACC or CUDA codes?

A common issue on WSL2 is that the CUDA driver is located in a non-system directory, “/usr/lib/wsl/lib”, so programs fail because they can’t find the driver. The solution is to set the LD_LIBRARY_PATH environment variable to include this path to libcuda.so.

A quick test is to run “nvaccelinfo”. If it displays the GPU details, you’ll know that libcuda.so is being found.
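
As a concrete sketch of the above (assuming the standard WSL2 driver location), the check would look something like this before re-running the sample:

export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
nvaccelinfo

If nvaccelinfo prints the device details for both GPUs, libcuda.so is being found.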

-Mat

Hi, thanks for the reply.

I have tested a lot of CUDA/OpenACC programs on WSL2, and they all work fine.

I found in the documentation that NVSHMEM only supports data center GPUs, while I am using two 4090 GPUs. Is that the problem?

Highly likely, but this is out of my area of expertise so I don’t know for sure.

Per NVSHMEM’s release notes: https://docs.nvidia.com/nvshmem/release-notes/release-290.html#release-290

Only V100, A100, and H100 are listed. Also, for PCIe cards, it appears InfiniBand or UCX needs to be installed.

My understanding is that GPUDirect is also needed, which I don’t think RTX devices have.
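
One way to narrow this down, independent of cuFFTMp, is a minimal NVSHMEM bootstrap test. The sketch below is only an illustration, assuming the NVSHMEM and MPI headers and libraries that ship with the HPC SDK are on the include and link paths; it initializes NVSHMEM over the existing MPI communicator, prints one line per PE, and shuts down. If this already fails with the same nvshmemi_common_init error, the problem is NVSHMEM support on this GPU/WSL2 combination rather than anything in cuFFTMp.

/* minimal_nvshmem_init.c: hypothetical standalone check, not part of the samples */
#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* one GPU per rank, matching the cuFFTMp samples */
    cudaSetDevice(rank);

    /* bootstrap NVSHMEM on top of the existing MPI communicator */
    MPI_Comm comm = MPI_COMM_WORLD;
    nvshmemx_init_attr_t attr;
    attr.mpi_comm = &comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

    printf("PE %d of %d: NVSHMEM initialized\n",
           nvshmem_my_pe(), nvshmem_n_pes());

    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}

Launched the same way as the samples (for example with mpirun -n 2), it should print one line per PE if initialization succeeds.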
