I am testing cuFFTMp on WSL, where I have installed the newest version of the HPC SDK. I encountered the following error when running the example code from NVIDIA/CUDALibrarySamples: CUDA Library Samples (github.com):
Hello from rank 0/2 using GPU 0
Hello from rank 1/2 using GPU 1
src/init/init.cu:766: non-zero status: 7 nvshmemi_common_init failed ...src/init/init_device.cu:nvshmemi_check_state_and_init:55: nvshmem initialization failed, exiting
src/util/cs.cpp:21: non-zero status: 16: Bad file descriptor, exiting... mutex destroy failed
src/init/init.cu:766: non-zero status: 7 nvshmemi_common_init failed ...src/init/init_device.cu:nvshmemi_check_state_and_init:55: nvshmem initialization failed, exiting
src/util/cs.cpp:21: non-zero status: 16: Bad file descriptor, exiting... mutex destroy failed
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[52101,1],0]
Exit code: 255
--------------------------------------------------------------------------
make: *** [Makefile:18: run] Error 255
This is the output of the reshape sample, and a similar error occurs in all the other examples in the cuFFTMp/samples folder. I am running these examples with the default make run. My computer has two 4090 GPUs.