This is a bit out of my area so I’m not 100% sure, but according to the GPU Direct Docs:
GPUDirect RDMA is available on both Tesla and Quadro GPUs.
Hence I question whether it's really using GPU Direct on the 3070, given it's a GeForce RTX card. I'd run the program under Nsight Systems with MPI tracing enabled to see if the data is being staged back through the host rather than transferred directly between the devices.
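As a rough sketch (assuming HPC-X's Open MPI and a reasonably recent Nsight Systems; the rank count and ./your_app are placeholders):

mpirun -np 2 nsys profile --trace=cuda,mpi --output=rank%q{OMPI_COMM_WORLD_RANK} ./your_app

If the timeline shows host-to-device/device-to-host memcpys bracketing the MPI calls instead of direct device-to-device transfers, the buffers are being bounced through host memory.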
Again I'm not positive, but I wouldn't think this would cause the HPCX segv; I'd expect it to fall back to the host. Though you're using WSL, so maybe?
I've had issues with HPCX and CUDA Aware MPI before (which I've reported to the HPCX team), but my typical workaround is to change the transport via the following environment variables:
UCX_TLS=self,shm,cuda
UCX_MEMTYPE_CACHE=n
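As an example of passing these through to all ranks with HPC-X's Open MPI mpirun (the rank count and ./your_app are placeholders):

mpirun -np 2 -x UCX_TLS=self,shm,cuda -x UCX_MEMTYPE_CACHE=n ./your_app

UCX_TLS=self,shm,cuda limits UCX to the loopback, shared-memory, and CUDA transports (so no InfiniBand/GPU Direct path), and UCX_MEMTYPE_CACHE=n disables the pointer memory-type cache, which has been a common workaround for CUDA-related UCX issues.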
Not sure this will work for you, but the different transports are documented at: Frequently Asked Questions — OpenUCX documentation
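If you want to see which transports UCX actually detects on your system, the ucx_info tool that ships with UCX (and, I believe, with the HPC-X install) can list them:

ucx_info -d
ucx_info -c

-d prints the available devices and transports, and -c prints the current values of all the UCX_* configuration variables.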
Also, after looking at the Known Issues for HPCX, another thing to try is setting:
UCX_IB_GPU_DIRECT_RDMA=n
This disables GPU Direct, so you wouldn't see much benefit from CUDA Aware MPI, but if the 3070 doesn't support GPU Direct anyway and this gets you past the error, then it should be ok.
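If you go that route, the same kind of launch line works (again, rank count and ./your_app are placeholders):

mpirun -np 2 -x UCX_IB_GPU_DIRECT_RDMA=n ./your_app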