This is a bit out of my area so I’m not 100% sure, but according to the GPU Direct Docs:
GPUDirect RDMA is available on both Tesla and Quadro GPUs.
Hence I question whether it's really using GPU Direct on the 3070, given it's a GeForce RTX card. I'd run the program under Nsight Systems with MPI tracing enabled to see if the data is being staged back through the host rather than transferred directly between the devices.
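As a rough sketch (assuming HPC-X's Open MPI and a reasonably recent Nsight Systems; the rank count and ./your_app are placeholders):

mpirun -np 2 nsys profile --trace=cuda,mpi --output=rank%q{OMPI_COMM_WORLD_RANK} ./your_app

If the timeline shows host-to-device/device-to-host memcpys bracketing the MPI calls instead of direct device-to-device transfers, the buffers are being bounced through host memory.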
Again I'm not positive, but I wouldn't think this would cause the HPCX segv; I'd expect it to fall back to the host. Though you're using WSL, so maybe?
I've had issues with HPCX and CUDA Aware MPI before (which I've reported to the HPCX team), but my typical workaround is to change the transport via the following environment variables:
UCX_TLS=self,shm,cuda
UCX_MEMTYPE_CACHE=n
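As an example of passing these through to all ranks with HPC-X's Open MPI mpirun (the rank count and ./your_app are placeholders):

mpirun -np 2 -x UCX_TLS=self,shm,cuda -x UCX_MEMTYPE_CACHE=n ./your_app

UCX_TLS=self,shm,cuda limits UCX to the loopback, shared-memory, and CUDA transports (so no InfiniBand/GPU Direct path), and UCX_MEMTYPE_CACHE=n disables the pointer memory-type cache, which has been a common workaround for CUDA-related UCX issues.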
Not sure this will work for you, but the different transports are documented at: Frequently Asked Questions — OpenUCX documentation
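If you want to see which transports UCX actually detects on your system, the ucx_info tool that ships with UCX (and, I believe, with the HPC-X install) can list them:

ucx_info -d
ucx_info -c

-d prints the available devices and transports, and -c prints the current values of all the UCX_* configuration variables.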
Also, after looking at the Known Issues for HPCX, another thing to try is setting:
UCX_IB_GPU_DIRECT_RDMA=n
This disables GPU Direct, so you wouldn't see much benefit from CUDA Aware MPI, but if the 3070 doesn't support GPU Direct anyway and this gets you past the error, then it should be ok.
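If you go that route, the same kind of launch line works (again, rank count and ./your_app are placeholders):

mpirun -np 2 -x UCX_IB_GPU_DIRECT_RDMA=n ./your_app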