What are the proper parameters to configure a rebuild of the HPC-X ompi for GPUDirect and RDMA to be included into the library?
In the document “https://www.open-mpi.org/faq/?category=runcuda”, under “GPUDirect RDMA Information”, there are commands given to see if you have GPUDirect RDMA compiled into your library:
$ ompi_info --all | grep btl_openib_have_cuda_gdr
$ ompi_info --all | grep btl_openib_have_driver_gdr
Neither of those results in “true” (the flags don’t even appear to be present).
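For what it’s worth, those btl_openib_have_* flags belong to the old openib BTL, which recent Open MPI/HPC-X builds replace with UCX, so their absence alone may not prove GPUDirect RDMA is missing. A check that should work on newer builds (assuming ompi_info and the HPC-X ucx_info are on PATH):

# Reports whether Open MPI itself was compiled with CUDA support
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

# With UCX-based builds, GPUDirect RDMA lives in UCX rather than the MPI
# library itself; check whether UCX sees CUDA-capable transports
ucx_info -d | grep -i cuda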
In the document “http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual.pdf”, you give a configuration string for recompiling openmpi for running GPUDirect RDMA:
./configure --prefix=/path/to/openmpi-1.10.0_cuda7.0 --with-wrapper-ldflags=-Wl,-rpath,/lib --disable-vt --enable-orterun-prefix-by-default --disable-io-romio --enable-picky --with-cuda=/usr/local/cuda-7.0
It appears that “--with-cuda” is the key flag there, even though the versions are out of date. That configure line also targets a trunk build of openmpi.
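For a rebuild against a current HPC-X, something along these lines seems like a reasonable starting point; the prefix and CUDA paths are placeholders for my installation, and pointing --with-ucx at the HPC-X-bundled UCX is an assumption on my part, since the CUDA/GPUDirect-aware transports appear to live there:

# Sketch only: paths are hypothetical, adjust to the local install
./configure --prefix=/opt/openmpi-cuda \
    --with-cuda=/usr/local/cuda \
    --with-ucx=$HPCX_HOME/ucx \
    --enable-orterun-prefix-by-default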
I have gpudirect and gdrcopy both properly installed.
In both the default HPC-X installation and in my build the config.status file shows “mpi_build_with_cuda_support” as true.
Is this just a mix of out of date information?
Does that flag (“mpi_build_with_cuda_support”) indicate that GPUDirect and RDMA are correctly configured into the ompi build?
How/where does one verify that?
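One runtime check I have seen suggested (rather than relying on build flags alone) is to force the UCX PML and raise UCX’s log level to see which transports get selected; this assumes an HPC-X environment is loaded and a CUDA-aware MPI test binary is available:

# Look for cuda_copy / gdr_copy / rc entries in the transport-selection log
mpirun -np 2 --mca pml ucx -x UCX_LOG_LEVEL=info ./your_cuda_mpi_app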
The HPC-X docs discuss it, but they end up giving only generalities: