Trouble with a simple mpif90

Hello!

I just updated to 23.11 and am now running into some errors. Here is the output.

erin1@eacer:~$ mpif90 baby.f90
erin1@eacer:~$ mpirun -np 4 a.out
[LOG_CAT_ML] Unable to get list of available IB devices (ibv_get_device_list failed)
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] Unable to get list of available IB devices (ibv_get_device_list failed)
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] Unable to get list of available IB devices (ibv_get_device_list failed)
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] Unable to get list of available IB devices (ibv_get_device_list failed)
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[eacer:03567] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[eacer:03566] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[eacer:03573] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[eacer:03568] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
            1  I am a worker
            3  I am a worker
            0  I am the boss!
            2  I am a worker

My Fortran program is very simple.

    program baby
      implicit none
      include 'mpif.h'
      integer rank, size, ierror, i, np

      ! Start MPI, then query the communicator size and this process's rank
      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

      ! Rank 0 is the boss; every other rank is a worker
      if (rank .eq. 0) then
        print *, rank, " I am the boss!"
      else
        print *, rank, " I am a worker"
      end if

      call MPI_FINALIZE(ierror)
    end program baby

Any suggestions would be much appreciated.
Sincerely,
Erin

This is not a real error; note that your program still ran and printed its output.
HPC-X enables HCOLL by default, and HCOLL requires an InfiniBand adapter.
You can disable HCOLL and get rid of these warnings/errors with:

mpirun -np 4 -mca coll_hcoll_enable 0 ./a.out
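
If you would rather not add the flag to every command, the same setting can go in the environment. This is a minimal sketch, assuming a bash-like shell: Open MPI reads MCA parameters from environment variables named OMPI_MCA_<parameter>, so this should be equivalent to the mpirun line above. The guard around mpirun is only there so the snippet does nothing when mpirun or ./a.out is missing.

```shell
# Same effect as passing "-mca coll_hcoll_enable 0" on the mpirun command line:
# Open MPI picks up MCA parameters from OMPI_MCA_<name> environment variables.
export OMPI_MCA_coll_hcoll_enable=0

# Launch only if mpirun and the compiled program are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -x ./a.out ]; then
    mpirun -np 4 ./a.out
fi
```

Adding the export line to ~/.bashrc should make the setting persist across sessions.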

Awesome sauce!

Thank you so much!

Is this something new, please? I have been using both CUDA and MPI via NVIDIA for lo these many years, and have never seen this before.

Thanks again,
Erin

This is simply an Open MPI that was pre-built for InfiniBand complaining that your system doesn’t have InfiniBand. Note that the program actually runs fine. You could use the nompi version of the compiler and mix it with a locally installed Open MPI if you like.

I think in 23.11 the default Open MPI is the one from HPC-X (which has HCOLL enabled by default). Before 23.11, there were two other Open MPI builds (3.x and 4.x), in which HCOLL was not enabled.

Hello!

How would I go about using the other MPI compiler, please? I did download/install OpenMPI.

Thank you

You can load the nvidia module and set OMPI_CC and the related environment variables. Or, probably better, build your own Open MPI using the nvidia compilers; then the underlying compile/link flags should all sync up nicely. Although, as Mat noted, things are actually working: Open MPI is just complaining about what you have on your system, and you can tell Open MPI to shut up.
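
As a rough sketch of the first option, assuming a bash-like shell and that the locally installed Open MPI wrappers (mpicc, mpif90 and so on) are first on your PATH: the Open MPI wrappers honor OMPI_CC, OMPI_CXX and OMPI_FC to choose the underlying compiler. The names nvc, nvc++ and nvfortran are the HPC SDK compilers, and baby.f90 is the file from the first post.

```shell
# Tell the Open MPI wrapper compilers which underlying compilers to invoke.
export OMPI_CC=nvc
export OMPI_CXX=nvc++
export OMPI_FC=nvfortran

# --showme prints the command the wrapper would run, without compiling anything.
if command -v mpif90 >/dev/null 2>&1; then
    mpif90 --showme baby.f90 || echo "this mpif90 does not look like an Open MPI wrapper"
fi
```

If the --showme output starts with nvfortran, the wrapper has picked up the override. An Open MPI built with a different Fortran compiler may ship module files that nvfortran cannot read, which is one reason building Open MPI with the nvidia compilers is the cleaner route.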

Hi Erin,

To use one of the other MPIs we ship, set your PATH to point to either the “openmpi4/bin” or “openmpi/bin” directory under the “/base/path/Linux_x86_64/23.11/comm_libs” directory. The “mpi” directory is just a link to “hpcx”. Adjust the base directory to match your installation.
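
For example, something like the following, assuming a bash-like shell. NVHPC_BASE is a made-up variable name and /base/path is only the placeholder from the paragraph above, so substitute your real install location.

```shell
# Hypothetical base directory: replace with where the HPC SDK is actually installed.
NVHPC_BASE=/base/path
COMM_LIBS="$NVHPC_BASE/Linux_x86_64/23.11/comm_libs"

# Put the openmpi4 build ahead of whatever MPI is currently on PATH.
export PATH="$COMM_LIBS/openmpi4/bin:$PATH"

# Confirm which wrapper and launcher the shell now resolves.
command -v mpif90 || echo "mpif90 not found; check NVHPC_BASE"
command -v mpirun || echo "mpirun not found; check NVHPC_BASE"
```

Both commands should print paths under comm_libs/openmpi4/bin. If they point somewhere else, or nothing is found, the base directory is probably wrong.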

-Mat

Sorry for being dense. I am using WSL2 with Ubuntu 22.04 on a Windows laptop with a GeForce 3070 card.

I have tried all sorts of variations on the theme of using nvfortran and pointing to my /usr/local directories, but nothing seems to work.

Thank you

Did you install in /usr/local? The default is /opt/nvidia/hpc_sdk/...
