Problem running mpirun from head node in cluster

I went ahead and chose Red Hat Linux version 4 as the operating system, although the cluster is actually running CentOS (this is another topic in the forum). In this configuration I chose the head node to be a computation node. Since the head node has 4 processors, mpirun -np x mpihello works fine for x up to 4. Beyond that (when it needs to communicate with other nodes) it just hangs. In my last installation I had not used the head node for computation, and I could not run mpirun from the head node at all.

Now if I log on to another node, mpirun -np x mpihello works fine for x up to the maximum number of processors on the cluster.

So why is it that mpirun -np x mpihello invoked from the head node cannot contact the other nodes, while the reverse works? (I am using rsh as the communication protocol between nodes.)

Also note that something like mpirun -np 10 hostname works fine. So the problem appears to be when I make the MPI calls in mpihello.

Dear adityak,

CentOS is based on Red Hat. If you have CentOS 4, then Red Hat 4 should be your choice.

For MPI, there are a few questions regarding your MPI installation.

  1. Did you install as root or non-root?

  2. Where did you install: in a local area, or on a shared file system that the slave nodes can see?

  3. Assume you run MPICH1: what is in mpi/mpich/share/machines.LINUX? It should contain the names of all the nodes you want to run on; otherwise you will need to edit it and add the hostnames of the slave nodes, or provide a machinefile. If you installed as root, the install script should handle this, and it modifies the /etc/hosts file as well.

  4. What is nolocal in mpi/mpich/bin/mpirun.arg set to? I would recommend that you allow the head node to be part of the computation. The user can always choose not to run on the head node. To allow it, nolocal should be set to 0. If nolocal=1, you cannot run on the local machine.

  5. What is RSHCOMMAND in mpi/mpich/bin/mpirun set to?
    It is normally set to either /usr/bin/ssh or /usr/bin/rsh.

  6. Using the full path of ssh or rsh from 5) (assuming rsh/ssh is in /usr/bin), try this on the head node:

/usr/bin/ssh headnode date
/usr/bin/rsh headnode date

headnode here is the name of your head node. This name should be the same as the one that appears in machines.LINUX.

Also try this to and from the head node and the slave nodes, and among the slave nodes themselves. MPI requires that all of these work.

Do they work? If not, then there is a fundamental problem with the cluster that needs to be fixed first.
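
If there are many nodes, the check in 6) could be scripted. The following is only a rough sh/bash sketch, not something that ships with the PGI CDK: the function name check_hosts and its two arguments are made up for illustration, and it assumes the machine file lists one hostname per line, optionally followed by :N for the processor count, with # starting a comment.

```shell
#!/bin/sh
# check_hosts MACHINEFILE REMOTE_CMD   (hypothetical helper, for illustration only)
# Runs "REMOTE_CMD host date" for every host listed in MACHINEFILE and prints
# "ok host" or "FAIL host". Returns 0 only if every host answered.
check_hosts() {
    machinefile=$1
    remote_cmd=$2
    failed=0
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while read -r entry _rest || [ -n "$entry" ]; do
        # Skip blank lines and comment lines.
        case $entry in ''|'#'*) continue ;; esac
        # Strip an optional ":N" processor count (assumed file format).
        host=${entry%%:*}
        # </dev/null keeps rsh/ssh from swallowing the rest of the machine file.
        if "$remote_cmd" "$host" date </dev/null >/dev/null 2>&1; then
            echo "ok $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done < "$machinefile"
    return "$failed"
}
```

After sourcing this, something like check_hosts mpi/mpich/share/machines.LINUX /usr/bin/rsh would be run on the head node and then again on each slave node, since the connections have to work in every direction.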


Hi Hongyon,

Ok, here’s the problem. I realised that there are two mpirun scripts, and the one that I was using is in /usr/bin. If I use the script from /opt/pgi/linux86-64/6.2/mpi/mpich/bin, I get the following error:

0 - MPI_INIT : MPIRUN chose the wrong device ch_p4; program needs device

The two scripts also differ from each other. Any ideas on this?

Also, the date command works in all the combinations you mentioned, using both rsh and ssh. I think we can discuss this further once I know which mpirun script to use.


Hi Aditya,

Please use mpirun from /opt/pgi/linux86-64, also make sure to set environment variable PGI to /opt/pgi. The PGI MPI scripts rely on it. That could be the reason you get an error.

Here is an example of setting up the 64-bit environment for running MPICH1 under csh, assuming you installed PGI CDK in /opt/pgi.

% setenv PGI /opt/pgi
% setenv PATH /opt/pgi/linux86-64/6.2/bin:$PATH
% setenv PATH /opt/pgi/linux86-64/6.2/mpi/mpich/bin:$PATH
% which mpirun # check which mpirun, should come from /opt/pgi.
% which pgf90 # check PGI compiler.
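
For anyone using sh or bash rather than csh, the "which mpirun" check can also be written as a small function. This is just a sketch under the assumption that PGI is already exported; the name mpirun_under_pgi is invented here and is not part of the PGI CDK.

```shell
#!/bin/sh
# mpirun_under_pgi   (hypothetical helper, for illustration only)
# Reports whether the first mpirun found on PATH lives under the directory in $PGI.
# Returns 0 if it does, 1 if another mpirun comes first, 2 if the check cannot run.
mpirun_under_pgi() {
    if [ -z "${PGI:-}" ]; then
        echo "PGI is not set"
        return 2
    fi
    # command -v prints the path of the mpirun that the shell would execute.
    found=$(command -v mpirun) || {
        echo "no mpirun on PATH"
        return 2
    }
    case $found in
        "$PGI"/*) echo "ok: $found" ;;
        *)        echo "unexpected: $found"; return 1 ;;
    esac
}
```

If this prints "unexpected: /usr/bin/mpirun", the PGI mpich/bin directory is missing from PATH or comes too late in it.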

Make sure that you recompile and rerun your program. Also check machines.LINUX if you don’t provide a machinefile when you run your program.

I am not sure how /usr/bin/mpirun got there; it could be that somebody installed it, and possibly it was configured to use shared memory. That’s why you got an error regarding shared memory.

As Mat mentioned, PGI CDK is not configured to use shared memory. We configured it to use the ch_p4 device. This could be one of a few differences between the two mpirun scripts.


Hi Hongyon,
I did try setting the path variables as you directed and running the compilers and mpirun from /opt/pgi, but as I mentioned I get the error

0 - MPI_INIT : MPIRUN chose the wrong device ch_p4; program needs device

It asks for the ch_ipath device. Since the MPI library provided with the CDK is built using the -ch_p4 option, will it not work on an Infinipath cluster? Is getting a library built with -ch_ipath my only option? If so, how can I get that?


Hi Aditya,

Our PGI CDK is not yet configured for Infiniband clusters and does not yet support them. However, I believe you can use the MPI scripts and libraries that come with the cluster, which support Infiniband and which, you say, are in /usr/bin.

I have looked online, and it appears you can configure the MPI scripts in /usr/bin to use the PGI compilers, or any other compilers, by editing the mpicc/mpiCC/mpif90/mpif77 scripts. You might then want to set your path to point only to the PGI compilers, and not to the PGI CDK MPI, because you will be using the MPI libraries that come with the Infiniband installation, I guess. Again, I am not really sure what the scripts look like or which MPI libraries come with it, since we don’t have Infinipath in house.

We are getting an Infiniband cluster up and running in a few days. If the above information does not work for you, we will post more information that might help.