Compute node provisioning failed: "Installer Unreachable"

Hi folks,

My cluster was powered off due to a lab outage. After powering on the head node, I tried to bring the compute nodes up, but they are all stuck at the screen shown in the attached screenshot. Please let me know if you need more information.

[Screenshot: bright-node-stuck]

I tried the link below, but it's still not working.

Please refer to the output below as well.

Wed Sep 28 15:03:13 2022 [notice] bright88: node002 [ INSTALLER_UNREACHABLE ] (ldlinux.c32 from bright88)
[bright88]% category use default
[bright88->category[default]]% set bootloaderprotocol tftp
[bright88->category*[default*]]% commit
[bright88->category[default]]%
Wed Sep 28 15:07:02 2022 [notice] bright88: Service dhcpd was restarted
[bright88->category[default]]% show
Parameter                        Value
-------------------------------- -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Name                             default
Nodes                            2
Revision
Use exclusively for
User node login                  ALWAYS
Filesystem mounts                <6 in submode>
Time zone                        Asia/Seoul (base)
Roles                            <0 in submode>
Static routes                    <0 in submode>
GPU Settings                     <0 in submode>
Filesystem exports               <0 in submode>
Services                         <0 in submode>
Kernel modules                   72 (software image:default-image)
Default gateway                  192.168.61.88 (network: internalnet)
Management network               internalnet
Software image                   default-image
Node installer disk              no
Install boot record              no
Install mode                     AUTO
New node install mode            FULL
Name servers
Time servers
Search domain
Exclude list full install        <234B>
Exclude list sync install        <1.33KiB>
Exclude list update              <4.2KiB>
Exclude list grab                <1.70KiB>
Exclude list grab new            <1.35KiB>
Exclude list manipulate script   <0B>
Initialize script                <0B>
Finalize script                  <3.4KiB>
Data node                        no
Allow networking restart         no
Version config files             no
Kernel version                   3.10.0-1160.11.1.el7.x86_64 (software image:default-image)
Kernel parameters                biosdevname=0 net.ifnames=0 nonm acpi=on nicdelay=0 rd.driver.blacklist=nouveau xdriver=vesanamespace.unpriv_enable=1 user_namespace.enable=1 (software image:default-image)
Kernel output console            tty0 (software image:default-image)
IO scheduler
Boot loader                      syslinux
Boot loader protocol             TFTP
Boot loader file
FIPS                             no
BMC Settings                     <submode>
SELinux Settings                 <submode>
Disk setup                       <1.07KiB>
Hardware RAID configuration      <0B>
Scaling governor
BIOS setup                       <0B>
Provisioning associations        <1 internally used>
Notes                            <0B>

Hi,

From the screenshot, I see that the node has already booted and mounted /cm/node-installer from the head node. At the stage “Running the node-installer”, the node runs “/scripts/node-installer”. “INSTALLER_UNREACHABLE” may indicate that the head node has lost its connection to the compute node. Did you try to ssh into the node and check /var/log/node-installer? I also see that the MTU is set to 9000; perhaps the MTU needs to be lowered?

Perhaps you can open a support ticket at “Bright Computing Support Form”? Then it will be easier to submit logs and screenshots.

Kind regards,
adel

Hi Adel

The compute nodes respond to ping, but SSH is not going through. Please see below:

[root@bright88 var]# ping 192.168.61.89
PING 192.168.61.89 (192.168.61.89) 56(84) bytes of data.
64 bytes from 192.168.61.89: icmp_seq=1 ttl=64 time=0.118 ms
64 bytes from 192.168.61.89: icmp_seq=2 ttl=64 time=0.058 ms
64 bytes from 192.168.61.89: icmp_seq=3 ttl=64 time=0.119 ms
^C
--- 192.168.61.89 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms
rtt min/avg/max/mdev = 0.058/0.098/0.119/0.029 ms
[root@bright88 var]# ping 192.168.61.90
PING 192.168.61.90 (192.168.61.90) 56(84) bytes of data.
64 bytes from 192.168.61.90: icmp_seq=1 ttl=64 time=0.187 ms
64 bytes from 192.168.61.90: icmp_seq=2 ttl=64 time=0.139 ms
^C
--- 192.168.61.90 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.139/0.163/0.187/0.024 ms
[root@bright88 var]#
[root@bright88 var]#
[root@bright88 var]# ssh 192.168.61.89
ssh: connect to host 192.168.61.89 port 22: Connection refused
[root@bright88 var]# ssh 192.168.61.90
ssh: connect to host 192.168.61.90 port 22: Connection refused

Also, I have an Easy 8 license key, so no commercial support. I opened a support request, but they directed me to this forum.
Please let me know if you need more information.
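
A side note on the output above: "Connection refused" normally means the host answered the TCP connection attempt but nothing is listening on port 22, which is different from a timeout where no answer comes back at all. Here is a minimal Python sketch to tell the two apart. The function name probe_tcp and the 3-second timeout are just illustrative choices, not part of any Bright tooling.

```python
import socket


def probe_tcp(host, port=22, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', or 'unreachable'.

    'refused' means the host replied but no service accepted the connection;
    'unreachable' covers timeouts and routing errors.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        # socket.timeout, "no route to host", and similar errors land here
        return "unreachable"


if __name__ == "__main__":
    # Node addresses taken from the ping output above
    for node in ("192.168.61.89", "192.168.61.90"):
        print(node, probe_tcp(node))
```

A "refused" result only says that no SSH daemon is accepting connections on the node at that moment; it does not by itself explain why the node-installer is stuck.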

Hi,

Perhaps you could connect to the console of the node and check in a separate emergency shell (alt+2->12) why the node is stuck at “Running the node-installer”?

At the stage “Running the node-installer” the node-installer will run “chroot /installer_root /linuxrc”. The /installer_root is an NFS mount on :, so I would start by checking that the mount point is ok and is readable. Perhaps you can also use a lower MTU.

If you run “ping -s 8972 -M do ” (8972 bytes of ICMP payload plus 28 bytes of IP and ICMP headers adds up to a 9000-byte packet) then you’ll get an indication whether the MTU value is good or needs adjustment.
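
One thing to keep in mind with that ping test: on Linux, "-s" sets only the ICMP payload size, and the IPv4 header (20 bytes, without options) and the ICMP header (8 bytes) are added on top. So to probe a path for a given MTU, the payload should be the MTU minus 28. Here is a small Python sketch of that arithmetic. The helper names are made up for illustration, and the target address is simply one of the compute nodes from this thread.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def icmp_payload_for_mtu(mtu):
    """Return the largest ping payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError("MTU must be larger than %d bytes" % overhead)
    return mtu - overhead


def ping_command(target, mtu, count=3):
    """Build a Linux ping command that forbids fragmentation (-M do)."""
    return ["ping", "-c", str(count), "-M", "do",
            "-s", str(icmp_payload_for_mtu(mtu)), target]


if __name__ == "__main__":
    print(icmp_payload_for_mtu(9000))  # 8972
    print(icmp_payload_for_mtu(1500))  # 1472
    print(" ".join(ping_command("192.168.61.89", 9000)))
```

If the 8972-byte probe fails while the 1472-byte one succeeds, something on the path is not passing jumbo frames.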

Connecting a compute node back-to-back with the head node can be useful to test if the switch in between is causing any unexpected issues.

Kind regards,
adel

It was the MTU, Adel!!! :) But how come it was working before with MTU 9000? I used this cluster for 40 days and rebooted the compute machines multiple times.

All devices, end-to-end, need to have the same MTU setting. Now that you know the MTU is the issue, you can investigate the settings on the switch, perhaps?
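
As a starting point on the host side, here is a rough Python sketch that lists the MTU of every local interface and flags any that differ from the expected value. It assumes a Linux host, where the kernel exposes each interface's MTU in /sys/class/net/<interface>/mtu. The function names and the expected value of 9000 are only for illustration.

```python
import os


def read_local_mtus(sysfs_root="/sys/class/net"):
    """Return {interface_name: mtu} for every interface found under sysfs_root."""
    mtus = {}
    for iface in sorted(os.listdir(sysfs_root)):
        mtu_file = os.path.join(sysfs_root, iface, "mtu")
        # Skip entries that are not interfaces (e.g. plain files in the directory)
        if os.path.isfile(mtu_file):
            with open(mtu_file) as handle:
                mtus[iface] = int(handle.read().strip())
    return mtus


def mtu_mismatches(mtus, expected):
    """Return only the interfaces whose MTU differs from the expected value."""
    return {name: mtu for name, mtu in mtus.items() if mtu != expected}


if __name__ == "__main__":
    local = read_local_mtus()
    print("all interfaces:", local)
    print("not at 9000:", mtu_mismatches(local, 9000))
```

This only covers the machine it runs on, and the loopback interface will normally show up as a mismatch. The switch ports and the compute nodes each have to be checked separately.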

Glad that it’s working, though!

BR,
kw

Hey K, the network stack is all 100G: Mellanox switches with jumbo frames enabled, plus ConnectX-5 adapters.

Very odd. Something in the mix doesn’t have jumbo frames enabled. I would offer that you should check everything, even if you “know” that it’s set to 9000. Make 100% sure.

Kind regards,

Ken Woods
Manager, Nvidia Bright Cluster Manager Support