Compute node does not recognize InfiniBand cable (network cable unplugged issue)

Hello. First of all, I hope you are staying healthy during this difficult time with COVID-19.

We are currently building an InfiniBand network on a Windows Server cluster (Windows Server 2008 R2 SP1) of approximately 100 nodes. During the process, we have run into a serious technical difficulty: the compute node does not recognize the cable.

(Perhaps this is a problem that experienced people with the relevant knowledge can solve very easily.)

The current progress and the problems we are experiencing are as follows.

(So far, only three nodes are being tested: the AD (Active Directory) server, the head node (LC300), and the compute node (LC301).)

1. Status before joining the local domain

a. Installed the driver and set the IP address on the head node (LC300).

b. Installed the driver and set the IP address on the compute node (LC301).

c. Started the subnet manager service on the head node (LC300).

(Error) After this step, we expected the network on both computers to change from "Network cable unplugged" to "Unidentified network". However, only the head node shows an unidentified network; the compute node still shows "Network cable unplugged".

However, the LEDs on the back of the two nodes show that when the subnet manager service is turned on at the head node, the data send/receive LEDs flash simultaneously, so the nodes appear to be connected.

2. Status after joining the local domain (I just tried this.)

a. Joined the head node (LC300) to the local domain.

b. Set up the secondary AD server role on the head node (LC300).

c. Joined the compute node (LC301) to the local domain.

(Error) As expected, the head node changed from an unidentified network to a local network, but the compute node still shows "Network cable unplugged". The LEDs are in the same state as mentioned before.

All other settings, such as node-to-switch and switch-to-switch connectivity, IP settings, AD or DNS servers, routing server, firewall off, etc., are working normally.

We have no way of telling whether there is a problem somewhere else, for example in the OpenSM setup.
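One way to narrow this down might be to compare the InfiniBand port state on each node with what Windows reports. The sketch below is only an illustration: it assumes the `ibstat` diagnostic is installed and on the PATH of each node (not verified for this WinOF package), and the helper names `parse_ibstat`, `diagnose`, and `run_ibstat` are made up for this example. It reads the `State` and `Physical state` fields of each port. As far as I understand, a port with physical state `LinkUp` that is not `Active` usually means the cable is detected but no subnet manager has configured the port yet, while any other physical state points to the cable or the port itself.

```python
import subprocess


def parse_ibstat(text):
    """Extract CA name, port number, State and Physical state from ibstat text."""
    ports = []
    ca = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("CA '"):
            # Header line such as: CA 'mlx4_0'
            ca = line.split("'")[1]
        elif line.startswith("Port ") and line.endswith(":"):
            # Header line such as: Port 1:
            current = {"ca": ca, "port": int(line[5:-1]),
                       "state": None, "physical_state": None}
            ports.append(current)
        elif current is not None and line.startswith("State:"):
            current["state"] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("Physical state:"):
            current["physical_state"] = line.split(":", 1)[1].strip()
    return ports


def diagnose(port):
    """Rough interpretation of one port entry (a heuristic, not a definitive rule)."""
    if port["physical_state"] != "LinkUp":
        return "no physical link: check cable, HCA port, and switch port"
    if port["state"] == "Active":
        return "link up and configured by a subnet manager"
    return "physical link up but port not active: check that the subnet manager reaches this port"


def run_ibstat():
    """Run ibstat on the local node (assumes the tool is on the PATH)."""
    result = subprocess.run(["ibstat"], capture_output=True, text=True, check=True)
    return result.stdout
```

On each node, something like `for p in parse_ibstat(run_ibstat()): print(p, diagnose(p))` would then show whether the head node and the compute node differ at the InfiniBand level or only at the Windows network level.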

The equipment currently in use is as follows.

Card : QDR HCA MHQH19B-XTR

Driver : MLNX_VPI_WinOF-5_35_All_win2008R2_x64.exe

Switch : IS5025

Attached below are the network status of each node, the result of checking the switch ports with iblinkinfo, and the message shown when the subnet manager service is started on the head node.

I have read and studied many manuals, but I couldn't find a solution. I am asking for help here as a last resort. Thank you very much.

=================================================================

Network status of HeadNode(LC300)===================================

=================================================================

Network status of ComputeNode(LC301)================================

=================================================================

Subnet manager.txt (307 Bytes)

iblinkinfo.txt (9.64 KB)

Hi,

I will try to assist. First, could you please provide the following:

  1. Network topology: what is connected to what?

  2. Switch, network adapter, and cable type and P/N?

  3. Which node is the SM running on?

Thanks,

Samer

Hi,

Are you still encountering the issue?

Thanks,

Samer