We have a few large clusters that came with Mellanox dual-port HCAs (QDR + 10GigE). Initially the clusters were set up as RoCE clusters, but we have since acquired, and continue to acquire, IB FDR fabric infrastructure.
On the cluster with the dual-port QDR+10GigE HCAs, some MPI stacks (OpenMPI 1.6.5 or 1.7.2, and Intel MPI 4.1.1) started getting confused, with communication at times stalling completely.
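For reference, what I have been considering (if that is even the right approach) is pinning each MPI stack to a single port. The device and port names below are just examples from my setup, not settings I have verified as correct:

# OpenMPI: restrict the openib BTL to port 1 of the first HCA
mpirun --mca btl openib,sm,self --mca btl_openib_if_include mlx4_0:1 ./app
# Intel MPI: use the OFA fabric on one adapter and one port only
export I_MPI_FABRICS=shm:ofa
export I_MPI_OFA_ADAPTER_NAME=mlx4_0
export I_MPI_OFA_NUM_PORTS=1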
When both ports are configured, is there any special setting needed so that the 10GigE/RoCE and IB sides work without interfering with each other? Do I need to set up opensm, which manages the IB side, to use only the IB port for fabric management? Can you suggest any guidelines for running with both ports configured?
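For example, is something along these lines what is needed to keep opensm bound to the IB port only? The GUID below is just a placeholder for the IB port GUID as reported by ibstat:

ibstat mlx4_0                  # note the Port GUID of the IB port
opensm -g 0x0002c903000xxxxx   # bind the SM to that port GUID only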
Is there any adverse effect from having BOTH RoCE and IB operating on a cluster at the same time?
The systems run RHEL 6.3 using the stock OFED and opensm that ship with it.
uname -a :
Linux host 2.6.32-279.25.2.el6.x86_64 #1 SMP Tue May 14 16:19:07 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux
From what I can see, the configuration you are describing should be workable. In your description you said the connections became confused and stalled at times; it sounds like you suspect the issue comes from running both InfiniBand and Ethernet on the same card.
Could you elaborate on this error?
Is there a specific output you are seeing?
What is the traffic like on these ports during this error condition?
Does it happen when both ports are carrying egress traffic, or ingress, or a mix?
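For example, one rough way to check (assuming the standard infiniband-diags and the sysfs counters are available on your nodes) would be:

perfquery -x                                                      # extended port counters for the local port
cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_data
cat /sys/class/infiniband/mlx4_0/ports/2/counters/port_rcv_data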
I noticed you are using the Community OFED that ships with RHEL 6.3. Have you had any success trying our driver instead?
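If it helps, a quick way to confirm which mlx4 driver stack is currently loaded (just the generic module query, nothing vendor-specific):

modinfo mlx4_core   # the filename and version fields show which driver build is in use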
Could you provide the ibv_devinfo output for your machines? I would like to see the PSID of your cards.
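For what it's worth, I believe the board_id field in the ibv_devinfo output carries the PSID, so something like the following should be enough:

ibv_devinfo -v | grep -E 'hca_id|fw_ver|board_id'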