I am working on a project in which I have a small set of servers with ConnectX-3 HCAs connected to an IS5030 switch.
No IP, just IB.
Given either one or many multicast groups, with one reader and one writer on each machine with the appropriate CPU affinity,
I observe the following behavior:
Only 1 writer in the cluster, everything else reads: the only XmitWait increments are on the sending HCA, which is just trying to get the multicast packets to the switch.
All of the IB counters on everything look great, even at many multiples of the message rate used in the problem scenario below.
If I introduce just 1 more multicast writer into the mix and they are both at 5k msg/sec, XmitWait on the transmitting switch ports for the multicast group starts growing. The more writers, the worse it gets.
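To compare the two scenarios, the XmitWait growth can be expressed as a rate instead of a raw count. This is only a minimal sketch: it assumes two readings of the same port counter taken a known interval apart, however they are collected. The function name and the sample values are hypothetical, not measurements from my cluster.

```python
def xmit_wait_rate(first_sample, second_sample, interval_seconds):
    """Return XmitWait ticks accumulated per second between two readings."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (second_sample - first_sample) / interval_seconds


# Hypothetical readings taken 10 seconds apart on one switch port.
rate = xmit_wait_rate(1_000, 51_000, 10.0)
print(f"XmitWait growth: {rate:.0f} ticks/sec")
```

Comparing this rate per switch port between the one-writer and two-writer runs shows which egress ports are actually stalling.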
A subnet manager is running on the switch. I have tried segregating the traffic into different VLs and turning on congestion control.
The problem seems to be tied specifically to two or more machines generating multicast traffic to the same switch at any decent frequency.
I’m using 4k buffers but my message size is only 512 bytes.
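For scale, here is a rough calculation of the offered load, which is why the congestion surprises me. It counts payload bytes only (no IB headers). The link figure is an assumption: I am using 32 Gbit/s, the nominal data rate of a 4x QDR link, and the function name is just for illustration.

```python
def offered_load_mbps(msg_per_sec, msg_bytes, writers):
    """Aggregate multicast payload rate in megabits per second."""
    return msg_per_sec * msg_bytes * 8 * writers / 1e6


# Two writers, each sending 512-byte messages at 5k msg/sec.
two_writers = offered_load_mbps(5_000, 512, 2)

# ASSUMPTION: 32 Gbit/s usable data rate (nominal 4x QDR); adjust for the real link.
assumed_link_mbps = 32_000

share = 100 * two_writers / assumed_link_mbps
print(f"{two_writers:.2f} Mbit/s offered, {share:.3f}% of the assumed link rate")
```

Under that assumption the two writers together offer roughly 41 Mbit/s of payload, a small fraction of one percent of the link, so raw bandwidth does not look like the limit.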
Does anyone have any insight into what would be causing the congestion?