Scalability issue for multiple clients


Our setup is:

1 x Mellanox MX354A Dual port FDR CX3 adapter w/1 x QSA adapter

1 x Xeon E5-2450 processor (8 cores, 2.1 GHz)

16 GB memory (4 x 2 GB RDIMMs, 1.6 GHz)

We have a 4-node cluster, and every node acts as both server and client at the same time.

On a write, a node splits the data into 4 pieces and writes them concurrently to the 4 nodes.

On a read, a node reads the pieces back from the 4 nodes.
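A minimal sketch of that 4-way striping, using coreutils `split` purely for illustration (the file names are ours, not from the actual system):

```shell
# Illustrative only: stripe a small file into 4 pieces and reassemble it.
printf 'abcdefghij' > data.bin
split -n 4 -d data.bin stripe.        # GNU split: 4 byte-chunks, stripe.00..stripe.03
cat stripe.0* > reassembled.bin
cmp data.bin reassembled.bin && echo "stripes reassemble losslessly"
```

In the real system each stripe would go to a different node over the fabric rather than to a local file.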

We expected throughput to scale with the number of clients.

When a single node is reading, it gets 6.4 GB/s of bandwidth,

but when 2 nodes are reading, each gets only 5 GB/s, even though the aggregate bandwidth should be sufficient.
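To quantify the gap, using only the numbers measured above:

```shell
# Arithmetic on the measured figures: 6.4 GB/s for 1 reader, 5 GB/s each for 2.
awk 'BEGIN {
  single = 6.4; per2 = 5.0
  printf "aggregate with 2 readers: %.1f GB/s\n", 2 * per2
  printf "ideal linear aggregate:   %.1f GB/s\n", 2 * single
  printf "per-client slowdown:      %.0f%%\n", (1 - per2 / single) * 100
}'
```

So each client loses roughly 22% of its single-reader bandwidth as soon as a second reader joins.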

There is only 1 CPU per node, so no NUMA effects arise.

Suspecting NIC cache misses, we measured PCIe reads using pcm-pcie.

PCIe read traffic simply does not scale with an increasing number of clients, even though the available PCIe bandwidth is much higher.

There must be contention when multiple connections (QPs) read from a single server.

Can Mellanox pinpoint the root cause and suggest a possible solution for multi-client scalability?

It might be useful to reproduce your test with ib_read_bw/ib_write_bw, or iperf if you are using TCP, and share the output.
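For example, a run might look like this (the device name `mlx4_0` and the server hostname placeholder are assumptions; adjust them to your setup):

```shell
# Hardware-dependent sketch; will not run without a ConnectX adapter present.
# On the server node:
ib_read_bw -d mlx4_0 --report_gbits
# On each client node; start two clients at once to reproduce the 2-reader case:
ib_read_bw -d mlx4_0 --report_gbits <server-hostname>
```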

Do you see any drops in the ‘ethtool -S’ output or in the device statistics (ifconfig, ip)?
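A quick way to scan for drop counters; the sample output below is made up for illustration, so on a live host pipe the real ‘ethtool -S’ output through the same filter instead:

```shell
# Filter drop/discard counters; on a real host you would run:
#   ethtool -S eth2 | grep -E 'drop|discard|out_of_buffer'
sample='rx_packets: 1000000
rx_dropped: 0
rx_out_of_buffer: 42
tx_packets: 900000
tx_dropped: 0'
printf '%s\n' "$sample" | grep -E 'drop|discard|out_of_buffer'
```

A steadily growing out-of-buffer-style counter would suggest the receiver is not posting buffers fast enough.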

Does the sender use a single CPU core when writing to two different clients? In other words, does it use the same thread?
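One way to check is per-thread CPU usage while the writes are in flight (`<pid>` is a placeholder for your sender process id):

```shell
# Per-thread CPU usage, if sysstat is installed:
pidstat -t -p <pid> 1
# Alternative with procps top; -H lists individual threads:
top -H -p <pid>
```

If one thread sits near 100% of a core while serving both clients, that thread is likely the bottleneck.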

You might also check the output of the ‘mlnx_perf’ command; note that it requires Mellanox OFED to be installed on the host.
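For instance (the interface name is an assumption):

```shell
# Prints per-second hardware counters for the given interface (needs MLNX_OFED):
mlnx_perf -i eth2
```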