Theoretical Bandwidth and Latency of NIC

Hi guys:

I am currently conducting research using the ib_write_bw and ib_write_lat tools.
I am using EDR x4.

From my understanding, EDR x4 uses 64b/66b encoding on the physical link, and the signaling bandwidth is 25.78 Gbps * 4, making the theoretical bandwidth 25.78 Gbps * 4 * (64/66) = 100 Gbps.

However, when I ran ib_write_bw, I noticed the bandwidth ceiling is 96.6 Gbps instead of 100 Gbps. I know people are usually satisfied with 96 Gbps, but just out of curiosity: why?

Or is the actual signaling BW for EDR 25 Gbps instead of 25.78 Gbps? In that case, after applying the encoding loss, we would get 25 Gbps * 4 * (64/66) ≈ 97 Gbps.
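For what it's worth, a quick script (my own sketch, assuming the nominal EDR signaling rate of 25.78125 Gbaud per lane) makes the two hypotheses explicit:

```python
# Sanity-check of the two hypotheses above (all rates in Gb/s).
# EDR per-lane signaling is nominally 25.78125 Gbaud with 64b/66b encoding.
signaling_rate = 25.78125  # Gbaud per lane (assumed nominal value)
lanes = 4
encoding = 64 / 66

data_rate = signaling_rate * lanes * encoding
print(f"25.78125 Gbaud signaling -> {data_rate:.2f} Gb/s")  # 100.00

# Alternative hypothesis: 25 Gb/s per lane is already the post-encoding
# data rate, so applying 64/66 again double-counts the encoding loss.
double_counted = 25 * lanes * encoding
print(f"double-counted encoding -> {double_counted:.2f} Gb/s")  # 96.97
```

With the nominal signaling rate, the 64b/66b overhead lands exactly on 100 Gbps; applying the 64/66 factor to an already-encoded 25 Gbps data rate double-counts the loss.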

Or is it because of processing overhead?

The latency of sending 2097152 B in the ib_write_lat test is ~176 us, but isn't the theoretical latency of sending 2097152 B at 100 Gbps 1 s / (((100 Gbps / 8) * 1024 * 1024 * 1024) / 2097152) = 156 us?
Where do the missing 176 us - 156 us = 20 us go?
Are there any methods I could use to find out?
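One thing worth double-checking in the calculation above (my own sketch, not from the thread): link rates are quoted in decimal units (1 Gbps = 10^9 bit/s), so dividing by 1024^3 understates the serialization time. It's also informative to compute the time at the observed ~96.5 Gbps ceiling:

```python
# Serialization time of a 2097152-byte message under different assumptions.
msg_bytes = 2097152
bits = msg_bytes * 8

t_decimal = bits / 100e9                       # 100 Gbps = 100 * 10^9 bit/s
t_binary = msg_bytes / ((100 / 8) * 1024**3)   # treating "GB/s" as GiB/s
t_observed = bits / 96.5e9                     # at the measured ~96.5 Gbps

print(f"decimal Gbps: {t_decimal * 1e6:.2f} us")   # 167.77
print(f"binary GiB/s: {t_binary * 1e6:.2f} us")    # 156.25
print(f"at 96.5 Gbps: {t_observed * 1e6:.2f} us")  # 173.86
```

So measured against the achievable 96.5 Gbps, the observed 176 us leaves only ~2 us of non-serialization latency, not 20 us.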

We are doing usec-level datacenter-application research, so every usec is pretty important to us.

~> ib_write_bw -d mlx5_1 -a -F --report_gbits --disable_pcie_relaxed 192.17.103.4
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: OFF
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x02 QPN 0x0154 PSN 0x76f7ac RKey 0x1bdfdf VAddr 0x007f3e25da2000
 remote address: LID 0x01 QPN 0x0154 PSN 0xc8c064 RKey 0x1c0100 VAddr 0x007f8d8d6ee000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          5000             0.14               0.11               6.688079
 4          5000             0.28               0.25               7.727968
 8          5000             0.55               0.47               7.413184
 16         5000             1.11               1.00               7.773782
 32         5000             2.23               1.99               7.782522
 64         5000             4.45               3.99               7.783585
 128        5000             8.90               7.58               7.404378
 256        5000             17.81              15.68              7.653848
 512        5000             35.31              30.64              7.479293
 1024       5000             70.02              57.50              7.018787
 2048       5000             94.71              87.01              5.310566
 4096       5000             95.81              93.54              2.854465
 8192       5000             95.67              94.43              1.440854
 16384      5000             95.88              95.73              0.730324
 32768      5000             96.08              96.07              0.366466
 65536      5000             96.38              96.38              0.183824
 131072     5000             96.16              96.13              0.091677
 262144     5000             96.39              96.39              0.045962
 524288     5000             96.39              96.37              0.022976
 1048576    5000             96.52              96.52              0.011506
 2097152    5000             96.52              96.51              0.005753
 4194304    5000             96.49              96.49              0.002876
 8388608    5000             96.50              96.50              0.001438
---------------------------------------------------------------------------------------

Appreciate it so much,
Jerry

Hi Jerry,

I’m going to answer the first question (why don’t you get 100Gbps) and I hope you can derive the rest from that.

When using the ib_write_bw tool, there is a parameter called MTU. This parameter defines the maximum IB payload length in bytes, and it must be a power of two. In your case you set it higher than 4 KB, so the IB MTU chosen (as shown in your output) is 4096 B. When we calculate the actual RoCE underlay bandwidth, we need to take into account that RoCE has headers which are essential but are not the data itself (the payload).

Total header length (typical, but it might differ a bit in your case) = (Layer 1 (PHY) = 20 bytes) + (Layer 2 (MAC) = 18 bytes) + (Layer 3 (IP) = 20 bytes) + (Layer 4 (UDP) = 8 bytes) + (BTH (IB header) = 12 bytes) + (RETH (IB header) = 16 bytes) + (ICRC (IB) = 4 bytes) = 98 bytes.
You can review your exact message layout with tcpdump.

Let’s calculate the maximum bandwidth in case MTU=4096 and message size = 2097152 Bytes:

  • As explained above - the ib_write_bw MTU will be 4096
  • Number of MTUs in one message = 2097152 / 4096 = 512
  • Number of actual bytes we need to send to transfer the message (with the headers/overhead) = 512 * (4096+98) = 2147328
  • Maximum bandwidth = 2097152 / 2147328 * 100 = 97.66%
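The four steps above can be reproduced in a few lines (a sketch using the 98-byte total header from Yaniv's breakdown, which may differ on other setups):

```python
# Efficiency = payload / (payload + per-packet overhead on the wire).
mtu = 4096            # IB payload bytes per packet (from the ib_write_bw output)
overhead = 98         # PHY + MAC + IP + UDP + BTH + RETH + ICRC, per the breakdown
msg = 2097152         # message size in bytes

packets = msg // mtu                      # 512 full-MTU packets
wire_bytes = packets * (mtu + overhead)   # 2147328 bytes on the wire
efficiency = msg / wire_bytes
print(f"{efficiency * 100:.2f}% of line rate")  # 97.66%
```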

Note that beyond the message layout (which may include more headers/overhead), you may see slightly different bandwidth because many parameters can affect it (PCIe configuration, CPU, and more).

Regards,
Yaniv

Hi Yaniv:

Thank you so much for answering. Actually, I was using InfiniBand instead of RoCE, so some of the header overhead goes away. For RDMA_Write, the overhead should be <40 B of header per packet? I was expecting ~99 Gbps based on that number LOL.
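If my reading of the native IB packet format is right (LRH 8 B + BTH 12 B + ICRC 4 B + VCRC 2 B on every packet, plus a 16 B RETH only on the first packet of the write; these sizes are my assumption, not from the thread), the expected efficiency works out to roughly 99.4%:

```python
# Rough native-IB efficiency estimate (header sizes are my assumption:
# LRH 8 B + BTH 12 B + ICRC 4 B + VCRC 2 B = 26 B on every packet,
# RETH 16 B on the first packet of the RDMA Write only).
mtu = 4096
per_packet = 8 + 12 + 4 + 2   # 26 bytes on every packet
msg = 2097152
packets = msg // mtu          # 512 packets

wire = packets * (mtu + per_packet) + 16  # + RETH once per message
eff = msg / wire
print(f"{eff * 100:.2f}% of line rate")   # ≈99.37%
```

That is consistent with the ~99 Gbps expectation, so the remaining ~3 Gbps would have to come from elsewhere (PCIe, CQ moderation, CPU, etc.).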

Hmm, is there any way I could push this number as close to 100 Gbps as possible? Or is there at least a way to find the source of that missing 3.5 Gbps?

Jerry

Pin cores to the closest NUMA node, disable C-states in the BIOS, and run mlnx_tune to optimize settings for throughput (pay attention to the feedback in the output and fix any warnings). Also play with the PCIe MRR (Max Read Request) and max payload size. Even then you may not achieve the desired 99 Gbps; it also depends on other factors on the server which we may not have control over.
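As a concrete starting point, those suggestions might look like the following (a sketch, not a recipe; the device name, NUMA node, tuning profile, and address are assumptions to adapt to your system, and profile names vary by MLNX_OFED version, so check `mlnx_tune -h`):

```shell
# Find the NUMA node the HCA is attached to.
cat /sys/class/infiniband/mlx5_1/device/numa_node

# Run the Mellanox tuning utility with a throughput-oriented profile
# (profile names depend on your MLNX_OFED version).
mlnx_tune -p HIGH_THROUGHPUT

# Pin ib_write_bw's CPU and memory to that NUMA node (node 0 assumed here).
numactl --cpunodebind=0 --membind=0 \
    ib_write_bw -d mlx5_1 -a -F --report_gbits 192.17.103.4
```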