What is the actual maximum speed of Jetson AGX Xavier PCIE Ethernet?

Dear Nvidia Team

We are using PCIe to communicate between two Xaviers, following the "Jetson AGX Xavier PCIe Endpoint Mode" guide, but we have some questions about actual use.
System information: JetPack 4.6, kernel version 4.9.253-tegra, device Jetson AGX Xavier.
The tests I performed are as follows:
1. Xavier B starts the iperf3 server

# iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------

Accepted connection from 192.168.66.6, port 40890
[  5] local 192.168.66.7 port 5201 connected to 192.168.66.6 port 40892
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-1.00   sec   359 MBytes  3.01 Gbits/sec
[  5]   1.00-2.00   sec   478 MBytes  4.01 Gbits/sec
[  5]   2.00-3.00   sec   423 MBytes  3.55 Gbits/sec
[  5]   3.00-4.00   sec   464 MBytes  3.89 Gbits/sec
[  5]   4.00-5.00   sec   484 MBytes  4.06 Gbits/sec
[  5]   5.00-6.00   sec   553 MBytes  4.64 Gbits/sec
[  5]   6.00-7.00   sec   551 MBytes  4.63 Gbits/sec
[  5]   7.00-8.00   sec   521 MBytes  4.37 Gbits/sec
[  5]   8.00-9.00   sec   167 MBytes  1.40 Gbits/sec
[  5]   9.00-10.00  sec   172 MBytes  1.44 Gbits/sec
[  5]  10.00-11.00  sec   168 MBytes  1.41 Gbits/sec
[  5]  11.00-12.00  sec   172 MBytes  1.44 Gbits/sec
[  5]  12.00-13.00  sec   161 MBytes  1.35 Gbits/sec
[  5]  13.00-14.00  sec   358 MBytes  3.01 Gbits/sec
[  5]  14.00-14.16  sec  90.2 MBytes  4.63 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-14.16  sec  0.00 Bytes  0.00 bits/sec                  sender
[  5]   0.00-14.16  sec  5.00 GBytes  3.03 Gbits/sec                  receiver

2. Xavier A starts the iperf3 client and sends 5G of data with a target bandwidth of 10000 Mbit/s

# iperf3 -c 192.168.66.7 -b 10000M -n 5G
Connecting to host 192.168.66.7, port 5201
[  4] local 192.168.66.6 port 40892 connected to 192.168.66.7 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   379 MBytes  3.18 Gbits/sec    0   1.11 MBytes
[  4]   1.00-2.00   sec   472 MBytes  3.95 Gbits/sec    0   1.11 MBytes
[  4]   2.00-3.00   sec   431 MBytes  3.62 Gbits/sec    0   1.11 MBytes
[  4]   3.00-4.00   sec   464 MBytes  3.89 Gbits/sec    0   1.11 MBytes
[  4]   4.00-5.00   sec   483 MBytes  4.06 Gbits/sec    0   1.11 MBytes
[  4]   5.00-6.00   sec   554 MBytes  4.64 Gbits/sec    0   1.54 MBytes
[  4]   6.00-7.00   sec   551 MBytes  4.62 Gbits/sec    0   1.54 MBytes
[  4]   7.00-8.00   sec   507 MBytes  4.25 Gbits/sec    0   3.93 MBytes
[  4]   8.00-9.00   sec   169 MBytes  1.42 Gbits/sec    0   3.93 MBytes
[  4]   9.00-10.00  sec   171 MBytes  1.43 Gbits/sec    0   3.93 MBytes
[  4]  10.00-11.00  sec   167 MBytes  1.40 Gbits/sec    0   3.93 MBytes
[  4]  11.00-12.00  sec   173 MBytes  1.45 Gbits/sec    0   3.93 MBytes
[  4]  12.00-13.00  sec   160 MBytes  1.35 Gbits/sec    0   3.93 MBytes
[  4]  13.00-14.00  sec   371 MBytes  3.12 Gbits/sec    0   3.93 MBytes
[  4]  14.00-14.12  sec  68.1 MBytes  4.63 Gbits/sec    0   3.93 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-14.12  sec  5.00 GBytes  3.04 Gbits/sec    0             sender
[  4]   0.00-14.12  sec  5.00 GBytes  3.04 Gbits/sec                  receiver

3. Check the actual physical rate of the PCIe link

# lspci
0000:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1)
0000:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a809
0005:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1)
0005:01:00.0 Network controller: NVIDIA Corporation Device 2296
# lspci -n | grep -i 0005:01:00.0
0005:01:00.0 0280: 10de:2296
# lspci -n -d 10de:2296 -vvv | grep --color Width
                LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM not supported, Exit Latency L0s <1us, L1 <64us
                LnkSta: Speed 16GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

After carrying out the above tests, it was found that there are several problems in use:

  • Using lspci, we can see that the actual PCIe link runs at 16GT/s, width x8, so the throughput is about 15.754 GB/s, i.e. 126 Gb/s. However, iperf3 measures an average rate of only about 3.04 Gbits/sec, which is very different from the physical PCIe rate. Why is this?
  • The iperf3 results show that the PCIe Ethernet rate fluctuates between 1.35 Gbits/sec and 4.64 Gbits/sec. Many long-running tests give the same result. What causes the unstable rate?
  • What is the official maximum actual speed of PCIe Ethernet on Jetson AGX Xavier?
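
As a rough check of the figures in the first bullet, the sketch below recomputes the theoretical link throughput from the lspci values (16 GT/s, x8) and compares it with the iperf3 average. It assumes 128b/130b line encoding, which PCIe uses at 8 GT/s and above, and ignores all protocol overhead above the physical layer. The function name is only illustrative.

```python
def pcie_raw_throughput_bytes_per_s(gt_per_s: float, lanes: int) -> float:
    """Raw link throughput in bytes/s, assuming 128b/130b encoding."""
    payload_bits_per_s = gt_per_s * 1e9 * 128 / 130  # per lane, after line encoding
    return payload_bits_per_s * lanes / 8            # all lanes, bits -> bytes


link = pcie_raw_throughput_bytes_per_s(16, 8)  # values reported by lspci
link_gbit = link * 8 / 1e9
iperf_gbit = 3.04                              # average reported by iperf3

print(f"link: {link / 1e9:.3f} GB/s = {link_gbit:.1f} Gbit/s")
print(f"iperf3 reaches {100 * iperf_gbit / link_gbit:.1f}% of the raw link rate")
```

Under these assumptions the numbers quoted above are consistent: the link itself is far from being the limit in the iperf3 test.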

Thank you for your support.
Best regards

Duplicate of "What's the maximum speed supported by PCIE Ethernet for Jetson AGX Xavier?"

  • Using lspci, we can see that the actual PCIe link runs at 16GT/s, width x8, so the throughput is about 15.754 GB/s, i.e. 126 Gb/s. However, iperf3 measures an average rate of only about 3.04 Gbits/sec, which is very different from the physical PCIe rate. Why is this?
    The bottleneck is in the Linux kernel IP layers:
    1. The maximum sk_buff (skb) size is 64K in the Linux kernel, so IP packets need to be fragmented and reassembled.
    2. The IP packet checksum is calculated in software (by the CPU).
    3. The Linux TCP/IP layers need to process every packet.

  • The iperf3 results show that the PCIe Ethernet rate fluctuates between 1.35 Gbits/sec and 4.64 Gbits/sec. Many long-running tests give the same result. What causes the unstable rate?
    Have you applied the rt-patch? If yes, could you please try again on the normal kernel?

  • What is the official maximum actual speed of PCIe Ethernet on Jetson AGX Xavier?
    On the Xavier AGX 8G module (Gen3), we only got 3 Gbits/sec with JetPack 4.4.
    With the software IP checksum disabled (skb->ip_summed = CHECKSUM_UNNECESSARY),
    we got around 4.5 Gbits/sec.
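
To illustrate why the software checksum in point 2 costs CPU time, here is a minimal sketch of the 16-bit ones'-complement Internet checksum described in RFC 1071. It is not the kernel's implementation (which is optimized C and assembly); it only shows that a software checksum has to touch every byte of every packet, which is the work skipped when the driver marks packets as CHECKSUM_UNNECESSARY.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]     # add each 16-bit big-endian word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF


# Example bytes from the RFC 1071 numerical example
print(hex(internet_checksum(bytes.fromhex("0001f203f4f5f6f7"))))  # prints 0x220d
```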

Hi Jasonm,
Thank you for your reply.

  • Have you applied the rt-patch? If yes, could you please try again on the normal kernel?
    Yes, the version we are currently using has the rt-patch applied. We will ask the relevant colleagues to test further on the normal kernel.

  • On the Xavier AGX 8G module (Gen3), we only got 3 Gbits/sec with JetPack 4.4. With the software IP checksum disabled (skb->ip_summed = CHECKSUM_UNNECESSARY), we got around 4.5 Gbits/sec.
    A speed of 4.5 Gbits/sec is still far below our expectations. For PCIe communication between two Xaviers, is there any solution other than PCIe Ethernet that can reach a speed much higher than 4.5 Gbits/sec? If so, what should we do?

Hi NOPUFF,
1. What throughput would satisfy your use case?
2. Is it required that PCIe is used as a network interface? Would you mind sharing your use case?

BR.,
Jasonm

  1. We want at least 10 Gigabit Ethernet speed. If it can get closer to the actual physical PCIe rate of 16GT/s x8, that is of course best.

  2. If PCIe as Ethernet can reach our expected rate, we would still prefer this method, because it is more convenient to use as Ethernet. However, if only other methods can achieve a higher rate, we are willing to accept them. Our goal is a rate of 10 Gbits/s or even higher.

  3. Our usage scenario in brief: we have 10 cameras (1920x1080, NV12, 30 fps) and 5 lidars, and we need to transfer their data between Xavier A and Xavier B.
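
For reference, the camera part of this scenario can be turned into a bandwidth figure. The sketch below assumes uncompressed NV12 frames (12 bits per pixel) and leaves out the lidar traffic, whose data rate is not stated in this thread. The function name is illustrative.

```python
def nv12_stream_bytes_per_s(width: int, height: int, fps: int) -> int:
    """NV12 uses 12 bits per pixel: a full-size Y plane plus a half-size interleaved UV plane."""
    frame_bytes = width * height * 3 // 2
    return frame_bytes * fps


cameras = 10
total = cameras * nv12_stream_bytes_per_s(1920, 1080, 30)
print(f"{total / 1e6:.1f} MB/s = {total * 8 / 1e9:.2f} Gbit/s for {cameras} cameras")
```

Under these assumptions the cameras alone need roughly 7.5 Gbit/s, which is in line with the 10 Gbits/s target above and well beyond the 3 to 4.5 Gbits/sec measured over PCIe Ethernet.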

BR.,
NOPUFF

Hi Jasonm, can you give me some advice? Thanks.

Hi NOPUFF,
Thanks for your sharing!

  1. As we discussed, because of the Linux kernel network stack, the PCIe virtual network cannot meet the 10 Gbps requirement.
  2. For exchanging the data of "10 cameras, 1920x1080 NV12 30fps, 5 lidars", you can build on our reference code to implement your own feature.

The patch below was verified on JetPack 4.4:
0001-gathered-all-dma-performance-test-patches.patch (11.2 KB)

RP Mode DMA

In the procedure below, x is the number of the root port controller whose DMA is used for the perf test.

Write:

Go to the debugfs directory of the root port controller

#cd /sys/kernel/debug/pcie-x/

Set channel number (set it to one of 0,1,2,3)

#echo 1 > channel

Set size to 512MB

#echo 0x20000000 > size

Set source address for DMA.

For this, grep for the string "Allocated memory for DMA" in the dmesg log and use whatever address comes up in the grep output

#dmesg | grep "Allocated memory for DMA"

example output would be something like

[ 7.102149] tegra-pcie-dw 141a0000.pcie: ---> Allocated memory for DMA @ 0xC0000000

So, use 0xC0000000 as the source address

#echo 0xC0000000 > src

Note: don’t forget to replace 0xC0000000 with the value from your grep output. If it is not found in the grep output, save the full kernel boot log and search in it.

Set destination address for DMA

For this, execute the following command

#lspci -vv | grep -i "region 0"

an example output would be something like

Region 0: Memory at 1f40000000 (32-bit, non-prefetchable) [size=512M]

So, use 1f40000000 as destination address

#echo 0x1f40000000 > dst

Note: don’t forget to replace 0x1f40000000 with the value from your lspci output.

Execute write test

#cat write

It prints the output in the following format(use ‘dmesg |tail’ to get the output)

tegra-pcie-dw 14100000.pcie_c1_rp: DMA write. Size: 536870912 bytes, Time diff: 316519776 ns

Read test can be performed by interchanging ‘src’ and ‘dst’ and executing ‘cat read’ command.
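
The size and time printed by the test can be converted into a throughput figure. The sketch below parses a log line in the format shown above. The regular expression is written against that example line only, and the function name is illustrative.

```python
import re

LOG_RE = re.compile(r"Size: (\d+) bytes, Time diff: (\d+) ns")


def dma_throughput_gbytes_per_s(log_line: str) -> float:
    """Return throughput in GB/s (1e9 bytes/s) from a 'DMA write' log line."""
    match = LOG_RE.search(log_line)
    if match is None:
        raise ValueError("not a DMA perf log line")
    size_bytes, time_ns = int(match.group(1)), int(match.group(2))
    return size_bytes / time_ns  # bytes per nanosecond equals GB/s


line = ("tegra-pcie-dw 14100000.pcie_c1_rp: DMA write. "
        "Size: 536870912 bytes, Time diff: 316519776 ns")
rate = dma_throughput_gbytes_per_s(line)
print(f"{rate:.2f} GB/s = {rate * 8:.1f} Gbit/s")
```

For the example line this gives about 1.70 GB/s; the value on a real system depends on the link and the memory involved.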

EP Mode DMA

Note: Most steps are performed on the RP Xavier; only some information has to be extracted from the EP Xavier.

Write:

In the RP console, go to the debugfs directory of the end point client driver

#cd /sys/kernel/debug/tegra_pcie_ep/

Set channel number (set it to one of 0,1,2,3)

#echo 1 > channel

Set size to 512 MB

#echo 0x20000000 > size

Set source address for EP’s DMA.

For this, grep for the string "BAR0 RAM IOVA" in the dmesg log of the endpoint system console and use whatever address comes up in the grep output

#dmesg | grep "BAR0 RAM IOVA"

an example output would be something like

pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM IOVA: 0xc0000000

So, use 0xc0000000 as source address

#echo 0xc0000000 > src

Note: don’t forget to replace 0xc0000000 with the value from your grep output. If it is not found in the grep output, save the full kernel boot log and search in it.

Set destination address for DMA

For this, grep for the string “Allocated memory for DMA operation” in dmesg log of host system console (i.e. current system) and use whatever address comes up in the grep output

#dmesg | grep " Allocated memory for DMA"

an example output would be something like

tegra_ep_mem 0005:01:00.0: Allocated memory for DMA operation @ 0x80000000, size=0x20000000

So, use 0x80000000 as destination address

#echo 0x80000000 > dst

Note: don’t forget to replace 0x80000000 with the value from your grep output. If it is not found in the grep output, save the full kernel boot log and search in it.

Execute write test

#cat write

It prints the output in the following format

tegra_ep_mem 0000:01:00.0: DMA write: Size: 536870912 bytes, Time diff: 296565536 ns

Read test can be performed by interchanging ‘src’ and ‘dst’ and executing ‘cat read’ command.
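
The manual steps above could also be scripted. The following is only a sketch: it assumes the debugfs file names used in this guide (channel, size, src, dst, write), must run as root on a system with the patch applied, and the addresses in the example are the sample values from this post, not values valid for your board. Both function names are hypothetical.

```python
from pathlib import Path


def configure_dma_test(debugfs_dir: str, channel: int, size: int, src: int, dst: int) -> None:
    """Write the DMA test parameters into the debugfs files named in the steps above."""
    base = Path(debugfs_dir)
    (base / "channel").write_text(f"{channel}\n")
    (base / "size").write_text(f"{size:#x}\n")
    (base / "src").write_text(f"{src:#x}\n")
    (base / "dst").write_text(f"{dst:#x}\n")


def run_dma_write(debugfs_dir: str) -> None:
    """Reading the 'write' file triggers the test; the result goes to the kernel log."""
    (Path(debugfs_dir) / "write").read_text()


if __name__ == "__main__":
    # Sample values from the EP Mode steps; replace src/dst with your own dmesg values.
    ep_dir = "/sys/kernel/debug/tegra_pcie_ep"
    if Path(ep_dir).is_dir():
        configure_dma_test(ep_dir, channel=1, size=0x20000000,
                           src=0xC0000000, dst=0x80000000)
        run_dma_write(ep_dir)
```

The same helper works for RP Mode by passing the /sys/kernel/debug/pcie-x/ directory and the RP addresses instead.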

Hi Jasonm
Thank you so much for such a detailed guide. Following the steps you suggested, I successfully got the following output:

#cat write
[  253.589895] tegra_ep_mem 0005:01:00.0: DMA write: Size: 536870912 bytes, Time diff: 37065888 ns

With this approach, could you please advise how I should write a program to communicate between Xavier A and Xavier B? For example, I have a 10G test.data file and I want to send it from Xavier B (EP) to Xavier A (RP) through PCIe DMA. What should I do? Are there any relevant examples or manuals?
BR.,

I have one more question about the RP Mode DMA steps in your guide.
In the /sys/kernel/debug/pcie-0 directory, after setting all the parameters, I execute cat write but do not see any output. Is this normal?

Hi Jasonm
We urgently need your help. Could you please advise us on the issues I mentioned earlier?

Hi NOPUFF,
1. From your test log, the bandwidth is
536870912 bytes / 37065888 ns = 14.48 GByte/sec ≈ 115.9 Gbps,
far more than your 10 Gbps objective.
2. Could you please check the patch we provided and the related code, and implement your own driver/app based on it? We have no resources to provide code for a customer's use case.

BR.,
Jasonm

Yes, the test results show that the bandwidth requirement is met. Regarding this use case: I have a 10G test.data file and I want to send it from Xavier B (EP) to Xavier A (RP) through PCIe DMA. Can you give me a simple guide on how to do it? Is there any documentation on operating the Xavier PCIe DMA?

Hi NOPUFF,
1. "Regarding this use case: I have a 10G test.data"
[NV] You need to load test.data into memory and pass the load address (or, if the data comes from a camera, the address of the camera data buffer) to the PCIe DMA as the start address.

2. "I want to send this file from Xavier B (EP) to Xavier A (RP) through PCIE DMA"
[NV] Could you please check the PCIe-related code? The sample code already includes the full PCIe DMA operations. For example,
in nvidia/drivers/pci/endpoint/functions/pci-epf-tegra-vnet.c:
/* Trigger DMA write from src_iova to dst_iova */

    ep_dma_virt[desc_widx].size = len;  //DMA size
    ep_dma_virt[desc_widx].sar_low = lower_32_bits(src_iova); // DMA 64bit source address
    ep_dma_virt[desc_widx].sar_high = upper_32_bits(src_iova); 
    ep_dma_virt[desc_widx].dar_low = lower_32_bits(dst_iova); //DMA 64bit destination address 
    ep_dma_virt[desc_widx].dar_high = upper_32_bits(dst_iova); 
    /* CB bit should be set at the end */
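
To make the address handling in the excerpt above concrete, here is a small user-space model of it. It is not driver code: DmaDesc, fill_descriptor and transfers_needed are hypothetical names, the field names simply mirror the excerpt, and "10G" is assumed to mean 10 GiB moved in 512 MB transfers as in the perf test.

```python
from dataclasses import dataclass

CHUNK = 0x20000000  # 512 MB, the transfer size used in the perf test


@dataclass
class DmaDesc:
    size: int
    sar_low: int   # source address, lower 32 bits
    sar_high: int  # source address, upper 32 bits
    dar_low: int   # destination address, lower 32 bits
    dar_high: int  # destination address, upper 32 bits


def lower_32_bits(value: int) -> int:
    return value & 0xFFFFFFFF


def upper_32_bits(value: int) -> int:
    return (value >> 32) & 0xFFFFFFFF


def fill_descriptor(src_iova: int, dst_iova: int, length: int) -> DmaDesc:
    """Split the 64-bit source and destination addresses as the excerpt does."""
    return DmaDesc(length,
                   lower_32_bits(src_iova), upper_32_bits(src_iova),
                   lower_32_bits(dst_iova), upper_32_bits(dst_iova))


def transfers_needed(total_len: int, chunk: int = CHUNK) -> int:
    """Number of chunk-sized DMA transfers needed to move total_len bytes."""
    return -(-total_len // chunk)  # ceiling division


print(fill_descriptor(0xC0000000, 0x1F40000000, CHUNK))
print(transfers_needed(10 * 1024**3))
```

How the file is staged into the DMA buffer between transfers, and how the other side is notified, is up to your own driver or application and is not covered by this sketch.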

BR.,
Jasonm
