I have two AGX Orin Dev Kits connected directly with a CAT6 Ethernet cable to test the networking speed, which is supposed to reach up to 10 Gbps. I am using iperf2 for these tests because iperf3 reportedly has issues with multithreading. I run the tests with 12 threads, with each Orin set to MAXN power mode and jetson_clocks enabled. However, the best throughput I can achieve is around 8.06 Gbps:
The iperf client side shows close to 100% usage on CPU0, which I assume is the bottleneck.
I figured this meant iperf was not being allowed to run on multiple cores, but checking the affinity of the iperf process shows it is allowed to run on any of the CPUs.
With those flags, and running with 12 threads, I am able to get closer to the 10 Gbps target. I see a maximum of around 9.6 Gbps, which is probably as good as I should expect. However, even with 12 threads, CPU0 is still pegged at 100% while the other cores are mostly idle. In addition, the speed varies from run to run, from about 8.3 Gbps up to 9.6 Gbps, averaging around 9 Gbps. I need a consistent >9.5 Gbps at a minimum.
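For reference, the allowed-CPU mask of a process can also be read programmatically. Below is a minimal sketch using sched_getaffinity(); the PID is only a placeholder for the actual iperf client PID.

```c
/* Print the set of CPUs the scheduler allows a given process to run on.
 * The PID below is a placeholder; substitute the real iperf client PID. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
    pid_t pid = 1234;            /* hypothetical PID of the iperf client */
    cpu_set_t set;
    CPU_ZERO(&set);

    if (sched_getaffinity(pid, sizeof(set), &set) != 0) {
        perror("sched_getaffinity");
        return 1;
    }

    printf("allowed CPUs:");
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            printf(" %d", cpu);
    printf("\n");
    return 0;
}
```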
Can someone from NVIDIA please confirm whether this is a hardware limitation or not? UDP performs better thanks to its smaller headers, but it still cannot reach a consistent >9.5 Gbps.
FYI, it is often best for a driver to stick to one CPU core. The reason is cache behavior: any time the work migrates to a new core, it starts with a cold cache that has to be filled again. Staying on a single core does not guarantee a cache hit, but it makes hits much more likely (versus a guaranteed miss right after a migration).
There is a problem, though, when two independent hardware devices are serviced on the same core. It would be better if each lived on its own core. To accomplish that, the other core has to be able to service the hardware interrupt (an actual wire from the device with I/O routed to that core, not just a timer on a software process). If the wiring does not exist, then the hardware IRQ cannot migrate to another core. You can tell the kernel to put that IRQ on a specific “other” core (IRQ affinity), but when it comes time to service the interrupt, it will end up back on a core with the actual wiring.
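As an illustration, this is roughly how one would try to steer a hardware IRQ via its /proc/irq/<N>/smp_affinity_list entry (a minimal sketch; the IRQ number 123 and the target CPU are placeholders, and the real NIC IRQ number should come from /proc/interrupts). On hardware where the interrupt is physically wired only to CPU0, the write is rejected or has no effect, which is exactly the limitation described above.

```c
/* Attempt to move one hardware IRQ to CPU 2 by writing its
 * /proc/irq/<N>/smp_affinity_list entry (must run as root).
 * IRQ 123 is a placeholder; look up the real one in /proc/interrupts. */
#include <stdio.h>

int main(void)
{
    const int irq = 123;         /* hypothetical IRQ number */
    const char *cpus = "2";      /* desired CPU list */
    char path[64];

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fprintf(f, "%s\n", cpus) < 0 || fclose(f) != 0) {
        /* An I/O error here typically means the IRQ cannot be steered there */
        perror("write smp_affinity_list");
        return 1;
    }
    return 0;
}
```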
Something to consider is to learn about “isolcpus” and “sched_setaffinity” (CPU affinity). Then make sure any software process is on a different core (for user space that is obvious, since it is a PID; for kernel space it is anything running via ksoftirqd, the kernel thread that services software IRQs). There are many hardware IRQs which can only run on CPU0. Examine the hardware IRQ stats via: cat /proc/interrupts
As an example, perhaps any software using significant CPU power (e.g., benchmarking software) could have its affinity spread evenly across cores 1 through 11 and be denied core 0.
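A rough sketch of the sched_setaffinity() side of that suggestion follows. This is illustration only: the core count of 12 matches the AGX Orin, the program only pins itself before exec'ing the workload, and the iperf server address is a made-up placeholder.

```c
/* Restrict the calling process (and anything it execs) to cores 1-11,
 * keeping it off core 0 where the hardware IRQs are being serviced. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);

    for (int cpu = 1; cpu < 12; cpu++)      /* allow cores 1..11, deny core 0 */
        CPU_SET(cpu, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }

    /* Run the CPU-heavy workload from here so it inherits the restricted
     * mask; the server address is a placeholder. */
    execlp("iperf", "iperf", "-c", "192.168.1.2", "-P", "12", (char *)NULL);
    perror("execlp");
    return 1;
}
```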
Interesting. I suppose that is why something like RDMA is so useful. This could be a good workaround for using the dev kit in actual production without adding something like a ConnectX card.
If RDMA is a purely software process, then moving it to a new core would help both CPU0 IRQs and RDMA. The benefit to CPU0 would be indirect, and it would depend on how much load RDMA produces. But that is the correct way of thinking about it.
Incidentally, a good hardware driver design performs the minimum possible work in the hardware interrupt itself, and then splits off the follow-up work so that the hardware IRQ is scheduled separately from the software IRQ it produces. A contrived example is a network device: the hardware must be accessed via physical addresses for reading or writing, but if a checksum is computed on the result, that checksum can live in a separate software half of the driver. Access to the hardware is then decoupled from checksumming, which shortens the time the hardware spends in an atomic access state (versus taking longer if the same IRQ also runs the checksum). That software half, the checksum, can then be migrated to another core via ksoftirqd scheduling.
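A very rough sketch of that split, as illustration only: the device, IRQ number, and mydev_* names are made up, and it assumes the pre-5.9 tasklet API. The hard IRQ handler does only the minimum and defers, while the checksum work runs later in softirq context, where ksoftirqd can pick it up on another core.

```c
/* Illustrative kernel-module sketch of a hard-IRQ top half that defers
 * checksum work to a tasklet (softirq context). Pre-5.9 tasklet API. */
#include <linux/module.h>
#include <linux/interrupt.h>

#define MYDEV_IRQ 200   /* hypothetical IRQ line; a real driver gets this from its device */

static void mydev_checksum_bh(unsigned long data)
{
    /* Bottom half: runs in softirq context (serviced by ksoftirqd when
     * deferred), so the checksum no longer extends the hard-IRQ window. */
    /* ... walk the received buffer and verify its checksum ... */
}

static DECLARE_TASKLET(mydev_tasklet, mydev_checksum_bh, 0);

static irqreturn_t mydev_hard_irq(int irq, void *dev_id)
{
    /* Top half: touch the hardware just enough to acknowledge the
     * interrupt and pull the data out, then hand off the rest. */
    tasklet_schedule(&mydev_tasklet);
    return IRQ_HANDLED;
}

static int __init mydev_init(void)
{
    return request_irq(MYDEV_IRQ, mydev_hard_irq, 0, "mydev_sketch", NULL);
}

static void __exit mydev_exit(void)
{
    free_irq(MYDEV_IRQ, NULL);
    tasklet_kill(&mydev_tasklet);
}

module_init(mydev_init);
module_exit(mydev_exit);
MODULE_LICENSE("GPL");
```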