Orin Unable to Reach 10 Gbps Networking

I have two AGX Orin Dev Kits connected via a CAT6 ethernet cable to test the networking speeds, which are supposed to reach up to 10 Gbps. I am using iperf2 for these tests because iperf3 reportedly has some issues with multithreading. I run the tests with 12 threads, with each Orin set to MAXN power mode and jetson_clocks enabled. However, the best throughput I can achieve is around 8.06 Gbps:
[Screenshot: iperf2 results showing ~8.06 Gbps]
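
For reference, the test commands were along these lines (the address is a placeholder for the other Orin; -P sets the number of parallel client threads):

$ iperf -s                        # server side
$ iperf -c 10.0.0.6 -P 12 -t 10   # client side, 12 parallel threads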

On the iperf client, CPU0 sits at nearly 100% usage, so I assume this is the bottleneck.

I figured this meant that iperf was not being allowed to run on multiple cores, but the affinity of the iperf process actually allows it to run on any of the CPUs.
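
For example, checking the client's affinity with taskset (the PID and output here are illustrative) reports the full core list rather than a single core:

$ taskset -cp $(pgrep -x iperf)
pid 4321's current affinity list: 0-11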

I have seen other forum posts about this issue. I set the MTU of eth0 to 9000, with the same result, and I attempted all of the other solutions posted there, again with the same result.
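
For reference, the MTU was set along these lines:

$ sudo ip link set dev eth0 mtu 9000
$ ip link show eth0    # confirm that mtu 9000 is reported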

Is the Orin hardware limited to servicing eth0 only on core 0? If so, is there some sort of workaround?

Hi,
Please set MTU=9000 and try the command:

$ iperf3 -c 10.0.0.6 -u -b 0 -l 65507
warning: UDP block size 65507 exceeds TCP MSS 8914, may result in fragmentation / drops
Connecting to host 10.0.0.6, port 5201
[  5] local 10.0.0.1 port 52046 connected to 10.0.0.6 port 5201
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-1.00   sec  1.07 GBytes  9.16 Gbits/sec  17490  
[  5]   1.00-2.00   sec   772 MBytes  6.48 Gbits/sec  12360  
[  5]   2.00-3.00   sec   807 MBytes  6.77 Gbits/sec  12920  
[  5]   3.00-4.00   sec   811 MBytes  6.80 Gbits/sec  12980  
[  5]   4.00-5.00   sec  1.07 GBytes  9.18 Gbits/sec  17520  
[  5]   5.00-6.00   sec  1.07 GBytes  9.17 Gbits/sec  17510  
[  5]   6.00-7.00   sec  1.07 GBytes  9.18 Gbits/sec  17520  
[  5]   7.00-8.00   sec  1.07 GBytes  9.15 Gbits/sec  17460  
[  5]   8.00-9.00   sec  1.07 GBytes  9.17 Gbits/sec  17500  
[  5]   9.00-10.00  sec  1.06 GBytes  9.14 Gbits/sec  17440  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  9.80 GBytes  8.42 Gbits/sec  0.000 ms  0/160700 (0%)  sender
[  5]   0.00-10.00  sec  9.78 GBytes  8.40 Gbits/sec  0.068 ms  427/160700 (0.27%)  receiver

iperf Done.

With those flags and 12 threads, I am able to get closer to the 10 Gbit/s target. The maximum I see is around 9.6 Gbps, which is probably as good as I should expect. However, even with 12 threads, CPU0 is still at 100% usage while the other cores are mostly idle. In addition, the speed varies from run to run, from about 8.3 Gbps to 9.6 Gbps, with an average around 9 Gbps. I need a consistent >9.5 Gbps at a minimum.

Can someone from NVIDIA please confirm whether this is a hardware limitation or not? UDP performs better with the smaller headers, but it still cannot reach a consistent >9.5 Gbps.

Hi,
In our tests, we have seen worse throughput while using the Ubuntu desktop. Could you please try the minimal-flavor rootfs:
Root File System — Jetson Linux Developer Guide documentation

And see if you can get steady throughput. We think the complex GUI may have an impact on certain use cases.
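
(If reflashing with the minimal rootfs is not convenient, a rough way to approximate the lighter-GUI condition on the existing image is to boot without the desktop session; this is only a sketch of the idea, not the documented minimal-flavor procedure.)

$ sudo systemctl set-default multi-user.target   # boot to console only, no desktop
$ sudo reboot
$ sudo systemctl set-default graphical.target    # restore the desktop later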

Here are the results of running “iperf3 -c 10.0.0.1 -u -b 0 -l 65507 -P 12” on the minimal-flavor rootfs with the latest Jetson Linux 35.4.1:
[Screenshot: iperf3 results, ~10 Gbit/s]

This is perfect.

But I still can’t help noticing that it’s one CPU doing all the work, even with 12 threads:
[Screenshot: per-core CPU usage, one core saturated]

Luckily we are only using the dev kit for testing, and as long as it is not bottlenecking the speeds, using 1 CPU is ok.

Marking your answer as the solution. Thanks for the help!

FYI, it is often best for a driver to stick to one CPU core. That is because of cache hits versus cache misses. Any time you migrate to a new core you get a cache miss, and the cache has to be filled again. On a single core you are not guaranteed a cache hit, but it is much more likely (versus impossible when migrating).

There is a problem, though, when two independent hardware devices use the same core. It would be better if each lived on its own separate core. To accomplish that, the other core needs to be able to service the hardware interrupt (an actual wire from the device, along with I/O to that core, and not just a simple timer on a software process). If the wiring does not exist, then a hardware IRQ cannot migrate to another core. The scheduler can be told to put that handling on a specific “other” core (IRQ affinity), but when it comes time to service the interrupt, it will migrate back to a core with the actual wiring.
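
As a sketch of what checking and requesting IRQ affinity looks like (the IRQ number 123 and the grep pattern are examples; the Orin's ethernet controller entry will have its own name and number):

$ grep -i eth /proc/interrupts                        # find the NIC's IRQ number and per-core counts
$ cat /proc/irq/123/smp_affinity_list                 # cores currently allowed to service that IRQ
$ echo 2 | sudo tee /proc/irq/123/smp_affinity_list   # request core 2; if the wiring does not allow it, the write fails or is ignored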

Something to consider is to learn about “isolcpus” and “sched_setaffinity” (CPU affinity). Then make sure any software process is on a different core (for user space that is obvious, since it is a PID; for kernel space it is anything running via ksoftirqd, the software IRQ scheduler). There are many hardware IRQs which can only run on CPU0. Examine hardware IRQ stats via:
cat /proc/interrupts

As an example, perhaps any software using significant CPU power (e.g., benchmarking software) could be set with affinity evenly spread on cores 1 through 11, and denied on core 0.
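
A rough sketch of that idea, reusing the iperf3 command from earlier in the thread (the address and the isolcpus note are illustrative, not a tested recipe):

$ taskset -c 1-11 iperf3 -c 10.0.0.1 -u -b 0 -l 65507 -P 12   # keep the benchmark threads off core 0
$ taskset -cp 1-11 <pid>                                      # or change the affinity of an already-running process
# For a stronger guarantee, cores can be reserved with the kernel parameter isolcpus=...
# (on Jetson, added to the APPEND line in /boot/extlinux/extlinux.conf); only explicitly
# pinned tasks are then scheduled on those cores.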

Interesting. I suppose that is why something like RDMA is so useful. This also seems like a good workaround for using the dev kit in actual production without adding something like a ConnectX card.

If RDMA is a purely software process, then moving it to a new core would help both CPU0 IRQs and RDMA. The benefit to CPU0 would be indirect, and it would depend on how much load RDMA produces. But that is the correct way of thinking about it.

Incidentally, a good hardware driver design performs the minimum possible function on the hardware itself, and then separates out other work such that the hardware IRQ is scheduled separately from the software IRQ produced by follow-up work. A contrived example is a network device; consider that the hardware must be accessed via physical address for reading or writing, but if a checksum is used on the result, then the checksum could be placed as a separate software driver. Access to the hardware in that case would be separate from checksums, and thus shorten the time the hardware is in some atomic access state (versus taking longer if the same IRQ also triggers the checksum). For the software half of this the checksum could be migrated to another core via ksoftirqd scheduling.
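
A way to watch that split on a running system (nothing Jetson-specific here):

$ ps -eLo pid,psr,comm | grep ksoftirqd   # one ksoftirqd thread per core, with the core each runs on
$ watch -n 1 cat /proc/softirqs           # per-core counts of NET_RX, NET_TX, and other softirq work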
