I have two AGX Orin Dev Kits connected directly with a CAT6 Ethernet cable to test the networking speed, which is supposed to reach up to 10 Gbps. I am using iperf2 for these tests because iperf3 reportedly has issues with multithreading. I run the tests with 12 threads, with each Orin set to MAXN power mode and jetson_clocks enabled. However, the best throughput I can achieve is around 8.06 Gbps:
The iperf client side shows close to 100% usage on CPU0, which I assume is the bottleneck.
I figured this meant iperf was not being allowed to run on multiple cores, but checking the affinity of the iperf process shows it is allowed to run on any of the CPUs.
With those flags, and running with 12 threads, I am able to get closer to the 10 Gbps target. I see a maximum of around 9.6 Gbps, which is probably as good as I should expect. However, even with 12 threads, CPU0 is still pegged at 100% while the other cores are mostly idle. In addition, the speed varies from run to run, from about 8.3 Gbps up to 9.6 Gbps, averaging around 9 Gbps. I need a consistent >9.5 Gbps at a minimum.
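For reference, the allowed-CPU mask of a process can also be read programmatically. Below is a minimal sketch using sched_getaffinity(); the PID is only a placeholder for the actual iperf client PID.

```c
/* Print the set of CPUs the scheduler allows a given process to run on.
 * The PID below is a placeholder; substitute the real iperf client PID. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
    pid_t pid = 1234;            /* hypothetical PID of the iperf client */
    cpu_set_t set;
    CPU_ZERO(&set);

    if (sched_getaffinity(pid, sizeof(set), &set) != 0) {
        perror("sched_getaffinity");
        return 1;
    }

    printf("allowed CPUs:");
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            printf(" %d", cpu);
    printf("\n");
    return 0;
}
```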
Can someone from NVIDIA please confirm whether this is a hardware limitation or not? UDP performs better thanks to its smaller headers, but it still cannot reach a consistent >9.5 Gbps.
FYI, it is often best for a driver to stick to one CPU core. The reason is cache behavior: any time the work migrates to a new core, it starts with a cold cache that has to be filled again. Staying on a single core does not guarantee a cache hit, but it makes hits much more likely (versus a guaranteed miss right after a migration).
There is a problem, though, when two independent hardware devices are serviced on the same core. It would be better if each lived on its own core. To accomplish that, the other core has to be able to service the hardware interrupt (an actual wire from the device with I/O routed to that core, not just a timer on a software process). If the wiring does not exist, then the hardware IRQ cannot migrate to another core. You can tell the kernel to put that IRQ on a specific “other” core (IRQ affinity), but when it comes time to service the interrupt, it will end up back on a core with the actual wiring.
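As an illustration, this is roughly how one would try to steer a hardware IRQ via its /proc/irq/<N>/smp_affinity_list entry (a minimal sketch; the IRQ number 123 and the target CPU are placeholders, and the real NIC IRQ number should come from /proc/interrupts). On hardware where the interrupt is physically wired only to CPU0, the write is rejected or has no effect, which is exactly the limitation described above.

```c
/* Attempt to move one hardware IRQ to CPU 2 by writing its
 * /proc/irq/<N>/smp_affinity_list entry (must run as root).
 * IRQ 123 is a placeholder; look up the real one in /proc/interrupts. */
#include <stdio.h>

int main(void)
{
    const int irq = 123;         /* hypothetical IRQ number */
    const char *cpus = "2";      /* desired CPU list */
    char path[64];

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fprintf(f, "%s\n", cpus) < 0 || fclose(f) != 0) {
        /* An I/O error here typically means the IRQ cannot be steered there */
        perror("write smp_affinity_list");
        return 1;
    }
    return 0;
}
```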
Something to consider is to learn about “isolcpus” and “sched_setaffinity” (CPU affinity). Then make sure any software process is on a different core (for user space that is obvious, since it is a PID; for kernel space it is anything running via ksoftirqd, the kernel thread that services software IRQs). There are many hardware IRQs which can only run on CPU0. Examine the hardware IRQ stats via: cat /proc/interrupts
As an example, perhaps any software using significant CPU power (e.g., benchmarking software) could have its affinity spread evenly across cores 1 through 11 and be denied core 0.
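A rough sketch of the sched_setaffinity() side of that suggestion follows. This is illustration only: the core count of 12 matches the AGX Orin, the program only pins itself before exec'ing the workload, and the iperf server address is a made-up placeholder.

```c
/* Restrict the calling process (and anything it execs) to cores 1-11,
 * keeping it off core 0 where the hardware IRQs are being serviced. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);

    for (int cpu = 1; cpu < 12; cpu++)      /* allow cores 1..11, deny core 0 */
        CPU_SET(cpu, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }

    /* Run the CPU-heavy workload from here so it inherits the restricted
     * mask; the server address is a placeholder. */
    execlp("iperf", "iperf", "-c", "192.168.1.2", "-P", "12", (char *)NULL);
    perror("execlp");
    return 1;
}
```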
Interesting. I suppose that is why something like RDMA is so useful. This could be a good workaround for using the dev kit in actual production without adding something like a ConnectX card.
If RDMA is a purely software process, then moving it to a new core would help both CPU0 IRQs and RDMA. The benefit to CPU0 would be indirect, and it would depend on how much load RDMA produces. But that is the correct way of thinking about it.
Incidentally, a good hardware driver design performs the minimum possible work in the hardware interrupt itself, and then splits off the follow-up work so that the hardware IRQ is scheduled separately from the software IRQ it produces. A contrived example is a network device: the hardware must be accessed via physical addresses for reading or writing, but if a checksum is computed on the result, that checksum can live in a separate software half of the driver. Access to the hardware is then decoupled from checksumming, which shortens the time the hardware spends in an atomic access state (versus taking longer if the same IRQ also runs the checksum). That software half, the checksum, can then be migrated to another core via ksoftirqd scheduling.
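A very rough sketch of that split, as illustration only: the device, IRQ number, and mydev_* names are made up, and it assumes the pre-5.9 tasklet API. The hard IRQ handler does only the minimum and defers, while the checksum work runs later in softirq context, where ksoftirqd can pick it up on another core.

```c
/* Illustrative kernel-module sketch of a hard-IRQ top half that defers
 * checksum work to a tasklet (softirq context). Pre-5.9 tasklet API. */
#include <linux/module.h>
#include <linux/interrupt.h>

#define MYDEV_IRQ 200   /* hypothetical IRQ line; a real driver gets this from its device */

static void mydev_checksum_bh(unsigned long data)
{
    /* Bottom half: runs in softirq context (serviced by ksoftirqd when
     * deferred), so the checksum no longer extends the hard-IRQ window. */
    /* ... walk the received buffer and verify its checksum ... */
}

static DECLARE_TASKLET(mydev_tasklet, mydev_checksum_bh, 0);

static irqreturn_t mydev_hard_irq(int irq, void *dev_id)
{
    /* Top half: touch the hardware just enough to acknowledge the
     * interrupt and pull the data out, then hand off the rest. */
    tasklet_schedule(&mydev_tasklet);
    return IRQ_HANDLED;
}

static int __init mydev_init(void)
{
    return request_irq(MYDEV_IRQ, mydev_hard_irq, 0, "mydev_sketch", NULL);
}

static void __exit mydev_exit(void)
{
    free_irq(MYDEV_IRQ, NULL);
    tasklet_kill(&mydev_tasklet);
}

module_init(mydev_init);
module_exit(mydev_exit);
MODULE_LICENSE("GPL");
```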