Jetson Orin NX iperf3 zero-copy

Hello,
I have a Jetson Orin NX Developer Kit running kernel 5.10.120-tegra, with a 25GbE NVIDIA ConnectX-4 Lx (CX4-LX) NIC on the NX. I have run a couple of iperf3 TCP tests: buffered and zero-copy (the -Z option).

I am seeing odd results with iperf3’s -Z option on the NX. Below are the observed results.

Buffered

iperf3 -c <ip> -t 10 -i 5 -P 1 -f M


[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  28.7 GBytes  2938 MBytes/sec    0            sender
[  5]   0.00-10.00  sec  28.7 GBytes  2938 MBytes/sec                 receiver

Zero-copy

iperf3 -c <ip> -t 10 -i 5 -P 1 -f M -Z


[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  13.4 GBytes  1374 MBytes/sec    0            sender
[  5]   0.00-10.00  sec  13.4 GBytes  1374 MBytes/sec                 receiver

The -Z option makes iperf3 use sendfile(), so splice/sendfile paths appear in the kernel; this is not observed with a buffered send. As a sanity check I ran iperf3 from an x86_64 host to the NX (NX receiving), which performed well at ~2828 MBytes/sec.
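For anyone reproducing this, a quick way to confirm which syscalls iperf3 issues in each mode is a syscall-count trace; a minimal sketch (the <ip> placeholder stands in for the actual server address):

# with -Z the summary should be dominated by sendfile(); without -Z, by write()/sendto()
strace -f -c -e trace=write,sendto,sendfile,splice iperf3 -c <ip> -t 5 -Z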

I have uploaded CPU flamegraphs covering 10-second windows while the NX is sending in buffered and in zero-copy mode. My reading of the zero-copy flamegraph is that the CPU is oddly spending a lot of time in _raw_spin_unlock_irqrestore, reached via arm_smmu_tlb_sync_context.

iperf3-nx-tcp-buffered-send.pdf (656.3 KB)

iperf3-nx-zerocopy-tcp-send.pdf (611.4 KB)
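For reference, flamegraphs like these can be produced with perf plus Brendan Gregg's FlameGraph scripts; a sketch, assuming the FlameGraph repository is cloned to ~/FlameGraph:

# sample all CPUs with call stacks for the 10-second send window
sudo perf record -a -g -F 99 -- sleep 10
sudo perf script | ~/FlameGraph/stackcollapse-perf.pl | ~/FlameGraph/flamegraph.pl > iperf3-nx-send.svg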

I wanted to see if anybody has experienced this behavior, and what approach was used to investigate/resolve it.

Thanks,
vangogh


Hi,
We run the following command for profiling:

iperf3 -c <ip> -b 0 -l 16K -t 120 -i 1

We do not set -Z. Please try this command and see if you can achieve the target performance. Also, please execute sudo nvpmodel -m 0 and sudo jetson_clocks before profiling.
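For clarity, a full sequence could look like the following (<ip> is the iperf3 server address; the mode name reported by nvpmodel varies by module):

sudo nvpmodel -m 0    # select the maximum-performance power model
sudo jetson_clocks    # lock clocks at maximum
sudo nvpmodel -q      # verify the active power mode
iperf3 -c <ip> -b 0 -l 16K -t 120 -i 1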


Hi DaneLLL,
Thank you for your response.

In my NX Developer Kit test configuration, nvpmodel -m 0 and jetson_clocks have already been applied. I am able to get ~2717 MBytes/sec for 16K buffered TCP sends, which looks okay.

Under the same configuration, I am observing low rates when the Jetson Developer Kit operates as an NFS server and performs a TCP transmit in response to a remote NFS client issuing a READ. To remove storage as a potential cause, the NFS server exports a ramdisk. I have NFS 4.2 at both ends; similar results were observed with NFS 3.

NFS server

mount -t tmpfs -o size=12G none /mnt/nfs
fallocate -l 12G /mnt/nfs/testfile
systemctl restart nfs-kernel-server
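For completeness, the ramdisk also needs an entry in /etc/exports before the restart; a minimal sketch (the client subnet below is a hypothetical example):

# /etc/exports
/mnt/nfs 192.168.1.0/24(rw,sync,no_subtree_check)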

I have an NFS client on x86_64 that mounts the NFS export as:
mount -t nfs -o nconnect=8 <ip>:/mnt/nfs /mnt/nv_nfs
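One way to confirm the negotiated version and connection count is to read back the live mount options (nconnect shows up in the option string only on kernels that support it):

# the options column should include vers=4.2 and nconnect=8
grep nv_nfs /proc/mounts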

The NFS client uses “fio” to perform a READ of a testfile that is in the server’s exported file-system.

NFS READ command
fio --name=fio_test --filename=/mnt/nv_nfs/testfile --rw=read --direct=1 --size=100% --bs=1M --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 --time_based --group_reporting

fio_test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=64

fio-3.16
Starting 8 processes
Jobs: 8 (f=8): [R(8)][100.0%][r=1604MiB/s][r=1603 IOPS][eta 00m:00s]
fio_test: (groupid=0, jobs=8): err= 0: pid=14423: Fri Mar 15 08:47:54 2024
read: IOPS=1694, BW=1695MiB/s (1777MB/s)(99.9GiB/60324msec)
slat (usec): min=44, max=12465, avg=110.93, stdev=106.21
clat (msec): min=30, max=683, avg=301.74, stdev=27.47
lat (msec): min=30, max=683, avg=301.85, stdev=27.44
clat percentiles (msec):
| 1.00th=[ 268], 5.00th=[ 275], 10.00th=[ 279], 20.00th=[ 288],
| 30.00th=[ 292], 40.00th=[ 296], 50.00th=[ 300], 60.00th=[ 305],
| 70.00th=[ 309], 80.00th=[ 317], 90.00th=[ 326], 95.00th=[ 338],
| 99.00th=[ 376], 99.50th=[ 405], 99.90th=[ 584], 99.95th=[ 617],
| 99.99th=[ 659]
bw ( MiB/s): min= 1540, max= 1810, per=100.00%, avg=1695.39, stdev= 6.25, samples=960
iops : min= 1540, max= 1810, avg=1695.13, stdev= 6.25, samples=960
lat (msec) : 50=0.08%, 100=0.10%, 250=0.27%, 500=99.32%, 750=0.24%
cpu : usr=0.23%, sys=2.48%, ctx=104256, majf=0, minf=131163
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.5%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=102247,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=1695MiB/s (1777MB/s), 1695MiB/s-1695MiB/s (1777MB/s-1777MB/s), io=99.9GiB (107GB), run=60324-60324msec

Here the nconnect=8 mount option provides a boost of ~400 MBytes/sec. Without nconnect, NFS uses a single socket and the observed rate is ~1400 MBytes/sec. Oddly, this rate is very similar to the TCP transmit rate observed with iperf3 -Z. At this point a plausible explanation appears to be that the NFS server is taking a kernel transmit path similar to the one exercised by the iperf3 -Z (zero-copy) test, and is therefore hitting a similar limitation on the Orin NX Developer Kit platform.
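One way to test this hypothesis would be to count hits on the kernel's page-based TCP transmit path on the server while the NFS READ runs; a sketch using perf kprobes (assumes kprobes are enabled and that tcp_sendpage exists in this 5.10 kernel, which it should):

# on the NX (NFS server): tcp_sendpage is hit by sendfile()/splice() and by
# in-kernel callers such as sunrpc
sudo perf probe --add tcp_sendpage
sudo perf stat -a -e probe:tcp_sendpage -- sleep 10   # run while the client fio READ is active
sudo perf probe --del tcp_sendpage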

I wanted to see if anybody is observing similar behavior or has any thoughts regarding the NFS test, and whether there is a mechanism to alleviate it.

Thank you.