PCIe read/write performance of a Xilinx FPGA card on Orin AGX is slower than on x86_64

I used an xdma FPGA performance test program.
I tested it on Orin AGX, and the speed was only half of what I get on an x86_64 PC.
The xdma module parameters are identical on both machines, and nothing unusual shows up in dmesg.
Any thoughts on why the difference is so big?
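
For reference, one way to double-check that the xdma module parameters really match on both machines is to dump them from sysfs (assuming the driver is loaded under its default name xdma):

for p in /sys/module/xdma/parameters/*; do echo "$(basename $p)=$(cat $p)"; done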

Same program, same xdma driver version (2020.2). On the x86_64 PC (12 cores, 16 GB RAM):

FPGA PCIe LnkSta: Speed 5GT/s (ok), Width x8 (ok)
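
A check along these lines shows the link status plus the payload settings that also affect throughput (10ee is the Xilinx vendor ID; the exact fields reported depend on the lspci version):

sudo lspci -vv -d 10ee: | grep -E 'LnkCap|LnkSta|MaxPayload|MaxReadReq'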

./st_speed -d /dev/xdma0_c2h_0 -r -n 0x80
pktnum: 128
recv speed: 2723.40 MB/s 8000000 Byte
recv speed: 2723.40 MB/s 10000000 Byte
recv speed: 2723.40 MB/s 18000000 Byte

./st_speed -d /dev/xdma0_h2c_0 -w -n 0x80
pktnum: 128
send speed: 2723.40 MB/s 8000000 Byte
send speed: 2782.61 MB/s 10000000 Byte
send speed: 2782.61 MB/s 18000000 Byte
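
For context, a rough ceiling for this link (assuming typical protocol overhead):

5 GT/s x 8 lanes x 8/10 (8b/10b encoding) / 8 bits = 4.0 GB/s raw
minus TLP/DLLP and flow-control overhead           = roughly 3.2-3.4 GB/s usable

So ~2.7 GB/s on the x86_64 PC is already fairly close to the practical limit of a Gen2 x8 link.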

But on Orin (12 cores, 64 GB), with the FPGA card in the Orin C5 slot,
the speed is only about half of the x86_64 result.

./st_speed -d /dev/xdma0_c2h_0 -r -n 0x80
pktnum: 128
recv speed: 1376.34 MB/s 8000000 Byte
recv speed: 1391.30 MB/s 10000000 Byte
recv speed: 1376.34 MB/s 18000000 Byte
recv speed: 1391.30 MB/s 20000000 Byte
recv speed: 1422.22 MB/s 28000000 Byte

./st_speed -d /dev/xdma0_h2c_0 -w -n 0x80
pktnum: 128
send speed: 1219.05 MB/s 8000000 Byte
send speed: 1219.05 MB/s 10000000 Byte
send speed: 1219.05 MB/s 18000000 Byte
send speed: 1219.05 MB/s 20000000 Byte
send speed: 1230.77 MB/s 28000000 Byte
send speed: 1230.77 MB/s 30000000 Byte

Is the Orin's memory the bottleneck? But the Jetson Orin AGX uses LPDDR5, which should be faster than the PC's DDR4, shouldn't it?

Can anyone explain why the dd speed on Orin is also only half of that on the x86_64 PC?

On Orin:
dd if=/dev/zero of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.0657541 s, 16.3 GB/s

On the x86_64 PC:
dd if=/dev/zero of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.034232 s, 31.4 GB/s

sysbench --test=memory run # on Orin
Total operations: 40834833 (4082796.18 per second)
39877.77 MiB transferred (3987.11 MiB/sec)

sysbench --test=memory run # on the same x86_64 PC
Total operations: 104857600 (10828779.36 per second)
102400.00 MiB transferred (10574.98 MiB/sec)
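
Note that sysbench memory here runs single-threaded with a 1 KiB block size (see the full output further down), so it mostly reflects single-core write throughput rather than total DRAM bandwidth. A multi-threaded, larger-block run would be more representative, e.g. (sysbench 1.0 option names assumed):

sysbench memory --threads=8 --memory-block-size=1M run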

Hi,
It looks like the deviation comes from CPU capability. You can run sudo tegrastats on Orin and check whether some CPU cores are at maximum load. Also, the latest production release is Jetpack 5.1.3; if you are on an earlier release, please upgrade to the latest version and try again.

I flashed to the latest release (R36), but performance seems to have degraded further.

./st_speed -d /dev/xdma0_c2h_0 -r -n 0x80
pkt num: 128
recv speed: 934.31 MB/s 8000000 Byte
recv speed: 882.76 MB/s 10000000 Byte
recv speed: 859.06 MB/s 18000000 Byte
recv speed: 859.06 MB/s 20000000 Byte
recv speed: 859.06 MB/s 28000000 Byte

nvpmodel -q
NV Power Mode: MAXN
0


tegrastats
03-11-2024 17:28:58 RAM 1939/62841MB (lfb 4x4MB) SWAP 0/31421MB (cached 0MB) CPU [20%@729,6%@729,34%@729,17%@729,0%@729,0%@729,0%@729,0%@729,0%@729,1%@729,0%@729,2%@729] EMC_FREQ 3%@665 GR3D_FREQ 0%@[0,0] NVENC off NVDEC off NVJPG off NVJPG1 off VIC off OFA off NVDLA0 off NVDLA1 off PVA0_FREQ off APE 174 cpu@47.593C tboard@36.875C soc2@44.187C tdiode@38.25C soc0@45.593C tj@47.593C soc1@44.062C VDD_GPU_SOC 2405mW/2405mW VDD_CPU_CV 400mW/400mW VIN_SYS_5V0 4550mW/4550mW VDDQ_VDD2_1V8AO 707mW/707mW
03-11-2024 17:28:59 RAM 1937/62841MB (lfb 4x4MB) SWAP 0/31421MB (cached 0MB) CPU [18%@729,10%@729,40%@729,3%@729,7%@729,0%@729,0%@729,1%@729,0%@729,0%@729,2%@729,0%@729] EMC_FREQ 3%@665 GR3D_FREQ 0%@[0,0] NVENC off NVDEC off NVJPG off NVJPG1 off VIC off OFA off NVDLA0 off NVDLA1 off PVA0_FREQ off APE 174 cpu@47.843C tboard@36.875C soc2@44.093C tdiode@38.375C soc0@45.406C tj@47.843C soc1@43.843C VDD_GPU_SOC 2405mW/2405mW VDD_CPU_CV 400mW/400mW VIN_SYS_5V0 4651mW/4600mW VDDQ_VDD2_1V8AO 808mW/757mW
03-11-2024 17:29:00 RAM 1938/62841MB (lfb 4x4MB) SWAP 0/31421MB (cached 0MB) CPU [17%@729,0%@729,56%@729,0%@729,1%@729,0%@729,0%@729,0%@729,0%@1267,0%@1036,0%@1036,0%@1036] EMC_FREQ 1%@2133 GR3D_FREQ 0%@[0,0] NVENC off NVDEC off NVJPG off NVJPG1 off VIC off OFA off NVDLA0 off NVDLA1 off PVA0_FREQ off APE 174 cpu@47.75C tboard@36.875C soc2@44.312C tdiode@38.5C soc0@45.687C tj@47.75C soc1@44.031C VDD_GPU_SOC 2806mW/2538mW VDD_CPU_CV 400mW/400mW VIN_SYS_5V0 4752mW/4651mW VDDQ_VDD2_1V8AO 808mW/774mW
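
Worth noting: in this capture the CPU cores are sitting at 729 MHz even though MAXN is selected. On Jetson the clocks can usually be pinned to their maximum with the standard JetPack tools:

sudo nvpmodel -m 0   # MAXN (already active here)
sudo jetson_clocks   # lock CPU/GPU/EMC clocks to the maximum of the current mode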

I also compared with another Arm64 server, whose dd speed and sysbench results are roughly the same as Orin's.
But the Arm64 server's xdma performance is about twice that of Orin.

dd if=/dev/zero of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.0967686 s, 11.1 GB/s

sysbench --test=memory run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 43201886 (4319620.61 per second)

42189.34 MiB transferred (4218.38 MiB/sec)


General statistics:
    total time:                          10.0001s
    total number of events:              43201886

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    0.05
         95th percentile:                        0.00
         sum:                                 4351.84

Threads fairness:
    events (avg/stddev):           43201886.0000/0.00
    execution time (avg/stddev):   4.3518/0.00

./st_speed -d /dev/xdma0_c2h_0 -r -n 0x80
pkt num: 128
recv speed: 2206.90 MB/s 8000000 Byte
recv speed: 2206.90 MB/s 10000000 Byte
recv speed: 2206.90 MB/s 18000000 Byte

./st_speed -d /dev/xdma0_h2c_0 -w -n 0x80
pkt num: 128
send speed: 2031.75 MB/s 8000000 Byte
send speed: 2031.75 MB/s 10000000 Byte
send speed: 2031.75 MB/s 18000000 Byte

Jetpack 5.1.3 corresponds to R35.5.0. Compared to R35.4.1, the xdma performance is basically unchanged.

sudo ./st_speed -d /dev/xdma0_c2h_0 -r -n 0x80
pkt num: 128
recv speed: 1391.30 MB/s 8000000 Byte
recv speed: 1391.30 MB/s 10000000 Byte
recv speed: 1391.30 MB/s 18000000 Byte
recv speed: 1391.30 MB/s 20000000 Byte
recv speed: 1391.30 MB/s 28000000 Byte
recv speed: 1391.30 MB/s 30000000 Byte
recv speed: 1406.59 MB/s 38000000 Byte

sudo nvpmodel -q
NV Power Mode: MAXN
0

tegrastats
03-12-2024 15:25:51 RAM 1674/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [13%@2201,54%@2201,2%@2201,30%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C CPU@51.562C Tboard@38C SOC2@47.062C Tdiode@40.75C SOC0@47.718C CV1@-256C GPU@46.156C tj@51.093C SOC1@45.75C CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5359mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:52 RAM 1674/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [10%@2201,0%@2201,30%@2201,59%@2201,0%@2201,0%@2201,0%@2201,0%@1984,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C CPU@50.968C Tboard@38C SOC2@47.093C Tdiode@40.75C SOC0@47.812C CV1@-256C GPU@46.375C tj@50.968C SOC1@45.937C CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5359mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:53 RAM 1674/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [13%@2201,0%@2201,35%@2201,49%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C CPU@51.156C Tboard@38C SOC2@46.968C Tdiode@40.75C SOC0@47.812C CV1@-256C GPU@46.343C tj@51.156C SOC1@45.875C CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5359mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:54 RAM 1676/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [14%@2201,87%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2406,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C CPU@51.531C Tboard@38C SOC2@47.125C Tdiode@40.75C SOC0@47.906C CV1@-256C GPU@46.343C tj@51.531C SOC1@45.718C CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5359mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:55 RAM 1676/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [14%@2201,75%@2201,10%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C CPU@51.531C Tboard@38C SOC2@47.062C Tdiode@40.75C SOC0@47.718C CV1@-256C GPU@46.062C tj@51.531C SOC1@45.718C CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5359mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:56 RAM 1669/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [22%@2026,0%@2201,77%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C CPU@51.093C Tboard@38C SOC2@47.312C Tdiode@40.75C SOC0@47.687C CV1@-256C GPU@46.687C tj@51.093C SOC1@45.75C CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5359mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:57 RAM 1669/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [12%@2201,0%@2201,87%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2356] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C CPU@51.468C Tboard@38C SOC2@47.093C Tdiode@40.75C SOC0@47.812C CV1@-256C GPU@46.687C tj@51.031C SOC1@45.75C CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5359mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:58 RAM 1669/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [15%@2201,0%@2201,86%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C CPU@51.062C Tboard@38C SOC2@47.062C Tdiode@40.75C SOC0@47.843C CV1@-256C GPU@46.593C tj@51.062C SOC1@45.875C CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5460mW/5371mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:25:59 RAM 1669/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [15%@2201,0%@2201,27%@2201,59%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C CPU@51.281C Tboard@38C SOC2@47.156C Tdiode@40.75C SOC0@47.906C CV1@-256C GPU@46.281C tj@51.281C SOC1@45.875C CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5370mW VDDQ_VDD2_1V8AO 1011mW/1011mW
03-12-2024 15:26:00 RAM 1669/62780MB (lfb 15048x4MB) SWAP 0/31390MB (cached 0MB) CPU [15%@2201,40%@2201,0%@2201,45%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201,0%@2201] EMC_FREQ 0%@3199 GR3D_FREQ 0%@[1300,1300] VIC_FREQ 729 APE 174 CV0@-256C CPU@51.593C Tboard@38C SOC2@47.187C Tdiode@40.75C SOC0@48C CV1@-256C GPU@46.25C tj@51.593C SOC1@45.718C CV2@-256C VDD_GPU_SOC 4811mW/4811mW VDD_CPU_CV 2004mW/2004mW VIN_SYS_5V0 5359mW/5369mW VDDQ_VDD2_1V8AO 1011mW/1011mW
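
In both tegrastats captures a single core is pegged near 80-90% while the others are idle, which may mean the per-channel copy path is limited by one CPU thread. One way to check whether core placement matters (core number chosen arbitrarily):

sudo taskset -c 4 ./st_speed -d /dev/xdma0_c2h_0 -r -n 0x80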

Hi,
Please try the settings in these posts and see if they help:
Poor DMA performance over PCIe from FPGA - #4 by WayneWWW
Poor DMA performance over PCIe from FPGA - #6 by WayneWWW

dma-coherent was already there when the test was done.
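
For reference, one way to confirm the property is present in the running device tree (the node path depends on which PCIe controller the slot is wired to):

find /proc/device-tree -name dma-coherent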

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.