Installed HighPoint SSD7505 PCIe 4.0 x16 on Xavier AGX, getting less than x8-lane performance

Hello Nvidia Developer,

Please check the attached report below. Do you have any suggestions or thoughts on why performance is limited to under 8000 MB/s, while the AMD platform can reach over 20 GB/s?

Performance Report Comparing Polling vs Non-Polling Result

Thanks.
Highpoint Technologies, Inc.

Hi,

  1. Tegra PCIe capability is Gen4, x8, with a 256B max payload size. Max achievable bandwidth = 95% of theoretical bandwidth = 14086 MB/s. The AMD platform probably supports Gen4, x16, with a 2048B max payload size.
  2. Polling mode should definitely have improved performance. Please refer to the link below, which explains why polling mode helps. Check whether the kernel is updated using the “uname -a” command. If your application uses preadv2/pwritev2 with the RWF_HIPRI flag, then you don’t need to change the kernel to enable polling.
    https://events.static.linuxfound.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdf
  3. Replicate the same capabilities on AMD, measure the performance, and see if it matches: reduce the link width to x8 and set the max payload size to 256B. You can do this with the setpci tool in Ubuntu. Refer to the PCIe spec to know which registers to program to change these parameters.
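As a sketch of the setpci half of item 3: per the PCIe spec, the MPS field sits in bits 7:5 of the Device Control register, at offset 8 of the PCI Express capability. The BDF and sample readout below are placeholders, not values from this thread:

```shell
# Cap Max Payload Size at 256B via a read-modify-write of DevCtl.
# 0x2830 is a placeholder; read the real value with:
#   setpci -s 0000:01:00.0 CAP_EXP+8.w
OLD=0x2830
MPS_CODE=1                                  # payload = 128B << code, so 1 -> 256B
NEW=$(( (OLD & ~0xE0) | (MPS_CODE << 5) )) # MPS field is DevCtl bits 7:5
printf 'write back: setpci -s 0000:01:00.0 CAP_EXP+8.w=%04x\n' "$NEW"
```

Reducing the link width itself is less uniform across root ports, so only the MPS step is shown here.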

Thanks,
Manikanta

Hello,
I tested the performance of the Seagate FireCuda 520 and Samsung 980 PRO on the AMD motherboard at x8 and x16 link widths, and compared it with the performance on AGX Xavier. The tool used is fio; the script was modified with --ioengine=pvsync2 --hipri to use preadv2/pwritev2 calls with the RWF_HIPRI flag.
Here are the test results.
NVIDIA Jetson AGX Xavier:
(4xNVMe SSD Parallel Read & Write Performance)
Seagate FireCuda 520
2m-seq-read 9379MB/s
2m-seq-write 10.6GB/s
Samsung 980 PRO
2m-seq-read 9409MB/s
2m-seq-write 11.8GB/s
AMD Motherboard (X570 AORUS MASTER)
(4xNVMe SSD Parallel Read & Write Performance)
x8
Seagate FireCuda 520
2m-seq-read 11.1GB/s
2m-seq-write 12.2GB/s
Samsung 980 PRO
2m-seq-read 14.2GB/s
2m-seq-write 13.1GB/s

x16
Seagate FireCuda 520
2m-seq-read (pvsync2) 6537MB/s
2m-seq-read (libaio) 19.9GB/s
2m-seq-write 15.7GB/s
Samsung 980 PRO
2m-seq-read 23.3GB/s
2m-seq-write 15.6GB/s

After modifying the fio test script to enable polling mode, performance improved compared with the previous script, but the maximum still cannot be reached. Compared with the AMD motherboard results, the read and write performance on the NVIDIA Jetson AGX Xavier is still somewhat lower even with polling mode enabled.

Hi,

  1. What MaxPayload & MaxReadReq values are set in DevCtl on AMD? Are they higher than on Jetson AGX? If so, this could be one of the reasons, because the “data payload”/“total TLP size” ratio will be higher on AMD. We cannot do much about it because the Jetson AGX capability is lower than the NVMe’s.
  2. Jetson AGX doesn’t support per-CPU MSIs. All MSIs are delivered through a single wired interrupt, so they are all scheduled on the same CPU (CPU0 by default). You can compare per-CPU MSI counts between Jetson AGX and AMD by dumping “cat /proc/interrupts”. This is also a limitation on the Jetson AGX side.

We verified the Jetson AGX Gen4, x8 PCIe bus bandwidth capability and achieved ~14 GB/s with the internal PCIe DMA engine, so I don’t doubt the Jetson AGX PCIe bus capability. The two limitations above might be causing the ~15% drop in performance compared with AMD.
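As a rough cross-check of the ~14 GB/s figure (assumptions: 128b/130b line coding and the flat ~95% protocol-efficiency factor quoted earlier in the thread; the exact 14086 MB/s presumably also folds in per-TLP header overhead, which this estimate ignores):

```shell
# Gen4 x8 bandwidth ceiling, back-of-envelope, in integer MB/s.
raw=$(( 16000 * 8 / 8 ))        # 16 GT/s per lane * 8 lanes, bits -> bytes
coded=$(( raw * 128 / 130 ))    # 128b/130b encoding overhead
usable=$(( coded * 95 / 100 ))  # ~95% after DLLP/TLP protocol overhead
echo "~${usable} MB/s"
```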

If you want to confirm this, you can replicate the same two scenarios on AMD and verify the performance: use setpci to reduce MPS & MRRS on AMD, and use the Linux sysfs path to set the affinity of all MSIs to CPU0.
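A minimal sketch of the affinity step (assumptions: the relevant IRQ lines in /proc/interrupts contain “nvme” in their name, and the actual affinity write must run as root; the sample line is made up for illustration):

```shell
# Parse an IRQ number from a /proc/interrupts-style line.
sample='  45:  1000  0  0  0  PCI-MSI  nvme0q0'   # placeholder line
irq=$(awk '/nvme/ {sub(":","",$1); print $1; exit}' <<< "$sample")
echo "$irq"
# For the real thing, as root, pin every nvme IRQ to CPU0 (mask 0x1):
#   for i in $(awk '/nvme/ {sub(":","",$1); print $1}' /proc/interrupts); do
#       echo 1 > /proc/irq/$i/smp_affinity
#   done
```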

Thanks,
Manikanta

Hi,

One more thing you can try is disabling the IOMMU. By default the IOMMU is enabled on Jetson AGX; you can disable it and check whether there is a performance improvement. Apply the patches below, then flash the kernel and device tree blob.

--- a/kernel-dts/tegra194-soc/tegra194-soc-pcie.dtsi
+++ b/kernel-dts/tegra194-soc/tegra194-soc-pcie.dtsi
@@ -867,13 +867,6 @@
 	pinctrl-0 = <&pex_rst_c5_out_state>;
 	pinctrl-1 = <&clkreq_c5_bi_dir_state>;
-	iommus = <&smmu TEGRA_SID_PCIE5>;
-	dma-coherent;
-#if LINUX_VERSION >= 414
-	iommu-map = <0x0 &smmu TEGRA_SID_PCIE5 0x1000>;
-	iommu-map-mask = <0x0>;
-#endif
-
 	#interrupt-cells = <1>;
 	interrupt-map-mask = <0 0 0 0>;
 	interrupt-map = <0 0 0 0 &intc 0 53 0x04>;

--- a/drivers/iommu/arm-smmu-t19x.c
+++ b/drivers/iommu/arm-smmu-t19x.c
@@ -2535,7 +2535,10 @@ static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
 	reg = readl_relaxed(ARM_SMMU_GR0_NS(smmu) + ARM_SMMU_GR0_sCR0);

 	/* Enable fault reporting */
-	reg |= (sCR0_GFRE | sCR0_GFIE | sCR0_GCFGFRE | sCR0_GCFGFIE | sCR0_USFCFG);
+	reg |= (sCR0_GFRE | sCR0_GFIE | sCR0_GCFGFRE | sCR0_GCFGFIE);
+
+	/* Disable Unidentified stream fault reporting */
+	reg &= ~(sCR0_USFCFG);

 	/* Disable TLB broadcasting. */
 	reg |= (sCR0_VMIDPNE | sCR0_PTM);

Thanks,
Manikanta

Hi,
The MaxPayload & MaxReadReq values set in DevCtl on AMD are the same as on Jetson AGX:
MPS: 256 bytes, MRRS: 512 bytes.
I also modified the fio test script on the AMD motherboard to pin the program to CPU0 for the performance test:
taskset -c 0 fio --filename=/mnt/test/test.bin --filename=/mnt/test1/test1.bin --filename=/mnt/test2/test2.bin --filename=/mnt/test3/test3.bin --direct=1 --rw=read --ioengine=pvsync2 --hipri --bs=2m --iodepth=64 --size=10G --numjobs=6 --runtime=60 --time_based --group_reporting --name=test-seq-read
and the result is
Samsung 980 PRO (4xNVMe SSD Parallel Read & Write Performance) Width X8
2m-seq-read: 13.8GB/s
2m-seq-write: 13.4 GB/s
basically no change.
Is there any other way to disable the iommu without applying the patches?

Thanks
Highpoint Technologies, Inc.

Hi,

No, you need to apply these patches to disable IOMMU.

Thanks,
Manikanta

Hi,

I tried to patch and flash the kernel, but after rebooting I checked the IOMMU log messages and the IOMMU still started normally, so I don’t know whether it has been disabled. Then I tested the performance; the results were not as good as expected and were still similar to the previous results.

NVIDIA Jetson AGX Xavier:

Samsung 980 PRO (4xNVMe SSD Parallel Read & Write Performance)

2m-seq-read 9370MB/s

2m-seq-write 11.0GB/s
iommu.txt (6.8 KB)

Thanks.
Highpoint Technologies, Inc.

Yes, the IOMMU is enabled for all PCIe controllers. I doubt whether the DT and kernel have actually been updated.
Check the following:

  1. ls -l /proc/device-tree//iommus — this node should not be present.
  2. uname -a — should show the latest kernel build info.

Thanks,
Manikanta

Hi,

These are the displayed information.
devicetree.txt (10.6 KB)
uname.txt (111 Bytes)

After I applied the patch, I executed the nvbuild.sh script, generated the Image file, and then replaced the Image file in the system with it. Is this the correct way to flash the kernel?

Thanks.
Highpoint Technologies, Inc.

Based on the kernel timestamp, it looks like the kernel image is updated.
Linux test-desktop 4.9.201-tegra #2 SMP PREEMPT Tue Apr 27 14:43:00 CST 2021 aarch64 aarch64 aarch64 GNU/Linux

Can you dump the following information?
ls -l /proc/device-tree/pcie@14180000/
ls -l /proc/device-tree/pcie@14100000/
ls -l /proc/device-tree/pcie@14140000/
ls -l /proc/device-tree/pcie@141a0000/

pcie14180000.txt (2.6 KB)
pcie14100000.txt (2.8 KB)

pcie14140000.txt (2.7 KB)
pcie141a0000.txt (3.0 KB)

The pcie@141a0000 node still has iommus; the DTB is not updated. Apply the following patch and flash the DTB.

--- a/kernel-dts/tegra194-soc/tegra194-soc-pcie.dtsi
+++ b/kernel-dts/tegra194-soc/tegra194-soc-pcie.dtsi
@@ -867,13 +867,6 @@
 	pinctrl-0 = <&pex_rst_c5_out_state>;
 	pinctrl-1 = <&clkreq_c5_bi_dir_state>;
-	iommus = <&smmu TEGRA_SID_PCIE5>;
-	dma-coherent;
-#if LINUX_VERSION >= 414
-	iommu-map = <0x0 &smmu TEGRA_SID_PCIE5 0x1000>;
-	iommu-map-mask = <0x0>;
-#endif

Hi,

I checked the file that needs to be patched (tegra194-soc-pcie.dtsi), and the content the patch modifies had already been changed.
After I recompile the kernel, besides the Image file that needs to be replaced, are there other files that need to be replaced?

Thanks
Highpoint Technologies, Inc.

hello mhwang1,

please also refer to the developer guide, Flashing a Specific Partition.
you’re able to flash a specific partition, instead of flashing the whole device, by using the command-line switch -k.
note,
it’s a CBoot feature to include a default boot scan sequence.
CBoot looks for an extlinux.conf configuration file to load binaries; by default, the kernel binary comes from the LINUX entry, and the device tree blob loads from the kernel-dtb partition.
that is to say,
you may update /boot/Image to update the kernel image, but you’ll need to perform ./flash.sh -k kernel-dtb to update the kernel-dtb partition.
thanks
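The two update paths above can be summarized as commands (the board name and output path are typical for an AGX Xavier L4T release tree and are assumptions; this sketch only prints the commands rather than running them):

```shell
# Kernel image: a plain file copy into the rootfs is enough.
CMD1='sudo cp arch/arm64/boot/Image /boot/Image'
# DTB: lives in its own partition, so reflash just that partition
# with the board in recovery mode.
CMD2='sudo ./flash.sh -k kernel-dtb jetson-xavier mmcblk0p1'
echo "$CMD1"
echo "$CMD2"
```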