Installed HighPoint SSD7505 PCIe 4.0 x16 on Xavier AGX, getting less than x8-lane performance

Hello Nvidia Developer,

Please check the attached report below. Do you have any suggestions or thoughts on why performance is limited to under 8000 MB/s, while the AMD platform can reach over 20 GB/s?

Performance Report Comparing Polling vs Non-Polling Result

Thanks.
Highpoint Technologies, Inc.

Hi,

  1. Tegra PCIe capability is Gen4, x8, with a 256B max payload size. Max achievable bandwidth = 95% of theoretical bandwidth = 14086 MB/s. The AMD platform probably supports Gen4, x16, with a 2048B max payload size.
  2. Polling mode should definitely have improved performance. Please refer to the link below, which explains why polling mode helps. Check whether the kernel is updated using the “uname -a” command. If your application uses preadv2/pwritev2 with the RWF_HIPRI flag, then you don’t need to change the kernel to enable polling.
    https://events.static.linuxfound.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdf
  3. Replicate the same capabilities on AMD, measure the performance, and see if it matches: reduce the link width to x8 and set the max payload size to 256B. You can do this with the setpci tool in Ubuntu. Refer to the PCIe spec to know which registers to program to change these parameters.
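As a sketch of the setpci half of item 3: per the PCIe spec, the MPS field sits in bits 7:5 of the Device Control register, at offset 8 of the PCI Express capability. The BDF and sample readout below are placeholders, not values from this thread:

```shell
# Cap Max Payload Size at 256B via a read-modify-write of DevCtl.
# 0x2830 is a placeholder; read the real value with:
#   setpci -s 0000:01:00.0 CAP_EXP+8.w
OLD=0x2830
MPS_CODE=1                                  # payload = 128B << code, so 1 -> 256B
NEW=$(( (OLD & ~0xE0) | (MPS_CODE << 5) )) # MPS field is DevCtl bits 7:5
printf 'write back: setpci -s 0000:01:00.0 CAP_EXP+8.w=%04x\n' "$NEW"
```

Reducing the link width itself is less uniform across root ports, so only the MPS step is shown here.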

Thanks,
Manikanta

Hello,
I tested the performance of the Seagate FireCuda 520 and Samsung 980 PRO on the AMD motherboard at x8 and x16 link widths, and compared it with the performance on AGX Xavier. The tool used is fio; the script was modified with --ioengine=pvsync2 --hipri to use preadv2/pwritev2 calls with the RWF_HIPRI flag.
Here are the test results.
NVIDIA Jetson AGX Xavier:
(4xNVMe SSD Parallel Read & Write Performance)
Seagate FireCuda 520
2m-seq-read 9379MB/s
2m-seq-write 10.6GB/s
Samsung 980 PRO
2m-seq-read 9409MB/s
2m-seq-write 11.8GB/s
AMD Motherboard (X570 AORUS MASTER)
(4xNVMe SSD Parallel Read & Write Performance)
x8
Seagate FireCuda 520
2m-seq-read 11.1GB/s
2m-seq-write 12.2GB/s
Samsung 980 PRO
2m-seq-read 14.2GB/s
2m-seq-write 13.1GB/s

x16
Seagate FireCuda 520
2m-seq-read (pvsync2) 6537MB/s
2m-seq-read (libaio) 19.9GB/s
2m-seq-write 15.7GB/s
Samsung 980 PRO
2m-seq-read 23.3GB/s
2m-seq-write 15.6GB/s

After modifying the fio test script to enable polling mode, performance improved compared with the previous script, but the maximum still cannot be reached. Compared with the AMD motherboard results, the read and write performance on the NVIDIA Jetson AGX Xavier is still somewhat lower even with polling mode enabled.

Hi,

  1. What MaxPayload & MaxReadReq values are set in DevCtl on AMD? Are they higher than on Jetson AGX? If so, this could be one of the reasons, because the “data payload”/“total TLP size” ratio will be higher on AMD. We cannot do much about it because the Jetson AGX capability is lower than the NVMe’s.
  2. Jetson AGX doesn’t support per-CPU MSIs. All MSIs are delivered through a single wired interrupt, so they are all scheduled on the same CPU (CPU0 by default). You can compare per-CPU MSI counts between Jetson AGX and AMD by dumping “cat /proc/interrupts”. This is also a limitation on the Jetson AGX side.

We verified the Jetson AGX Gen4, x8 PCIe bus bandwidth capability and achieved ~14 GB/s with the internal PCIe DMA engine, so I don’t doubt the Jetson AGX PCIe bus capability. The two limitations above might be causing the ~15% drop in performance compared with AMD.
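As a rough cross-check of the ~14 GB/s figure (assumptions: 128b/130b line coding and the flat ~95% protocol-efficiency factor quoted earlier in the thread; the exact 14086 MB/s presumably also folds in per-TLP header overhead, which this estimate ignores):

```shell
# Gen4 x8 bandwidth ceiling, back-of-envelope, in integer MB/s.
raw=$(( 16000 * 8 / 8 ))        # 16 GT/s per lane * 8 lanes, bits -> bytes
coded=$(( raw * 128 / 130 ))    # 128b/130b encoding overhead
usable=$(( coded * 95 / 100 ))  # ~95% after DLLP/TLP protocol overhead
echo "~${usable} MB/s"
```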

If you want to confirm this, you can replicate the same two scenarios on AMD and verify the performance: use setpci to reduce MPS & MRRS on AMD, and use the Linux sysfs path to set the affinity of all MSIs to CPU0.
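A minimal sketch of the affinity step (assumptions: the relevant IRQ lines in /proc/interrupts contain “nvme” in their name, and the actual affinity write must run as root; the sample line is made up for illustration):

```shell
# Parse an IRQ number from a /proc/interrupts-style line.
sample='  45:  1000  0  0  0  PCI-MSI  nvme0q0'   # placeholder line
irq=$(awk '/nvme/ {sub(":","",$1); print $1; exit}' <<< "$sample")
echo "$irq"
# For the real thing, as root, pin every nvme IRQ to CPU0 (mask 0x1):
#   for i in $(awk '/nvme/ {sub(":","",$1); print $1}' /proc/interrupts); do
#       echo 1 > /proc/irq/$i/smp_affinity
#   done
```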

Thanks,
Manikanta

Hi,

One more thing you can try is disabling the IOMMU. By default the IOMMU is enabled on Jetson AGX; you can disable it and check whether there is a performance improvement. Apply the patches below, then flash the kernel and device tree blob.

--- a/kernel-dts/tegra194-soc/tegra194-soc-pcie.dtsi
+++ b/kernel-dts/tegra194-soc/tegra194-soc-pcie.dtsi
@@ -867,13 +867,6 @@
 	pinctrl-0 = <&pex_rst_c5_out_state>;
 	pinctrl-1 = <&clkreq_c5_bi_dir_state>;
-	iommus = <&smmu TEGRA_SID_PCIE5>;
-	dma-coherent;
-#if LINUX_VERSION >= 414
-	iommu-map = <0x0 &smmu TEGRA_SID_PCIE5 0x1000>;
-	iommu-map-mask = <0x0>;
-#endif
-
 	#interrupt-cells = <1>;
 	interrupt-map-mask = <0 0 0 0>;
 	interrupt-map = <0 0 0 0 &intc 0 53 0x04>;

--- a/drivers/iommu/arm-smmu-t19x.c
+++ b/drivers/iommu/arm-smmu-t19x.c
@@ -2535,7 +2535,10 @@ static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
 	reg = readl_relaxed(ARM_SMMU_GR0_NS(smmu) + ARM_SMMU_GR0_sCR0);

 	/* Enable fault reporting */
-	reg |= (sCR0_GFRE | sCR0_GFIE | sCR0_GCFGFRE | sCR0_GCFGFIE | sCR0_USFCFG);
+	reg |= (sCR0_GFRE | sCR0_GFIE | sCR0_GCFGFRE | sCR0_GCFGFIE);
+
+	/* Disable Unidentified stream fault reporting */
+	reg &= ~(sCR0_USFCFG);

 	/* Disable TLB broadcasting. */
 	reg |= (sCR0_VMIDPNE | sCR0_PTM);

Thanks,
Manikanta

Hi,
The MaxPayload & MaxReadReq values set in DevCtl on AMD are the same as on Jetson AGX:
MPS: 256 bytes, MRRS: 512 bytes.
I also modified the fio test script on the AMD motherboard to pin the program to CPU0 for the performance test:
taskset -c 0 fio --filename=/mnt/test/test.bin --filename=/mnt/test1/test1.bin --filename=/mnt/test2/test2.bin --filename=/mnt/test3/test3.bin --direct=1 --rw=read --ioengine=pvsync2 --hipri --bs=2m --iodepth=64 --size=10G --numjobs=6 --runtime=60 --time_based --group_reporting --name=test-seq-read
and the result is
Samsung 980 PRO (4xNVMe SSD Parallel Read & Write Performance) Width X8
2m-seq-read: 13.8GB/s
2m-seq-write: 13.4 GB/s
basically no change.
Is there any other way to disable the iommu without applying the patches?

Thanks
Highpoint Technologies, Inc.

Hi,

No, you need to apply these patches to disable IOMMU.

Thanks,
Manikanta

Hi,

I tried to patch and flash the kernel, but after rebooting I checked the IOMMU log messages and the IOMMU still started normally, so I don’t know whether it has been disabled. Then I tested the performance; the results were not as good as expected and were still similar to the previous results.

NVIDIA Jetson AGX Xavier:

Samsung 980 PRO (4xNVMe SSD Parallel Read & Write Performance)

2m-seq-read 9370MB/s

2m-seq-write 11.0GB/s
iommu.txt (6.8 KB)

Thanks.
Highpoint Technologies, Inc.

Yes, the IOMMU is enabled for all PCIe controllers. I doubt whether the DT and kernel have actually been updated.
Check the following:

  1. ls -l /proc/device-tree//iommus — this node should not be present.
  2. uname -a — should show the latest kernel build info.

Thanks,
Manikanta

Hi,

These are the displayed information.
devicetree.txt (10.6 KB)
uname.txt (111 Bytes)

After I applied the patch, I executed the nvbuild.sh script, generated the Image file, and then replaced the Image file in the system with it. Is this the correct way to flash the kernel?

Thanks.
Highpoint Technologies, Inc.

Based on the kernel timestamp, it looks like the kernel image is updated.
Linux test-desktop 4.9.201-tegra #2 SMP PREEMPT Tue Apr 27 14:43:00 CST 2021 aarch64 aarch64 aarch64 GNU/Linux

Can you dump the following information?
ls -l /proc/device-tree/pcie@14180000/
ls -l /proc/device-tree/pcie@14100000/
ls -l /proc/device-tree/pcie@14140000/
ls -l /proc/device-tree/pcie@141a0000/

pcie14180000.txt (2.6 KB)
pcie14100000.txt (2.8 KB)

pcie14140000.txt (2.7 KB)
pcie141a0000.txt (3.0 KB)

The pcie@141a0000 node still has iommus; the DTB is not updated. Apply the following patch and flash the DTB.

--- a/kernel-dts/tegra194-soc/tegra194-soc-pcie.dtsi
+++ b/kernel-dts/tegra194-soc/tegra194-soc-pcie.dtsi
@@ -867,13 +867,6 @@
 	pinctrl-0 = <&pex_rst_c5_out_state>;
 	pinctrl-1 = <&clkreq_c5_bi_dir_state>;
-	iommus = <&smmu TEGRA_SID_PCIE5>;
-	dma-coherent;
-#if LINUX_VERSION >= 414
-	iommu-map = <0x0 &smmu TEGRA_SID_PCIE5 0x1000>;
-	iommu-map-mask = <0x0>;
-#endif

Hi,

I checked the file that needs to be patched (tegra194-soc-pcie.dtsi), and the content the patch modifies had already been changed.
After I recompile the kernel, besides the Image file that needs to be replaced, are there other files that need to be replaced?

Thanks
Highpoint Technologies, Inc.

hello mhwang1,

please also refer to the developer guide, Flashing a Specific Partition.
you’re able to flash a specific partition, instead of flashing the whole device, by using the command-line switch -k.
note,
it’s a CBoot feature to include a default boot scan sequence.
CBoot looks for an extlinux.conf configuration file to load binaries; by default, the kernel binary comes from the LINUX entry, and the device tree blob loads from the kernel-dtb partition.
that is to say,
you may update /boot/Image to update the kernel image, but you’ll need to perform ./flash.sh -k kernel-dtb to update the kernel-dtb partition.
thanks
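The two update paths above can be summarized as commands (the board name and output path are typical for an AGX Xavier L4T release tree and are assumptions; this sketch only prints the commands rather than running them):

```shell
# Kernel image: a plain file copy into the rootfs is enough.
CMD1='sudo cp arch/arm64/boot/Image /boot/Image'
# DTB: lives in its own partition, so reflash just that partition
# with the board in recovery mode.
CMD2='sudo ./flash.sh -k kernel-dtb jetson-xavier mmcblk0p1'
echo "$CMD1"
echo "$CMD2"
```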