PCIe x4 Bandwidth on Nano

Dear Community,
I have a custom carrier board built and am now testing the performance of the PCIe x4 interface. For storage we have an M.2 PCIe socket. When I tried a PCIe x1 SSD module, a 1 GB data dump ran at 113 MB/s. The lspci -vv command reports the Lane Width as x1.

Now when I connect a PCIe x4 SSD, I get essentially the same data dump rate, ~120 MB/s, even though lspci -vv reports the Lane Width as x4 and the Link Speed as 5 GT/s.

Can somebody help me understand why I don’t get a significant data rate improvement here?

You should get around 12 Gbps (i.e. around 1500 MB/s) in a PCIe x4, Gen-2 configuration: Gen-2 runs at 5 GT/s per lane, so with 8b/10b encoding four lanes give 16 Gbps (2 GB/s) of raw bandwidth, and roughly 1.2 to 1.5 GB/s remains after protocol overhead.
Could you please tell us how you are verifying the performance?
Also, what is the advertised bandwidth of the SSD? (I am wondering whether it is AHCI-based, i.e. the SATA protocol, or NVMe-based.)
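
In the meantime, you can double-check the negotiated link width and speed with lspci. The 01:00.0 address below is just a placeholder; use whatever address your SSD shows up at:

sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'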

Hi,

Maybe the data rate limitation is caused by the SSD and not by PCIe? Can you provide the details below:

  1. What is the data rate specification of the SSD?
  2. How are you measuring the SSD performance, is it iozone? What record size are you using? Can you try with a 16M record size? (A sample invocation is sketched below.)
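
For the iozone run, something along these lines should work (the mount point and file name are just examples, and -I requests O_DIRECT if your build supports it):

iozone -I -i 0 -i 1 -r 16m -s 1g -f /mnt/nvme/iozone.tmp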

Thanks,
Manikanta

  1. I am using the ‘dd’ command from the terminal and dumping 1 GB for the x1 lane and 4 GB for the x4 lane.
  2. The advertised bandwidth of the PCIe x4 SSD is 1000 MB/s.
  3. It is an NVMe-based SSD, specifically the Kingston A2000 250GB.

Team,
Is there an update regarding this?

Could you please share the exact ‘dd’ command being used here?
I would use the ‘dd’ command something like the one below:
dd if=/dev/zero of=test bs=16M count=64 oflag=direct conv=fdatasync
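
For the read side, you could read the same file back with O_DIRECT so the page cache does not inflate the number, e.g.:

dd if=test of=/dev/null bs=16M iflag=direct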

dd if=/dev/zero of= bs=1024 count=1073741824 status=progress

The path to the SSD drive was obtained by issuing the lsblk command.

A block size of 1024 seems too small. We typically use a 4K size for random reads/writes and 16M for sequential reads/writes. Could you please use the command I mentioned above and update the result?

Same. Please see attached.

Could you please tell us the JetPack release you are using and also the BSP version?

Also, given that there is an error when writing a file larger than 4 GB, I assume you are using the FAT32 file system. Could you please try formatting the drive with the EXT4 file system instead?
The mkfs.ext4 command can be used.
FWIW, I tried with an Intel 750 NVMe drive and saw a four-fold increase in performance with the EXT4 file system compared to FAT32.
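
For example, assuming the drive enumerates as /dev/nvme0n1 (check with lsblk) and keeping in mind that this erases whatever is on it:

sudo mkfs.ext4 /dev/nvme0n1
sudo mount /dev/nvme0n1 /mnt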

root@tegra-ubuntu:/home/ubuntu/temp# 
root@tegra-ubuntu:/home/ubuntu/temp# dd if=/dev/zero of=test bs=16M count=64 oflag=direct conv=fdatasync
64+0 records in
64+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.28011 s, 839 MB/s
root@tegra-ubuntu:/home/ubuntu/temp# 
root@tegra-ubuntu:/home/ubuntu/temp# cd ..
root@tegra-ubuntu:/home/ubuntu# 
root@tegra-ubuntu:/home/ubuntu# umount temp
root@tegra-ubuntu:/home/ubuntu# 
root@tegra-ubuntu:/home/ubuntu# busybox mkfs.vfat /dev/nvme0n1 

root@tegra-ubuntu:/home/ubuntu# 
[  712.671104]  nvme0n1:
root@tegra-ubuntu:/home/ubuntu# 
root@tegra-ubuntu:/home/ubuntu# mount /dev/nvme0n1 temp/
root@tegra-ubuntu:/home/ubuntu# 
root@tegra-ubuntu:/home/ubuntu# cd temp/
root@tegra-ubuntu:/home/ubuntu/temp# 
root@tegra-ubuntu:/home/ubuntu/temp# dd if=/dev/zero of=test bs=16M count=64 oflag=direct conv=fdatasync
64+0 records in
64+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.2717 s, 204 MB/s
root@tegra-ubuntu:/home/ubuntu/temp#

Apologies vidyas, I don’t recollect the file system. I believe it was FAT32, but I’ll provide more details when I have access to the hardware.

This bandwidth is impressive. We would like to see the same on our hardware.

What hardware are you using since Nano EVM doesn’t have PCIe x4?

One thing you should know is that for benchmarks you should not use a regular file at all. The reason is that you are going through the driver for ext4 (or VFAT or something else), and there is a lot going on with seeking, security checks, and setup. In the case where your source is “/dev/zero”, this works well because it is a device special file and not a real file…the output limitations of “/dev/zero” come only from its driver.

Consider whether you have a partition you can test against whose content you don’t mind destroying. I’ll call it “/dev/sdaX”. When you use dd to write to “/dev/sdaX”, once security allows this, all of the throughput limitation comes from the driver and the disk. The raw hardware driver is probably not much of a limitation. If you instead write to a file on that same “/dev/sdaX”, then performance would be expected to drop due to all of the extra software in the path.

So then how can I benchmark this?

Also, why isn’t the software limiting the results posted by vidyas?

Functions which provide data to a device special file are not an issue since this operates on a raw device driver and does not require filesystem overhead other than when determining if the user is allowed access. Any time your source data is from “/dev/zero” this is fulfilled. Any time your destination is to a partition instead of a file the requirement is fulfilled. Use something like “gparted” to non-destructively slice part of your rootfs partition into a new partition. Then use dd with the output to that partition (device special file) instead of to a file.
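
A sketch of that, with /dev/sdaX standing in for a scratch partition whose contents you do not mind destroying (double-check the name with lsblk first):

sudo dd if=/dev/zero of=/dev/sdaX bs=16M count=64 oflag=direct conv=fdatasync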

Going through a file is not an error. I’ll compare it, though, to racing an automobile through the middle of a large city and concluding it is a bad race car because it doesn’t run as fast as it would on a track designed for racing. Ext4 and some other filesystems may be relatively efficient, but this is far from streamlined. You could end up with the same results, but until you avoid the filesystem you are not testing the bandwidth of the raw device…you are testing the throughput of a series of devices chained together.

In the case of an error with a non-ext4 filesystem, this is quite possibly due to exceeding the maximum file size for that filesystem type. Ext4 can work with enormous files, whereas some of the 32-bit Windows filesystem types stop at 2 GB or 4 GB…if your filesystem is VFAT or FAT32 or one of those, then this will be an outright error forcing dd to stop. Even if you do not fail under one of these other filesystem types, I would expect ext4 to be better optimized and faster. Better yet is to not use a filesystem at all: if you directly access a partition via the device special file, then no filesystem is in the chain (the filesystem is only used to find the partition, and then stays out of the way).
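
If you are not sure which filesystem a given drive is formatted with, something like this will show it (look at the FSTYPE column):

lsblk -f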

You might end up with the same results, but if you pull the filesystem type out of the chain of limitations then you’ll actually know what you are profiling.

What hardware are you using since Nano EVM doesn’t have PCIe x4?

I’m using a Jetson TX1 system, which has a PCIe x4 slot.
BTW, I tried writing to /dev/nvme0n1 directly by setting of=/dev/nvme0n1 and I still got around the same number. As @linuxdev mentioned, the ext4 file system is very efficient, so it gives performance on par with the raw device. But I agree with @linuxdev that we should avoid any kind of file system in between and do reads/writes directly against the device file.
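
In other words, something along the lines of the following, which overwrites the start of the drive and destroys any filesystem on it:

dd if=/dev/zero of=/dev/nvme0n1 bs=16M count=64 oflag=direct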

@vidyas,
The test system doesn’t show Nano performance directly, but it can be treated as an approximation: the PCIe core is the same, so the test results should be comparable.

I tested the same on the Nano and we were able to achieve 790 MB/s.

Thanks for letting us know