Transfer speed of NVMe SSDs

Hello,

I have a Jetson TX2i connected to 3 NVMe SSDs through a PCIe switch:

  • Switch: Broadcom PEX8724, a Gen3 PCIe switch

  • SSDs: Samsung 970 Pro, also Gen3 x4 PCIe

I have tested both the switch and the SSDs with another host acting as PCIe root (not the TX2i), and I was able to get the expected transfer rates (writes of about 2.7 GB/s and reads of more than 3 GB/s).
The first thing that is off when using the TX2i to access the SSDs is that, after booting, two of the SSDs appear to be operating at Gen2 (5 GT/s) and one at Gen1 (2.5 GT/s).

This, however, is not that surprising: SSDs apparently take some time to boot, and if link negotiation happens too early they often negotiate Gen1 speeds.
I can access my PCIe switch over the I2C bus and trigger a link re-training. If I do that, the two SSDs that were operating at Gen2 continue to operate at that speed, while the one that was operating at Gen1 goes up to Gen3…

Even more strange: if I write a simple C program which opens the device nodes (/dev/nvme0n1, etc.) and reads from them, I get a maximum of around 850 MB/s no matter which SSD I actually read from. There is no noticeable speed difference from one SSD to the others, even though one is supposed to be operating four times as fast (Gen1 vs. Gen3). I do see different data being read, so I am sure I am actually reading from the different SSDs and not always hitting the same one.

Also, if I run my test program before re-training the links, I see the same transfer speed across all SSDs, even though one is operating at Gen1 and the others at Gen2…

I am manually retraining the links through I2C. I have used that method in the past and I am quite certain it works well. Other methods like setpci have just not worked for me; I don’t know why, but they don’t seem reliable:

sudo setpci -s CAP_EXP+0x10.L=0x00000020:0x00000020
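(For reference, that setpci command sets the Retrain Link bit, bit 5 / mask 0x20, of the Link Control register at offset 0x10 of the PCI Express capability (CAP_EXP). Note that setpci’s -s option expects the bus:device.function of the downstream port whose link should be retrained, so a device selector has to be given as well.)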

Why am I seeing Gen1-level transfer rates when my SSDs are running at Gen2/Gen3?
Why do two of my SSDs seem unable to work at Gen3 when I have verified they can do so using another host?

I am attaching the output of my lspci both before and after re-training the SSD links. The very last device (06:00.0) is the SSD which starts working at Gen3 after the retraining; the other two (04:00.0 and 05:00.0) stay at Gen2.
lspci_after_retrain.txt (42.4 KB) lspci_before_retrain.txt (42.4 KB)

Also, on my TX2i I seem to be unable to do an “echo 1 > /sys/bus/pci/devices/0000:00:01.0/rescan” or any other kind of PCIe operation through sysfs without doing a “sudo chmod 777 <file_node>” first. Why is that? Is that normal?

nvidia@tegra-ubuntu:~$ sudo echo 1 > /sys/bus/pci/devices/0000\:04\:00.0/rescan
-bash: /sys/bus/pci/devices/0000:04:00.0/rescan: Permission denied
nvidia@tegra-ubuntu:~$ sudo chmod 777 /sys/bus/pci/devices/0000:04:00.0/rescan
nvidia@tegra-ubuntu:~$ sudo echo 1 > /sys/bus/pci/devices/0000:04:00.0/rescan
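(For what it’s worth, the permission error above is probably not PCIe-specific: with “sudo echo 1 > file” the redirection is performed by the invoking, non-root shell, so the actual write is not done as root. Instead of the chmod workaround, the write can be done by a root process, e.g.:

echo 1 | sudo tee /sys/bus/pci/devices/0000:04:00.0/rescan

or from a root shell obtained with “sudo -i”.)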

TX2i’s PCIe is capable of Gen2 and x4. I’m wondering if the switch has any logic to downgrade the downstream links based on the link speed and width between its upstream port and the TX2i’s root port. Could you please check with the switch vendor?
Can you use a standard tool like ‘dd’ and report the bandwidth observed?
dd if=/dev/zero of=/dev/nvme0n1 bs=16M count=64 oflag=direct conv=fdatasync

Hi @vidyas
This board is designed so that, by configuring the switch, we can change which device is the PCIe root. So I can completely turn off the TX2i and then read/write the SSDs from an FPGA (also going through the switch).
Doing that gives me the speeds I am expecting, so the switch IS capable of all this; it must be something the Tegra is doing… That is why I attached the lspci outputs.

I also tried with the TX2 development kit and an Applicatta quattro card with 3 SSDs connected to it. That board has the same switch that I am using in my own board, and I could also see how the Tegra was somehow causing the downstream SSDs to work at Gen1/Gen2, never at Gen3…

I have tested with dd as you requested, and surprisingly dd reports speeds even lower than my own C program did…
dd reports around 250 MB/s, no matter whether the SSD has negotiated Gen1, Gen2 or Gen3…

nvidia@tegra-ubuntu:~$ sudo dd if=/dev/nvme0n1 of=/dev/null bs=512 count=10000000
10000000+0 records in
10000000+0 records out
5120000000 bytes (5.1 GB, 4.8 GiB) copied, 19.7591 s, 259 MB/s

I have tried with different sizes.
What could be the issue here?

So I ran some more tests:

With that Applicatta quattro board and the TX2 development kit I get the results below. Note that the first set does not use iflag=direct; I don’t know why, but that flag seems to impact the results a lot, making them much faster:

These results are for reading 1 GB with different block sizes, without iflag=direct:
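(test_speeds.sh itself is not posted in the thread; a minimal sketch of the kind of loop it presumably runs, with the device name, block sizes and counts inferred from the output below and iflag=direct added for the direct-I/O runs, would be:

    #!/bin/bash
    # Hypothetical reconstruction of test_speeds.sh: read ~1 GB from one SSD
    # at several block sizes and let dd report the throughput.
    DEV=/dev/nvme0n1
    for bs in 1M 10M 100M 1000M; do
        case $bs in
            1M)    count=1000 ;;
            10M)   count=100  ;;
            100M)  count=10   ;;
            1000M) count=1    ;;
        esac
        echo "$bs block"
        sudo dd if=$DEV of=/dev/null bs=$bs count=$count
    done
)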

nvidia@tegra-ubuntu:~$ ./test_speeds.sh
1M block
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.14724 s, 914 MB/s
10M block
100+0 records in
100+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.41301 s, 742 MB/s
100M block
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.18232 s, 887 MB/s
1000M block
1+0 records in
1+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.82126 s, 576 MB/s

These results are for reading 1 GB with different block sizes, with iflag=direct:

1M block
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 2.14356 s, 489 MB/s
10M block
100+0 records in
100+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.832407 s, 1.3 GB/s
100M block
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.768115 s, 1.4 GB/s
1000M block
1+0 records in
1+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.08226 s, 969 MB/s

These results are for reading 5 GB with different block sizes, with iflag=direct:

1M block
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB, 4.9 GiB) copied, 11.0604 s, 474 MB/s
10M block
500+0 records in
500+0 records out
5242880000 bytes (5.2 GB, 4.9 GiB) copied, 4.17893 s, 1.3 GB/s
100M block
50+0 records in
50+0 records out
5242880000 bytes (5.2 GB, 4.9 GiB) copied, 3.69602 s, 1.4 GB/s
1000M block
5+0 records in
5+0 records out
5242880000 bytes (5.2 GB, 4.9 GiB) copied, 3.73779 s, 1.4 GB/s

These are still not Gen2 x4 speeds, but they are much better. According to lspci, my SSDs have still negotiated Gen2 x4. I am still curious why I can’t get them to Gen3 x4, but since the TX2 root port is Gen2 x4, I guess it is not that important. All I want is to get closer to the theoretical speed (about 2 GB/s, since this is Gen2 x4).

If I run the same tests on my custom board I get the following:

These results are for reading 1 GB with different block sizes, without iflag=direct:

nvidia@tegra-ubuntu:~$ ./test_speed.sh
1M block
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 4.59133 s, 228 MB/s
10M block
100+0 records in
100+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.45304 s, 192 MB/s
100M block
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.54286 s, 189 MB/s
1000M block
1+0 records in
1+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.25225 s, 200 MB/s

These results are for reading 1 GB with different block sizes, with iflag=direct:

1M block
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 2.73153 s, 384 MB/s
10M block
100+0 records in
100+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.09067 s, 961 MB/s
100M block
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.45053 s, 723 MB/s
1000M block
1+0 records in
1+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 2.47037 s, 424 MB/s

These results are for reading 5 GB with different block sizes, with iflag=direct:

1M block
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB, 4.9 GiB) copied, 16.9431 s, 309 MB/s
10M block
500+0 records in
500+0 records out
5242880000 bytes (5.2 GB, 4.9 GiB) copied, 5.67265 s, 924 MB/s
100M block
50+0 records in
50+0 records out
5242880000 bytes (5.2 GB, 4.9 GiB) copied, 5.99964 s, 874 MB/s
1000M block
5+0 records in
5+0 records out
5242880000 bytes (5.2 GB, 4.9 GiB) copied, 7.46426 s, 702 MB/s

The speeds are “close”, but the difference is too big to be negligible: we drop from 1.4 GB/s to around 920 MB/s, and in theory we should be getting close to 2 GB/s.
One of the few differences between these two setups is that on my custom board I have disabled the “clkreq” signal for the PCIe link, so that hopefully the reference clock stays active no matter what.
This is the configuration for the PCIe root port in my customized device tree:

    pcie-controller@10003000 {
            compatible = "nvidia,tegra186-pcie";
            power-domains = <0xcc>;
            device_type = "pci";
            reg = <0x0 0x10003000 0x0 0x800 0x0 0x10003800 0x0 0x800 0x0 0x40000000 0x0 0x10000000>;
            reg-names = "pads", "afi", "cs";
            clocks = <0xd 0x4 0xd 0x3 0xd 0x261>;
            clock-names = "afi", "pcie", "clk_m";
            resets = <0xd 0x1 0xd 0x1d 0xd 0x1e>;
            reset-names = "afi", "pcie", "pciex";
            interrupts = <0x0 0x48 0x4 0x0 0x49 0x4>;
            interrupt-names = "intr", "msi";
            #interrupt-cells = <0x1>;
            interrupt-map-mask = <0x0 0x0 0x0 0x0>;
            interrupt-map = <0x0 0x0 0x0 0x0 0x1 0x0 0x48 0x4>;
            #stream-id-cells = <0x1>;
            bus-range = <0x0 0xff>;
            #address-cells = <0x3>;
            #size-cells = <0x2>;
            ranges = <0x82000000 0x0 0x10000000 0x0 0x10000000 0x0 0x1000 0x82000000 0x0 0x10001000 0x0 0x10001000 0x0 0x1000 0x82000000 0x0 0x10004000 0x0 0x10004000 0x0 0x1000 0x81000000 0x0 0x0 0x0 0x50000000 0x0 0x10000 0x82000000 0x0 0x50100000 0x0 0x50100000 0x0 0x7f00000 0xc2000000 0x0 0x58000000 0x0 0x58000000 0x0 0x28000000>;
            status = "okay";
            vddio-pexctl-aud-supply = <0xe>;
            linux,phandle = <0x76>;
            phandle = <0x76>;

            pci@1,0 {
                    device_type = "pci";
                    assigned-addresses = <0x82000800 0x0 0x10000000 0x0 0x1000>;
                    reg = <0x800 0x0 0x0 0x0 0x0>;
                    status = "okay";
                    #address-cells = <0x3>;
                    #size-cells = <0x2>;
                    ranges;
                    nvidia,num-lanes = <0x4>;
                    nvidia,afi-ctl-offset = <0x110>;
                    nvidia,disable-clock-request;
            };

            pci@2,0 {
                    device_type = "pci";
                    assigned-addresses = <0x82001000 0x0 0x10001000 0x0 0x1000>;
                    reg = <0x1000 0x0 0x0 0x0 0x0>;
                    status = "disabled";
                    #address-cells = <0x3>;
                    #size-cells = <0x2>;
                    ranges;
                    nvidia,num-lanes = <0x1>;
                    nvidia,afi-ctl-offset = <0x118>;
                    nvidia,disable-clock-request;
            };

            pci@3,0 {
                    device_type = "pci";
                    assigned-addresses = <0x82001800 0x0 0x10004000 0x0 0x1000>;
                    reg = <0x1800 0x0 0x0 0x0 0x0>;
                    status = "disabled";
                    #address-cells = <0x3>;
                    #size-cells = <0x2>;
                    ranges;
                    nvidia,num-lanes = <0x1>;
                    nvidia,afi-ctl-offset = <0x19c>;
                    nvidia,disable-clock-request;
            };

            prod-settings {
                    #prod-cells = <0x3>;

                    prod_c_pad {
                            prod = <0xc8 0xffffffff 0x80b880b8 0xcc 0xffffffff 0x480b8>;
                    };
            };
    };

For dd’s “bs=” parameter, try much larger sizes, for example “bs=4M” or “bs=16M”. A block size of 512 will generally be very slow in most environments.
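For instance, a direct-I/O read test along these lines (the device name matches the one used earlier in the thread):

sudo dd if=/dev/nvme0n1 of=/dev/null bs=16M count=320 iflag=direct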

To know what mode your PCIe is actually running at, run “lspci” to see the slot ID (which will look similar to something like “01:00.0”…substitute with your actual value), and then use sudo to produce a verbose lspci for the device:
sudo lspci -s 01:00.0 -vvv

If you read my posts, you will see that I have already done all of this…

Although the theoretical bandwidth is 2 GB/s (after considering 8b/10b encoding), there are further protocol overheads to be considered (flow-control packets, packet headers, etc. also consume bandwidth). The effective available bandwidth on TX2 is around 1.5 GB/s, and I see that you are already getting close to that.
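For reference, the 2 GB/s figure works out as follows: 5 GT/s per lane × 4 lanes = 20 Gb/s on the wire; 8b/10b encoding leaves 20 × 8/10 = 16 Gb/s = 2 GB/s of usable data rate, before TLP headers, DLLPs and flow-control credits take their additional share.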