Xavier PCIe performance

sfr · March 5, 2019, 10:25am

Hi,

thanks to the patch that was provided in a previous topic (https://devtalk.nvidia.com/default/topic/1047514/jetson-agx-xavier/pcie-non-prefetch-space-size-in-xavier/), we are now able to run traffic across the Xavier PCIe port using our PCIe NTB cable adapters, which connect the Xavier to a PC. The PCIe links are at least Gen3/x8. Xavier performance is set to maximum.

We have a user-space benchmark that uses PIO to write to remote memory which is mapped via PCIe all the way into the PC’s RAM. The mapping is done by the driver using remap_pfn_range() and pgprot_writecombine.

The default implementation of pgprot_writecombine uses the MT_NORMAL_NC attributes.
Here are the results:

Function: sciMemCopy_OS_COPY (0) 
---------------------------------------------------------------
Segment Size:    Average Send Latency:        Throughput:
---------------------------------------------------------------
      4              0.64 us                 6.28 MBytes/s
      8              0.63 us                12.72 MBytes/s
     16              0.63 us                25.29 MBytes/s
     32              0.64 us                50.26 MBytes/s
     64              0.68 us                94.17 MBytes/s
    128              0.69 us               185.21 MBytes/s
    256              1.00 us               254.87 MBytes/s
    512              0.83 us               616.30 MBytes/s
   1024              1.14 us               897.26 MBytes/s
   2048              1.75 us              1169.96 MBytes/s
   4096              2.92 us              1400.75 MBytes/s
   8192              5.17 us              1584.53 MBytes/s
  16384             17.01 us               963.08 MBytes/s
  32768             33.72 us               971.85 MBytes/s
  65536             67.19 us               975.33 MBytes/s

“Average Send Latency” is the time the CPU spends in the STORE instruction to PCIe memory, averaged over 50000 iterations. Compared to other systems, 0.63 usec is rather high. Usually we see less than 0.10 usec.
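
For readers checking the table: the throughput column appears to be simply the segment size divided by the average send latency (bytes per microsecond equals MBytes/s with 1 MByte = 10^6 bytes). A minimal sketch of that relation, using three rows from the first table; the function name is made up for illustration and small deviations from the table are expected because the printed latencies are rounded:

```python
def throughput_mbytes_per_s(segment_bytes, latency_us):
    """Bytes per microsecond is the same as 10^6 bytes per second (MBytes/s)."""
    return segment_bytes / latency_us


# (segment size in bytes, average send latency in us), copied from the first table
rows = [(4, 0.64), (8192, 5.17), (65536, 67.19)]

for size, latency in rows:
    rate = throughput_mbytes_per_s(size, latency)
    print(f"{size:6d} bytes  {latency:6.2f} us  {rate:9.2f} MBytes/s")
```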

Also the throughput is well below the capacity of a Gen3/x8 link. When we turn around the direction and the PC writes to the Xavier we see nearly 5700 MBytes/s. A PCIe analyzer revealed that the Xavier generates only 16-Byte TLPs.

The next thing we tried was mapping with MT_DEVICE_GRE. “Gather” should improve throughput by generating larger TLPs and “Early-acknowledge” (write posting) should bring down the send latency. However, the results are not much different from the original ones:

Function: sciMemCopy_OS_COPY (0) 
---------------------------------------------------------------
Segment Size:    Average Send Latency:        Throughput:
---------------------------------------------------------------
      4              0.63 us                 6.33 MBytes/s
      8              0.63 us                12.70 MBytes/s
     16              0.63 us                25.41 MBytes/s
     32              0.64 us                49.91 MBytes/s
     64              1.18 us                54.32 MBytes/s
    128              2.25 us                56.88 MBytes/s
    256              2.29 us               111.75 MBytes/s
    512              2.37 us               216.06 MBytes/s
   1024              2.59 us               395.82 MBytes/s
   2048              3.58 us               572.76 MBytes/s
   4096              4.37 us               937.33 MBytes/s
   8192              6.20 us              1321.27 MBytes/s
  16384             18.13 us               903.79 MBytes/s
  32768             34.84 us               940.46 MBytes/s
  65536             68.16 us               961.44 MBytes/s

Our questions are:

Is it possible to have the PCIe port generate MWr TLPs larger than 16 byte for PIO writes?
Is there another way to enable Early Write Acknowledgement to PCIe space?

Kind regards,
Friedrich

vidyas · March 21, 2019, 10:56am

Please give us some time. We’ll get back to you on this.

vidyas · September 15, 2019, 9:48am

Well, as I heard from our hardware folks, there is no limit set at the hardware level for 16-byte TLP generation. It could probably be coming from the userspace software i.e. based on how SW is written.
Is it possible to share the userspace benchmarking tool being used here? Also, what is the TLP size (in terms of bytes) when TLPs are generated by x86 PC?

slavisa.zigic · November 21, 2019, 4:19pm

Here is one more user of Xavier AGX with a concern about performance of PCIe Gen.3 interface.
Our test is simpler than the one explained above and consists of measuring throughput using SSD drive plugged into M.2 connector of Jetson AGX Xavier Developer Kit.

This is about testing PCIe Gen.3 x4 lane interface.
Performance was compared with Intel x86 PC.

When testing read/write speed of M.2 SSD (2TB, model number can be provided if needed) we got about 400MB/s when using Xavier AGX. On x86 PC, we got more than 1GB/s. The test was performed using Linux dd if=/dev/zero … when writing and also ‘dd’ when reading.
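
For anyone who wants to repeat this kind of measurement without dd, here is a rough Python sketch of a sequential write test. It is not the command used above (the dd options are not given in the post); the function name, block size and file path are assumptions chosen for illustration. The fsync() call is included so that data still sitting in the page cache does not inflate the result:

```python
import os
import tempfile
import time


def sequential_write_mbytes_per_s(path, total_bytes, block_size=1 << 20):
    """Write total_bytes of zeros sequentially to path and return MBytes/s."""
    block = bytes(block_size)  # zero-filled buffer, similar to reading /dev/zero
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    written = 0
    start = time.perf_counter()
    try:
        while written < total_bytes:
            chunk = block[: min(block_size, total_bytes - written)]
            written += os.write(fd, chunk)  # os.write may write less than requested
        os.fsync(fd)  # count the flush to the device as part of the timing
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    return written / elapsed / 1e6


if __name__ == "__main__":
    # Small demo in a temporary directory; point path at the SSD for a real test.
    with tempfile.TemporaryDirectory() as tmp:
        rate = sequential_write_mbytes_per_s(os.path.join(tmp, "testfile"), 16 << 20)
        print(f"{rate:.1f} MBytes/s")
```

A 16 MiB demo run says nothing about sustained speed; for a test comparable to the one described here, total_bytes would have to be far larger than any cache on the drive.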

It would be nice if Nvidia provided the actual throughput that can be achieved between a PCIe device and Xavier AGX memory (16GB 256-bit LPDDR4x 137GB/s).

Regards,

Slavisa

mdegans · November 21, 2019, 9:11pm

I have experienced identical speeds on PC and Xavier with a Samsung EVO 960 nvme ssd in the key M slot under the heat sink. I used hdparm -tT to benchmark. You should be aware that most SSDs have multiple levels of cache and as such performance may vary depending on how the cache is managed, how full the drive is, how old it is, whether you are reading or writing, whether the cache is flushed, etc.

In your test you are writing to the drive. Writes are routinely sent to fast, single level flash, and later flushed to multi level flash which is slower. If that was just performed after doing the same thing on PC, the cache may not have flushed. I recommend using hdparm -tT for consistent results since it’s a tool designed in part for this kind of benchmarking. Here are my results (nearly identical to the x86 machine the SSD was in before). I expect they would be even better with a faster, newer, SSD.:

/dev/nvme0n1:
 Timing cached reads:   6106 MB in  1.99 seconds = 3063.05 MB/sec
 Timing buffered disk reads: 3684 MB in  3.00 seconds = 1227.98 MB/sec

slavisa.zigic · November 21, 2019, 9:43pm

In my test, the SSD was tested for continuous (sequential) read/write. In other words, I was writing/reading for a very long time. More precisely, it was a full-capacity write test (a total of 2TB of writes, until the drive is full). We are talking here about almost half an hour of continuous writing. That is why cache levels should not be part of the equation. In other words, in the first few seconds of the write cycle, all of the SSD cache would be filled, and after that we have only the speed of the PCIe bus and the actual write speed of the flash memory on the SSD.

Since whole test executes faster on x86 PC (the same OS on both platforms), everything is pointing in direction of PCIe Gen.3 speed on Xavier AGX. Actually, the numbers related to PCIe Gen.3 x4 speed test are almost matching with detailed report of PCIe Gen.3 x8 posted above. The only difference is scaling factor, because x8 lanes are used instead of x4 lanes and hence double the speed.

Regards,

Slavisa

mdegans · November 21, 2019, 9:59pm

In this case, you are benchmarking the drive, not the interface.

I believe you may be partially mistaken. After the first few seconds or minutes of writing, depending on its size, your SLC cache (fast flash) fills up, after which your drive writes to slow multi-level flash (TLC/QLC). At that point, the bottleneck is the slow flash memory on the drive. This can be as bad as spinning disk speeds if you have something like an Intel 660p series SSD. 400MB/sec is not unexpected.

Please see the 970 evo specs on anandtech for example.

That is for the 250GB model. It’s better for larger variants. Samsung PRO series is designed differently and uses large amounts of fast flash instead of a fast flash cache backing slow flash. If you need sustained writes at a consistent speed, that will do better (2300 MB/s), but will still not saturate the link.

Did you try the test repeatedly on x86 with no pause in between? Or was the test on Xavier performed immediately after the x86 test? As stated above, there are a host of other factors (RAM, system caching policy) that could cause the difference in performance besides the flash cache levels. It’s recommended to use a tool designed for benchmarking that can account for these, like hdparm.

arunas.salkauskas · November 21, 2019, 10:48pm

It may be worth reading “Lies, Damn Lies And SSD Benchmark Test Result”, from Seagate:
https://www.seagate.com/ca/en/tech-insights/lies-damn-lies-and-ssd-benchmark-master-ti/
for some understanding of why there’s so much uncertainty about benchmarking SSDs and taking manufacturer’s performance specs at face value.

mdegans · November 21, 2019, 11:00pm

Storage manufacturers are the worst. They always have been. The whole base2 vs base10 disk size confusion, its countless related bugs, and lawsuits, is due to them. SSD write speeds are just the latest in a long list of lies. For most drives, for most people, nobody will ever notice a difference, but try to copy 1/2 TB over to the drive and watch performance hit a wall quickly.

slavisa.zigic · November 22, 2019, 2:02pm

I would go back to original topic “Xavier PCIe performance”. Since we are using SSD to test PCIe throughput, let us eliminate parameters that are not important, like amount and speed of cache on SSD. For 2TB SSD drive, and continuous sequential write, SSD cache size is not important, since cache is saturated after few seconds (cache size is not more than 8GB). After cache is saturated we have only two factors: write speed of PCIe bus and write speed of SSD flash memory.

My observation (as well as observation from Friedrich) is that PCIe speed on Xavier AGX is far below the limit and what x86 PC can do. Friedrich used PCIe analyzer, I used SSD and we got very similar results. When I use the same SSD and plug it into x86 PC, measure more than 1GB/s write speed (using same OS and same Linux command) and then plug it into Xavier AGX and measure 400MB/s, I think that discussion about SSD cache, memory organization on SSD and benchmarks is out of question.

I still live in hope that there is a way to set up Xavier AGX so that we will see decent PCIe speed.
This is a non-trivial question for Nvidia, because they used a Synopsys IP core for PCI Express.

Although this IP core is advertised as PCIe Gen.3 and Gen.4 capable, we are observing speeds well below the expected for PCIe Gen.3.

Note: SSD used for tests is Addlink S70, 2TB

mdegans · November 23, 2019, 12:32am

Let’s agree to disagree about the importance of that. In my view you’re only testing the speed of the slowest flash on your drive when you do sustained writes like this.

Which is actually a lot lower than the PCIe link itself. I think you may be confusing the cache on the drive with other caches. Most SSDs slow down on sustained writes, including yours. Reviews report it slowing down after sustained writes.

To be fair, they don’t say what model they’re testing and larger drives are usually faster, but there isn’t a lot of detailed information about your drive out there. You say it’s faster on x86. Maybe it is, but there are too many factors at play to know the cause for sure. I’d suggest testing with an NVIDIA GPU but they aren’t yet supported. You could try something like this, but it’s likely expensive af. Any other suggestions of things to stick in the pci slots? If you have two Xaviers, you could test that way.

slavisa.zigic · November 25, 2019, 1:57pm

Mdegans, can you please try to explain what impact cache and flash speed have in the following setup? To be more precise, how do cache and flash affect this measurement:

Case 1: Use 2TB SSD connected to PCIe, Gen 3, x4 lane bus and measure continuous sequential write until the drive is 100% full. This is done on Xavier AGX using Linux command ‘dd’. Result (average write speed): ~400MB/s.

Case 2: Use the same 2TB SSD connected to PCIe, Gen 3, x4 lane bus and measure continuous sequential write until the drive is 100% full. This is done on Intel PC using Linux command ‘dd’. Result (average write speed): >1GB/s.

It could be that I am wrong and that “there are too many factors at play to know the cause”, but in my opinion the only thing that has been changed is HW platform (CPU + IO).

There is one more indicative thing. Two different customers (Friedrich and me) used two different methods and got similar results. Now, If we see someone posting setup and test results showing different results, that would be different story.

After all, I still hope that I am wrong and that someone from Nvidia will be able to explain how to improve PCIe performance on Xavier AGX.

sfr · November 25, 2019, 2:31pm

Maybe we should split this topic. The starting point for us was unexpectedly low raw PCIe bandwidth when the CPU writes data to PCIe device memory space. Below is an illustration of our setup. The adapter cards feature a non-transparent PCIe switch with a DMA engine.

                Adapter               Adapter
                 Card                  Card
+-------+ PCIe  +-------+             +--------+ PCIe  +-------+
|Xavier | slot  |Non-   | PCIe cable  |Non-    | slot  |PC     |
|Root   +-------+transp.+-------------+transp. +-------+Root   |
|Complex|       |bridge |   Gen3/x8   |bridge  |       |Complex|
+-------+       | +DMA  |             | +DMA   |       +-------+
                +-------+             +--------+           |
                                                        +-----+
                                                        | RAM |
                                                        +-----+

If the PCIe traffic towards the PC’s RAM is generated by the Xavier CPU as Programmed I/O (PIO) we see the bandwidth as shown in the original post, i.e. ~1500 MB/s.

However, if the DMA engine generates the traffic we see more than 5200 MB/s. So the Xavier’s PCIe interface itself is not the bottleneck. The major difference to PIO is that the DMA engine can create TLPs with 256 byte payload, while the CPU generates only 16-byte TLPs.

@vidyas: Thanks for your comment. Sharing the test program is a bit complicated since it does only work with our NTB driver and hardware. However, in principle it mmap’s a portion of the NTB’s PCIe memory into user space (we tried various mapping attributes) and performs memcpy() from local memory to remote PCIe memory. The NTBs are set up so that traffic is redirected to the PC’s RAM.
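
Since the real test program cannot be shared, here is a rough sketch of the loop structure described above: map a window, copy segments of increasing size into it, and average over many iterations. It is written in Python for brevity and uses an anonymous mmap as a stand-in for the NTB's PCIe window, so the numbers it prints say nothing about PIO performance; only the structure of the measurement is illustrated. All names are hypothetical.

```python
import mmap
import time


def pio_copy_benchmark(window, sizes, iterations=1000):
    """Copy each segment size into the mapped window; return (size, latency_us, MBytes/s)."""
    results = []
    for size in sizes:
        src = bytes(size)  # local source buffer
        start = time.perf_counter()
        for _ in range(iterations):
            window[0:size] = src  # stands in for memcpy() into remote PCIe memory
        elapsed = time.perf_counter() - start
        latency_us = elapsed / iterations * 1e6
        results.append((size, latency_us, size / latency_us))
    return results


if __name__ == "__main__":
    # Anonymous mapping: a placeholder for the mmap'ed NTB memory window.
    window = mmap.mmap(-1, 65536)
    for size, latency_us, rate in pio_copy_benchmark(window, [4, 64, 1024, 65536]):
        print(f"{size:6d} bytes  {latency_us:8.2f} us  {rate:10.2f} MBytes/s")
    window.close()
```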

On PCs we typically see TLP sizes of 64 bytes.
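
To put the three payload sizes mentioned here (16, 64 and 256 bytes) into perspective, the following back-of-the-envelope sketch estimates the payload ceiling of a Gen3/x8 link for each of them. The per-TLP overhead of 24 bytes is an assumption (4 bytes of framing including the sequence number, a 16-byte 4DW header, 4 bytes of LCRC, no ECRC). ACK and flow-control DLLPs are ignored, as is the rate at which the CPU can issue stores, so these are upper bounds only.

```python
GEN3_GT_PER_S = 8.0     # Gen3 signalling rate per lane, in GT/s
ENCODING = 128 / 130    # Gen3 uses 128b/130b line encoding
LANES = 8

# Assumed overhead per TLP in bytes: 4 framing + 16 header (4DW) + 4 LCRC.
TLP_OVERHEAD_BYTES = 24


def link_rate_mbytes_per_s(lanes=LANES):
    """Raw link rate after line encoding, in MBytes/s (1 MByte = 10^6 bytes)."""
    return GEN3_GT_PER_S * 1e9 * ENCODING * lanes / 8 / 1e6


def payload_efficiency(payload_bytes, overhead=TLP_OVERHEAD_BYTES):
    """Fraction of the bytes on the wire that is payload, for one TLP."""
    return payload_bytes / (payload_bytes + overhead)


for payload in (16, 64, 256):
    eff = payload_efficiency(payload)
    ceiling = eff * link_rate_mbytes_per_s()
    print(f"{payload:3d}-byte payload: {eff:5.1%} efficient, ceiling ~{ceiling:6.0f} MBytes/s")
```

Under these assumptions, small TLPs cost a large share of the link, but the ceiling for 16-byte payloads still lies above the PIO throughput reported in the original post.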

slavisa.zigic · November 25, 2019, 3:44pm

Sfr, thank you for the details about your setup. I went back and checked your first post. It seems that with DMA transfers, you have decent speed (throughput) in both directions.
I am assuming that you are using Broadcom (former PLX) PEX87xx switches set to NTB mode.
