We have a user-space benchmark that uses PIO to write to remote memory which is mapped via PCIe all the way into the PC’s RAM. The mapping is done by the driver using remap_pfn_range() and pgprot_writecombine.
The default implementation of pgprot_writecombine uses the MT_NORMAL_NC attributes.
Here are the results:
Function: sciMemCopy_OS_COPY (0)
Segment Size (bytes): Average Send Latency: Throughput:
4 0.64 us 6.28 MBytes/s
8 0.63 us 12.72 MBytes/s
16 0.63 us 25.29 MBytes/s
32 0.64 us 50.26 MBytes/s
64 0.68 us 94.17 MBytes/s
128 0.69 us 185.21 MBytes/s
256 1.00 us 254.87 MBytes/s
512 0.83 us 616.30 MBytes/s
1024 1.14 us 897.26 MBytes/s
2048 1.75 us 1169.96 MBytes/s
4096 2.92 us 1400.75 MBytes/s
8192 5.17 us 1584.53 MBytes/s
16384 17.01 us 963.08 MBytes/s
32768 33.72 us 971.85 MBytes/s
65536 67.19 us 975.33 MBytes/s
“Average Send Latency” is the time the CPU spends in the STORE instruction to PCIe memory, averaged over 50000 iterations. Compared to other systems, 0.63 usec is rather high. Usually we see less than 0.10 usec.
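The two columns of the table appear to be tied together by throughput = segment size / latency. This is an observation from the posted numbers, not something the benchmark documentation states. A minimal sketch that checks the relation against a few rows of the first table (the function name is made up for illustration):

```python
def throughput_mbytes_per_s(segment_bytes: int, latency_us: float) -> float:
    """One byte per microsecond equals 10^6 bytes per second, i.e. 1 MByte/s."""
    return segment_bytes / latency_us


# (segment size in bytes, average send latency in us, reported MBytes/s),
# copied from the first table above.
rows = [
    (4, 0.64, 6.28),
    (64, 0.68, 94.17),
    (8192, 5.17, 1584.53),
    (65536, 67.19, 975.33),
]

for size, latency, reported in rows:
    computed = throughput_mbytes_per_s(size, latency)
    print(f"{size:6d} B: computed {computed:8.2f} MBytes/s, reported {reported:8.2f} MBytes/s")
```

Small deviations are expected because the latency column is rounded to two decimals.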
Also, the throughput is well below the capacity of a Gen3/x8 link. When we reverse the direction, so that the PC writes to the Xavier, we see nearly 5700 MBytes/s. A PCIe analyzer revealed that the Xavier generates only 16-byte TLPs.
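For reference, the raw capacity of a Gen3 link can be estimated from the 8 GT/s per-lane rate and the 128b/130b line encoding. The sketch below is a rough calculation only: it ignores TLP headers, DLLPs and flow control, so real payload throughput is always lower. The function name is illustrative.

```python
def pcie_gen3_raw_mbytes_per_s(lanes: int) -> float:
    """Raw link bandwidth in MBytes/s for PCIe Gen3, before any packet overhead."""
    transfers_per_s = 8.0e9          # 8 GT/s per lane (Gen3)
    encoding = 128 / 130             # 128b/130b line encoding
    bits_per_s = lanes * transfers_per_s * encoding
    return bits_per_s / 8 / 1e6


x8 = pcie_gen3_raw_mbytes_per_s(8)
print(f"Gen3 x8 raw: {x8:.0f} MBytes/s")
# Measured values from the post, as a fraction of the raw link rate.
for label, measured in [("PIO Xavier -> PC (best case)", 1584.53),
                        ("PC -> Xavier", 5700.0)]:
    print(f"{label}: {measured / x8:.1%} of raw")
```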
The next thing we tried was mapping with MT_DEVICE_GRE. “Gather” should improve throughput by generating larger TLPs and “Early-acknowledge” (write posting) should bring down the send latency. However, the results are no better than the original ones, and for segment sizes from 64 bytes to 4 KBytes they are clearly worse:
Function: sciMemCopy_OS_COPY (0)
Segment Size (bytes): Average Send Latency: Throughput:
4 0.63 us 6.33 MBytes/s
8 0.63 us 12.70 MBytes/s
16 0.63 us 25.41 MBytes/s
32 0.64 us 49.91 MBytes/s
64 1.18 us 54.32 MBytes/s
128 2.25 us 56.88 MBytes/s
256 2.29 us 111.75 MBytes/s
512 2.37 us 216.06 MBytes/s
1024 2.59 us 395.82 MBytes/s
2048 3.58 us 572.76 MBytes/s
4096 4.37 us 937.33 MBytes/s
8192 6.20 us 1321.27 MBytes/s
16384 18.13 us 903.79 MBytes/s
32768 34.84 us 940.46 MBytes/s
65536 68.16 us 961.44 MBytes/s
Our questions are:
Is it possible to have the PCIe port generate MWr TLPs larger than 16 bytes for PIO writes?
Is there another way to enable Early Write Acknowledgement to PCIe space?
Well, as I heard from our hardware folks, there is no hardware-level limit that restricts TLP generation to 16 bytes. It is probably coming from the userspace software, i.e. from how the SW is written.
Is it possible to share the userspace benchmarking tool being used here? Also, what is the TLP size (in bytes) when the TLPs are generated by the x86 PC?
Here is one more Xavier AGX user with a concern about the performance of the PCIe Gen.3 interface.
Our test is simpler than the one explained above and consists of measuring throughput using an SSD plugged into the M.2 connector of the Jetson AGX Xavier Developer Kit.
This tests the PCIe Gen.3 x4 interface.
Performance was compared with an Intel x86 PC.
When testing the read/write speed of an M.2 SSD (2TB, model number can be provided if needed) we got about 400MB/s on the Xavier AGX. On the x86 PC, we got more than 1GB/s. The write test was performed using Linux dd if=/dev/zero …, and the read test also used ‘dd’.
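For anyone who wants to repeat this kind of measurement, here is a minimal dd-like sketch. It is not the command used above (the exact dd arguments were not posted); it writes zero-filled blocks sequentially and calls fsync before stopping the clock, so data still sitting in the page cache does not inflate the result. The function name, block size and temporary file are assumptions for illustration; on real hardware the path would point at a file on the SSD under test.

```python
import os
import tempfile
import time


def sequential_write_mbytes_per_s(path: str, total_bytes: int,
                                  block_bytes: int = 1 << 20) -> float:
    """Write total_bytes of zeros to path and return throughput in MBytes/s."""
    block = bytes(block_bytes)  # zero-filled buffer, like reading /dev/zero
    written = 0
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        while written < total_bytes:
            remaining = total_bytes - written
            # os.write may write fewer bytes than requested, so use its return value.
            written += os.write(fd, block[:min(block_bytes, remaining)])
        os.fsync(fd)  # make sure the data has left the page cache
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    return written / elapsed / 1e6


# Small demo on a temporary file (8 MiB only, far too short for a real benchmark).
with tempfile.TemporaryDirectory() as tmp:
    rate = sequential_write_mbytes_per_s(os.path.join(tmp, "test.bin"), 8 << 20)
    print(f"{rate:.1f} MBytes/s")
```

A run this short mostly measures caches; the sustained numbers discussed in this thread need a write that is much larger than any cache in the path.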
It would be nice if Nvidia provided the actual throughput that can be achieved between a PCIe device and Xavier AGX memory (16GB 256-bit LPDDR4x 137GB/s).
I have experienced identical speeds on PC and Xavier with a Samsung EVO 960 NVMe SSD in the key M slot under the heat sink. I used hdparm -tT to benchmark. You should be aware that most SSDs have multiple levels of cache and as such performance may vary depending on how the cache is managed, how full the drive is, how old it is, whether you are reading or writing, whether the cache is flushed, etc.
In your test you are writing to the drive. Writes are routinely sent to fast, single level flash, and later flushed to multi level flash, which is slower. If the Xavier test was performed right after doing the same thing on the PC, the cache may not have been flushed yet. I recommend using hdparm -tT for consistent results since it’s a tool designed in part for this kind of benchmarking. Here are my results (nearly identical to the x86 machine the SSD was in before). I expect they would be even better with a faster, newer SSD:
Timing cached reads: 6106 MB in 1.99 seconds = 3063.05 MB/sec
Timing buffered disk reads: 3684 MB in 3.00 seconds = 1227.98 MB/sec
In my test, the SSD was tested for continuous (sequential) read/write. In other words, I was writing/reading for a very long time. More precisely, it was a full-capacity write test (2TB of writes in total, until the drive is full). We are talking about almost half an hour of continuous writing. That is why cache levels should not be part of the equation: in the first few seconds of the write cycle all of the SSD cache is filled, and after that only the speed of the PCIe bus and the actual write speed of the flash memory on the SSD matter.
Since the whole test executes faster on the x86 PC (the same OS on both platforms), everything points in the direction of the PCIe Gen.3 speed on Xavier AGX. Actually, the numbers from the PCIe Gen.3 x4 speed test almost match the detailed PCIe Gen.3 x8 report posted above. The only difference is a scaling factor, because x8 uses twice as many lanes as x4 and hence doubles the speed.
In this case, you are benchmarking the drive, not the interface.
I believe you may be partially mistaken. After the first few seconds/minutes of writing, depending on the size, your SLC (fast flash) is filled up, after which your drive will write to slow multi level flash (TLC/QLC). At that point, the bottleneck is the slow flash memory on the drive. This can be as bad as spinning disk speeds if you have something like an Intel 660p series SSD. 400MB/sec is not unexpected.
That is for the 250GB model. It’s better for larger variants. Samsung PRO series is designed differently and uses large amounts of fast flash instead of a fast flash cache backing slow flash. If you need sustained writes at a consistent speed, that will do better (2300 MB/s), but will still not saturate the link.
Did you try the test repeatedly on x86 with no pause in between? Or was the test on Xavier performed immediately after the x86 test? As stated above, there are a host of other factors (RAM, system caching policy) that could cause the difference in performance besides the flash cache levels. It’s recommended to use a tool designed for benchmarking, like hdparm, that can account for these.
Storage manufacturers are the worst. They always have been. The whole base2 vs base10 disk size confusion, its countless related bugs, and lawsuits, is due to them. SSD write speeds are just the latest in a long list of lies. For most drives, for most people, nobody will ever notice a difference, but try to copy 1/2 TB over to the drive and watch performance hit a wall quickly.
I would go back to the original topic, “Xavier PCIe performance”. Since we are using an SSD to test PCIe throughput, let us eliminate parameters that are not important, like the amount and speed of cache on the SSD. For a 2TB SSD and a continuous sequential write, the SSD cache size is not important, since the cache is saturated after a few seconds (the cache size is not more than 8GB). After the cache is saturated we have only two factors: the write speed of the PCIe bus and the write speed of the SSD flash memory.
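The claim that the cache hardly matters for a full-drive write can be put into numbers with a simple two-phase model: the first part of the data is written at cache speed, the rest at the sustained speed. This is a sketch under stated assumptions. The 2TB and 8GB figures come from this thread; the 2000 MB/s cache speed is a made-up placeholder, and the model says nothing about whether the sustained speed is limited by the flash or by the PCIe link.

```python
def average_write_mbytes_per_s(total_gb: float, cache_gb: float,
                               cache_mbytes_per_s: float,
                               sustained_mbytes_per_s: float) -> float:
    """Average throughput of a write that first fills the cache, then runs at sustained speed."""
    cached_gb = min(cache_gb, total_gb)
    seconds = (cached_gb * 1000 / cache_mbytes_per_s
               + (total_gb - cached_gb) * 1000 / sustained_mbytes_per_s)
    return total_gb * 1000 / seconds


# 2TB drive, 8GB cache (upper bound from the post), hypothetical 2000 MB/s cache speed.
for sustained in (400.0, 1000.0):
    avg = average_write_mbytes_per_s(2000, 8, 2000.0, sustained)
    print(f"sustained {sustained:6.0f} MB/s -> full-drive average {avg:7.1f} MB/s")
```

In this model the average over the whole drive differs from the sustained speed by well under one percent, so the measured average essentially is the sustained speed.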
My observation (as well as Friedrich’s) is that PCIe speed on Xavier AGX is far below the limit and below what an x86 PC can do. Friedrich used a PCIe analyzer, I used an SSD, and we got very similar results. When I plug the same SSD into an x86 PC and measure more than 1GB/s write speed (using the same OS and the same Linux command), and then plug it into Xavier AGX and measure 400MB/s, I think the discussion about SSD cache, memory organization on the SSD and benchmarks is beside the point.
I still live in hope that there is a way to set up Xavier AGX so that we will see decent PCIe speed.
This is a non-trivial question for Nvidia, because they used a Synopsys IP core for PCI Express.
Although this IP core is advertised as PCIe Gen.3 and Gen.4 capable, we are observing speeds well below what is expected for PCIe Gen.3.
Let’s agree to disagree about the importance of that. In my view you’re only testing the speed of the slowest flash on your drive when you do sustained writes like this.
Which is actually a lot lower than the PCIe link itself. I think you may be confusing the cache on the drive with other caches. Most SSDs slow down on sustained writes, including yours. Reviews report it slowing down after sustained writes.
To be fair, they don’t say what model they’re testing and larger drives are usually faster, but there isn’t a lot of detailed information about your drive out there. You say it’s faster on x86. Maybe it is, but there are too many factors at play to know the cause for sure. I’d suggest testing with an NVIDIA GPU but they aren’t yet supported. You could try something like this, but it’s likely expensive af. Any other suggestions of things to stick in the pci slots? If you have two Xaviers, you could test that way.
Mdegans, can you please explain what impact cache and flash speed have in the following setup? To be more precise, how do cache and flash affect this measurement:
Case 1: Use 2TB SSD connected to PCIe, Gen 3, x4 lane bus and measure continuous sequential write until the drive is 100% full. This is done on Xavier AGX using Linux command ‘dd’. Result (average write speed): ~400MB/s.
Case 2: Use the same 2TB SSD connected to PCIe, Gen 3, x4 lane bus and measure continuous sequential write until the drive is 100% full. This is done on Intel PC using Linux command ‘dd’. Result (average write speed): >1GB/s.
It could be that I am wrong and that “there are too many factors at play to know the cause”, but in my opinion the only thing that has been changed is HW platform (CPU + IO).
There is one more indicative thing. Two different customers (Friedrich and I) used two different methods and got similar results. Now, if we see someone posting a setup and test results showing something different, that would be a different story.
After all, I still hope that I am wrong and that someone from Nvidia will be able to explain how to improve PCIe performance on Xavier AGX.
Maybe we should split this topic. The starting point for us was unexpectedly low raw PCIe bandwidth when the CPU writes data to PCIe device memory space. Below is an illustration of our setup. The adapter cards feature a non-transparent PCIe switch with a DMA engine.
If the PCIe traffic towards the PC’s RAM is generated by the Xavier CPU as Programmed I/O (PIO) we see the bandwidth as shown in the original post, i.e. ~1500 MB/s.
However, if the DMA engine generates the traffic we see more than 5200 MB/s. So the Xavier’s PCIe interface itself is not the bottleneck. The major difference from PIO is that the DMA engine can create TLPs with a 256-byte payload, while the CPU generates only 16-byte TLPs.
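The effect of payload size on link efficiency can be estimated with a simple model in which every TLP carries a fixed overhead (header, framing, sequence number, LCRC). The 24 bytes used below are an assumed round figure, not a value measured on this setup, and the model ignores DLLPs, flow-control credits and anything happening on the CPU side. Treat it as an upper bound only.

```python
ASSUMED_TLP_OVERHEAD_BYTES = 24  # assumption: header + framing + sequence number + LCRC


def payload_efficiency(payload_bytes: int,
                       overhead_bytes: int = ASSUMED_TLP_OVERHEAD_BYTES) -> float:
    """Fraction of link bytes that carry payload when every TLP has this payload size."""
    return payload_bytes / (payload_bytes + overhead_bytes)


# 16-byte TLPs (CPU PIO writes) versus 256-byte TLPs (DMA engine), as seen on the analyzer.
for payload in (16, 256):
    print(f"{payload:3d}-byte payload: {payload_efficiency(payload):.1%} of raw link bandwidth")
```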
@vidyas: Thanks for your comment. Sharing the test program is a bit complicated since it only works with our NTB driver and hardware. However, in principle it mmaps a portion of the NTB’s PCIe memory into user space (we tried various mapping attributes) and performs memcpy() from local memory to remote PCIe memory. The NTBs are set up so that traffic is redirected to the PC’s RAM.
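To show the structure of such a test without the NTB driver, here is a rough stand-in. It is not the actual benchmark: it copies into an anonymous mapping in ordinary RAM instead of a mapped PCIe BAR, and in Python the interpreter overhead dominates the timing, so the numbers say nothing about PCIe. Only the loop structure (map, copy a segment repeatedly, average the time) mirrors the description. All names are illustrative.

```python
import mmap
import time


def measure_copy(dst: mmap.mmap, src: bytes, segment_bytes: int,
                 iterations: int = 50000) -> tuple[float, float]:
    """Copy segment_bytes from src into dst repeatedly.

    Returns (average latency in us, throughput in MBytes/s).
    """
    chunk = memoryview(src)[:segment_bytes]
    with memoryview(dst) as view:           # released on exit so dst can be closed later
        start = time.perf_counter()
        for _ in range(iterations):
            view[:segment_bytes] = chunk    # stands in for memcpy() to the mapping
        elapsed = time.perf_counter() - start
    latency_us = elapsed / iterations * 1e6
    return latency_us, segment_bytes / latency_us


source = bytes(range(256)) * 256            # 64 KiB of local data
target = mmap.mmap(-1, len(source))         # anonymous mapping, NOT PCIe memory
for size in (4, 64, 4096):
    latency, rate = measure_copy(target, source, size)
    print(f"{size:5d} B  {latency:6.2f} us  {rate:9.2f} MBytes/s")
target.close()
```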
Sfr, thank you for the details about your setup. I went back and checked your first post. It seems that with DMA transfers, you get decent speed (throughput) in both directions.
I am assuming that you are using Broadcom (formerly PLX) PEX87xx switches set to NTB mode.