PCIE x4, only 658MB/s

Hi all,

We have tested the PCIe transfer bandwidth between a TX2 and an FPGA (soldered on the same PCB).

The FPGA issues back-to-back MWr TLPs (32-bit addressing, i.e., a 3DW TLP header) with a 128B payload
(since the Max_Payload_Size supported by the TX2 is just 128B, that is the maximum payload size for a MWr TLP),

but we observe long periods of a ‘bus-busy’ state
(during this state, no data can be transferred because the TX2 throttles
the PCIe logic inside the FPGA).

As a result, the actual PCIe throughput between the FPGA and the TX2 is only 658MB/s (much lower than expected!).

when the same ‘FPGA & firmware’ works with x86 CPU, the measured bandwidth is: 2GB/s.

What causes the difference in actual PCIe bandwidth between ‘TX2+FPGA’ and ‘x86+FPGA’?

Note that the theoretical bandwidth of PCIe Gen2 x4 (both ‘TX2+FPGA’ and ‘x86+FPGA’ were tested in this
configuration: Gen2 x4) is 2.5GB/s (considering the 64/66B coding in the PCS layer, the peak bandwidth for
PCIe Gen2 x4 is 2.5 × 64 ÷ 66 = 2.42GB/s).

Why does the TX2 perform so poorly in the PCIe bandwidth test?

LPDDR4 runs at 1600MHz (or 1866MHz?), so I guess it is not the bottleneck…

Does it have anything to do with the PCIe controller, or with something else?

I can’t answer, but you’ll want to post the output for this particular device from “sudo lspci -vvv”. In part this will show if certain options or errors exist, and what speed the bus is both running at and what the bus is rated for.

Assuming your link is operating at Gen-2 speed with x4 link width, 658 MB/s is low. You should see around 1.5GB/s net bandwidth (after all kinds of protocol overheads).
Can you please confirm the link's operating speed and width?
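
As a rough sketch of the check requested above, the negotiated and rated link parameters can be pulled out of the “lspci -vv” text with a small parser. The sample text below is illustrative, not output captured from the poster's board, and the helper name parse_link is made up for this example. The function expects the output for a single device (for instance from “lspci -s <bus:dev.fn> -vv”), because later matches overwrite earlier ones.

```python
import re

# Matches the LnkCap / LnkSta lines that "lspci -vv" prints for a PCIe device.
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")

def parse_link(lspci_text):
    """Return {'LnkCap': (speed_gt_s, width), 'LnkSta': (speed_gt_s, width)}."""
    result = {}
    for line in lspci_text.splitlines():
        m = LINK_RE.search(line)
        if m:
            result[m.group(1)] = (float(m.group(2)), int(m.group(3)))
    return result

# Illustrative sample only (an assumption, not taken from the thread).
SAMPLE = """\
        LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L0s L1
        LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+
"""

links = parse_link(SAMPLE)
for name, (speed, width) in links.items():
    print(f"{name}: {speed} GT/s x{width}")
if links.get("LnkSta") != links.get("LnkCap"):
    print("Link trained below its rated speed or width")
```

If LnkSta reports a lower speed or narrower width than LnkCap, the link itself would be the first thing to look at before any software tuning.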

BTW,

when the same ‘FPGA & firmware’ works with x86 CPU, the measured bandwidth is: 2GB/s.
I am wondering how this is calculated.
Although PCIe Gen-2 x4 offers 20 Gb/s of theoretical bandwidth, 4 Gb/s goes straight to 8b/10b encoding, leaving 16 Gb/s, and out of that come all kinds of other protocol overheads such as TLP headers, Ack/Naks, FC updates, etc.
In the case of Tegra, the available net bandwidth (after considering all of the above) is around 12.2 Gb/s, which is around 1.5 GB/s.
Also, can you please confirm whether SMMU is enabled for PCIe in this case? If yes, please remove the following from the device tree to disable SMMU for PCIe and see if there is any improvement in bandwidth:

  1. “#stream-id-cells = <1>;” from the “tegra_pcie” node
  2. “<&{/pcie-controller@10003000} TEGRA_SID_AFI>,” from the “smmu” node
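
The arithmetic in the post above can be written out as a short sketch. The 12.2 Gb/s Tegra figure and the 658 MB/s measurement are taken from this thread; the function name gen2_bandwidth is made up for the example.

```python
def gen2_bandwidth(lanes=4, gt_per_s=5.0):
    """Return (raw line rate, rate after 8b/10b) in Gb/s for a Gen2 link."""
    raw_gbps = lanes * gt_per_s        # 4 lanes x 5 GT/s = 20 Gb/s on the wire
    data_gbps = raw_gbps * 8 / 10      # 8b/10b carries 8 data bits per 10 line bits
    return raw_gbps, data_gbps

raw, data = gen2_bandwidth()
print(f"raw line rate:      {raw:.1f} Gb/s")
print(f"after 8b/10b:       {data:.1f} Gb/s = {data / 8 * 1000:.0f} MB/s")

TEGRA_NET_GBPS = 12.2                  # net figure quoted for Tegra in this thread
tegra_mb_s = TEGRA_NET_GBPS / 8 * 1000
print(f"Tegra net figure:   {tegra_mb_s:.0f} MB/s")

MEASURED_MB_S = 658                    # FPGA-to-TX2 rate reported in the first post
print(f"measured / Tegra net: {MEASURED_MB_S / tegra_mb_s:.0%}")
```

On these numbers the reported 658 MB/s is well under half of the quoted net figure, so encoding and protocol overhead alone would not account for it.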

1,
Sorry for my mistake in calculating the PCIe Gen2 link bandwidth: PCIe Gen2 still uses
an 8b/10b encode/decode PCS sublayer,
while I used 64b/66b when calculating the net bandwidth…

2,
You are correct, our test was done on a Gen2 x4 link;
this can be verified from the output of “lspci -vv”.
LnkSta clearly shows 5GT/s, x4 width.

3,
OK, I’ll try disabling the SMMU for PCIe.

BTW, the SMMU does have something to do with the PCIe bandwidth, is that right?

BTW, the SMMU does have something to do with the PCIe bandwidth, is that right?
It depends on how frequently allocations/frees are happening. It is mostly software overhead rather than hardware. The SMMU hardware itself takes very little time to translate an IOVA to a physical address.

Hi Phoenixlee. Hi vidyas.

I have a similar bandwidth problem when using a PCIe 4x USB 3.0 host controller card. I use the following card in the PCIe slot of the Jetson TX2 developer board:

No matter which or how many USB devices I connect (cameras, hard disks, network adapters), I can’t reach a bandwidth higher than 350 to 400MB/sec which is far below the limit of a PCIe 4-lane interface.

I deactivated the SMMU for PCIe in the device tree. Unfortunately, this did not bring any improvement.
Did you fix your problem with the provided SMMU patch?

Regards.

I’m not sure PCIe is the real bottleneck here. What if the bottleneck is coming from the devices that are connected?
To make sure that the PCIe bandwidth is fine, can you please connect an NVMe drive, if possible, and then check?

Hi vidyas.

I have four cameras, all of the same type. Each camera provides a data stream of 7077388 bytes per frame at 24 frames per second, i.e. about 169 MByte/s or 1.358 Gbit/s per camera.

I have connected a PCIe (4 lanes) card with 4 dedicated USB host controllers. One camera is connected per host controller. I start the video recording with the following command:

sudo jetson_clocks.sh
GST_DEBUG=2 gst-launch-1.0 -v v4l2src device=/dev/video1 ! video/x-raw,format=UYVY,width=2304,height=1536,framerate=24/1 ! fpsdisplaysink video-sink=fakesink

I run the same command for /dev/video2, /dev/video3 and /dev/video4. Everything works fine with 2 cameras. As soon as I open /dev/video3, I lose frames on camera 1 or 2, and the average throughput tops out at about 3.5 Gbit/s.
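
The per-camera and aggregate rates described above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the post (7077388 bytes per frame, 24 fps, an observed ceiling of about 3.5 Gbit/s); the function names are made up for the example.

```python
BYTES_PER_FRAME = 7077388      # per-frame size quoted in the post
FPS = 24
OBSERVED_CEILING_GBIT = 3.5    # aggregate throughput at which frames start dropping

def per_camera_gbit(bytes_per_frame=BYTES_PER_FRAME, fps=FPS):
    """Payload rate of one camera stream in Gbit/s."""
    return bytes_per_frame * fps * 8 / 1e9

def cameras_that_fit(ceiling_gbit=OBSERVED_CEILING_GBIT):
    """How many full-rate streams fit under a given aggregate ceiling."""
    return int(ceiling_gbit // per_camera_gbit())

for n in range(1, 5):
    need = n * per_camera_gbit()
    status = "fits" if need <= OBSERVED_CEILING_GBIT else "exceeds ceiling"
    print(f"{n} camera(s): {need:.2f} Gbit/s ({status})")

print("streams under the ceiling:", cameras_that_fit())
```

With these numbers two streams need roughly 2.7 Gbit/s and three need roughly 4.1 Gbit/s, which is consistent with frames being lost as soon as the third camera is opened.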

I have now tried connecting only 2 cameras to the PCIe card and a 3rd one to the Jetson's internal USB 3.0 port. In this case I also lose frames, and the overall system throughput is limited to 3.5 Gbit/s.

I repeated these attempts with USB 3.0 SSD hard disks (“dd if=/dev/sda of=/dev/null bs=10M”). The sum of the data throughput remains at 3.5Gbit/s, even if I distribute the load over the PCIe slot and the internal USB3.0 port. So I can rule out that the cameras are to blame.

In the case of cameras, tegrastats shows the following load:
RAM 552/7851MB (lfb 1731x4MB) cpu[100%@2022, off, off, 16%@2025,6%@2027,15%@2029] EMC 7%@1600 APE 150 GR3D 0%@1134

It looks like there is enough CPU power available.

I wonder if there is a Tegra internal bandwidth limit for PCIe and USB?
I can’t believe that data throughput is limited to 3.5 to 4 Gbps?
What could be the problem?

I am using a newly installed TX2 with JetPack-L4T-3.1.

Hi, I had the same problem, which I solved using the NVIDIA patch at this link:
https://devtalk.nvidia.com/default/topic/1027100/stream-4-cameras-with-gstreamer/

Seeing similar performance here: a Samsung EVO Pro 960 NVMe SSD on PCIe x4, using an M.2 adapter for the M.2 key M form factor. It should be getting at least 2000 MB/s reads, if not 3000 MB/s, but it only clocks in at around 950 MB/s.

We have the same problem with PCIe and a Xilinx FPGA.
We need fast data transfers, so we ran a benchmark on a desktop x64 PC and on the Jetson TX2 board with L4T 28.2 (JetPack 3.2). Both systems use the same OS, Xilinx PCIe driver and FPGA hardware.

Desktop PC (Intel i5 quad core, Ubuntu 16.04):

Transfer size    Rate
8 bytes          0.438 Mbyte/s
64 Kbytes        985 Mbyte/s
1 Mbyte          1.27 Gbyte/s
32 Mbytes        1.31 Gbyte/s

Nvidia Jetson TX2 Development Board:

I set nvpmodel to 0 and executed ./jetson_clocks.sh.
I cannot find anything about SMMU (tegra_pcie node or smmu node) in the decompiled device tree file.

Transfer size    Rate
8 bytes          0.116 Mbyte/s
64 Kbytes        445 Mbyte/s
1 Mbyte          751 Mbyte/s
32 Mbytes        884 Mbyte/s

We have no idea why the TX2 is so much slower.
Is there an issue with the PCIe on TX2?
I really need help to solve this!

regards
Sebastian
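
To make the two benchmark tables above easier to compare, the following sketch computes the TX2 rate as a fraction of the desktop rate for each transfer size. The numbers are copied from the post and converted to Mbyte/s, assuming 1 Gbyte/s = 1000 Mbyte/s; the dictionary names are made up for the example.

```python
# Rates in Mbyte/s, copied from the two tables in the post above.
DESKTOP = {"8 B": 0.438, "64 KB": 985.0, "1 MB": 1270.0, "32 MB": 1310.0}
TX2 = {"8 B": 0.116, "64 KB": 445.0, "1 MB": 751.0, "32 MB": 884.0}

# TX2 throughput relative to the desktop for each transfer size.
ratios = {size: TX2[size] / DESKTOP[size] for size in DESKTOP}

for size, ratio in ratios.items():
    print(f"{size:>6}: TX2 reaches {ratio:.0%} of the desktop rate")
```

The gap narrows as the transfer size grows, which may point to a fixed per-transfer cost (software or latency) on the TX2 rather than to the link rate alone, although the data here are not enough to say for certain.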

Could it be the MaxPayload size?

MaxPayload 128 bytes, MaxReadReq 512 bytes

Is there a way to increase this size?
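
As a rough way to gauge how much MaxPayload matters, the sketch below estimates the payload efficiency of a memory-write TLP on a Gen1/Gen2 link. It assumes a 3DW (12-byte) header as in the first post and no ECRC, and it ignores DLLP traffic (Ack/Nak, flow-control updates), so it is an upper bound and not a prediction. The function name tlp_efficiency is made up for the example.

```python
def tlp_efficiency(payload_bytes, header_bytes=12):
    """Fraction of link bytes that carry payload for one TLP (Gen1/Gen2 framing)."""
    # Per-TLP overhead: 1 B STP + 2 B sequence number + header + 4 B LCRC + 1 B END
    overhead = 1 + 2 + header_bytes + 4 + 1
    return payload_bytes / (payload_bytes + overhead)

POST_ENCODING_MB_S = 2000   # Gen2 x4 after 8b/10b, as stated in this thread

for mps in (128, 256, 512):
    eff = tlp_efficiency(mps)
    print(f"MaxPayload {mps:>3} B: {eff:.1%} efficient, "
          f"about {POST_ENCODING_MB_S * eff:.0f} MB/s upper bound")
```

Under these assumptions a 128-byte payload still leaves roughly 86% of the post-encoding bandwidth, so the small MaxPayload on its own would not seem to explain rates below 1000 MB/s.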

Hi,
Can you please try the following?
By default, PCIe ASPM (Active State Power Management) is enabled. Although it is not supposed to cause much reduction in bandwidth, depending on the exit latencies we might see considerable perf loss.
Execute

echo "performance" > /sys/module/pcie_aspm/parameters/policy

to disable ASPM completely.
Also, how are you measuring read perf exactly?
I tried doing that with the help of iozone tool with the following command line

iozone -ecI -+n -L64 -S32 -r16M -i0 -i1 -s10G -f iozone.temp

and I could get 1100 MB/s.
After running jetson_clocks.sh, it went up to 1300 MB/s.
BTW, 2000 MB/s is not a realistic number to achieve with Gen-2 x4: the spec-defined rate itself is 2500 MB/s, and after taking out the 20% 8b/10b overhead it is 2000 MB/s. We also need to consider protocol overhead (Acks, FCs, TLP headers, etc.), and after all of this the effective usable bandwidth that Tegra’s PCIe can offer is 12.2 Gbps, or 1.52 GBps.
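
Before changing the ASPM policy mentioned above, it may help to check which policy is currently active. The sketch below reads the same sysfs file; the kernel marks the active policy with square brackets. It returns None when the file does not exist, which is what one would expect on a kernel built without ASPM support. The helper names are made up for the example.

```python
import re
from pathlib import Path

POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")

def active_policy(text):
    """Return the bracketed entry, e.g. 'default' from '[default] performance powersave'."""
    m = re.search(r"\[(\w+)\]", text)
    return m.group(1) if m else None

def read_policy(path=POLICY_PATH):
    """Return the active ASPM policy name, or None if the sysfs file is unavailable."""
    try:
        return active_policy(path.read_text())
    except OSError:
        return None   # file absent or unreadable, e.g. kernel built without ASPM

policy = read_policy()
print("active ASPM policy:", policy if policy else "not available on this kernel")
```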

We’re having similar performance problems with 4 cameras on a PCIe USB card. I wanted to try the suggestion above before resorting to a kernel rebuild, but our TX2s don’t have /sys/module/pcie_aspm. The closest thing is /sys/module/pcie_tegra, and there’s no parameters directory there. Any suggestions?

Do you have ASPM configs (CONFIG_PCIEASPM_POWERSAVE or CONFIG_PCIEASPM_PERFORMANCE) enabled? If not, please enable them and check

So where exactly within the plethora of config files does one check if this is enabled, and how does one go about enabling it if it’s not?

A running Jetson will have a pseudo file (it’s really in RAM and given directly by a kernel driver) with an exact match of how the kernel was built which is currently running (well, “CONFIG_LOCALVERSION” won’t be an exact match, but everything else will be):

/proc/config.gz

Example:

gunzip < /proc/config.gz | egrep 'PCIEASPM'
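
The same check can be scripted. This is a minimal sketch that parses kernel config text and reports matching options, including the “is not set” form; the function names and the sample text are made up for the example, and running_kernel_config only works on a kernel that exposes /proc/config.gz.

```python
import gzip
import re

def parse_kconfig(text, pattern="PCIEASPM"):
    """Map matching CONFIG_ options to their value, or None if 'is not set'."""
    opts = {}
    for line in text.splitlines():
        m = re.match(r"(CONFIG_\w+)=(.*)", line)
        if m and pattern in m.group(1):
            opts[m.group(1)] = m.group(2)
            continue
        m = re.match(r"# (CONFIG_\w+) is not set", line)
        if m and pattern in m.group(1):
            opts[m.group(1)] = None
    return opts

def running_kernel_config(path="/proc/config.gz"):
    """Return the running kernel's config as text (requires /proc/config.gz)."""
    with gzip.open(path, "rt") as f:
        return f.read()

# Illustrative sample only; on a Jetson, pass running_kernel_config() instead.
SAMPLE = "CONFIG_PCI=y\n# CONFIG_PCIEASPM is not set\n"
print(parse_kconfig(SAMPLE))
```

On the target board, parse_kconfig(running_kernel_config()) would give the same information as the gunzip pipeline above.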

Thanks.

CONFIG_PCIEASPM is not set

That confirms that ASPM is not running, but I’m not sure how to enable it. I haven’t built a kernel since before Linux existed.

I remember those days…toiling over stone tablets with a chisel…

Once you are set up for a kernel build, it isn’t difficult. However, you have to remember that the install procedure isn’t the same on all releases, and it definitely isn’t the same as on a desktop PC. Start here, which leads to a second URL:
https://devtalk.nvidia.com/default/topic/1012382/jetson-tx2/usb-wifi-adapter-s-/post/5290682/#5290682

Do note that this is for native build, and comments within talk about cross compile from a PC. If you wish to cross compile, then there are just a couple of extra steps. The official Documentation link with any given L4T release gives details on cross compile.