PCIe TLP Size

Is there any way we can control the PCIe TLP size used on the NX as root port?

We observe < 400 MB/s write throughput to a simple endpoint:
NX(RP) --> x1 --> EP

(I believe we are using controller C5, not sure if that matters)

We also tried the experimental DMA functions in the NX PCIe driver subsystem, but achieved the same results.

Any suggestions for improving performance are appreciated…

TLP size is the last thing to play around with for performance tuning, as it would already have been set correctly by the Linux kernel’s PCIe subsystem.
What speed is the PCIe link operating at here? Have you observed higher performance at the same link width and speed with any host other than the NX?

Hi Vidyas! Thanks for the help earlier on other topics. :)

The link is operating at Gen3 speeds.

We have tested other root-port hosts connected to the same EP. The throughput is much higher (closer to the theoretical maximum of ~1000 MB/s).

Let me see if I understand the whole picture here correctly.
So, we have an endpoint system connected to the root port and the endpoint has a BAR exposed to the host system.
From the root port side, we are trying to write some data to the EP’s BAR, and that is when we observe the low performance, right?
Here, the writes from the RP can be initiated either by the CPU or by the DMA engine embedded inside the root port itself. It looks like both methods were tried and there is no difference in performance, right?
First things first…
Since the controller is being operated in x1 configuration, please add “nvidia,update_fc_fixup;” to the DT entry of the respective controller.
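For example (illustration only — the “pcie_c5” label below is a placeholder; put the property inside whichever node your platform .dts or overlay actually defines for the controller in use, C5 in your case):

    &pcie_c5 {                      /* placeholder label for the C5 controller node */
            status = "okay";
            nvidia,update_fc_fixup;
    };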

If the CPU is being used to initiate the downstream write traffic to the EP BAR, then please note that there is a system limitation because of which the payload size cannot be bigger than 64 bytes. This could be one of the reasons why the performance is low. Also, in a typical PCIe system, data transfers are initiated by the DMA engines present in the endpoint, and the host CPU is not used for any data-path transfers to/from the EP’s BAR.
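To illustrate the CPU-initiated path being described, here is a minimal user-space sketch (not the exact test you ran; the BDF, BAR index and transfer size are placeholders, and it assumes the BAR is mappable through sysfs):

    /*
     * Map the endpoint's BAR0 through sysfs and fill it with plain CPU
     * stores. Because the mapping is uncached device memory, each store
     * typically becomes its own small memory-write TLP on the link (and
     * the payload never exceeds the ~64-byte limit mentioned above),
     * no matter how large the overall copy is.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define XFER_SIZE (64UL * 1024 * 1024)   /* 64 MB, matching the BAR size */

    int main(void)
    {
        const char *bar = "/sys/bus/pci/devices/0000:01:00.0/resource0";
        int fd = open(bar, O_RDWR | O_SYNC);
        if (fd < 0) { perror("open BAR"); return 1; }

        void *map = mmap(NULL, XFER_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap BAR"); return 1; }

        uint64_t *src = malloc(XFER_SIZE);
        if (!src) { perror("malloc"); return 1; }
        for (size_t i = 0; i < XFER_SIZE / sizeof(uint64_t); i++)
            src[i] = i;

        /* Aligned 64-bit stores into the BAR; each one turns into a
         * separate MWr TLP carrying only a few bytes of payload. */
        volatile uint64_t *dst = (volatile uint64_t *)map;
        for (size_t i = 0; i < XFER_SIZE / sizeof(uint64_t); i++)
            dst[i] = src[i];

        munmap(map, XFER_SIZE);
        free(src);
        close(fd);
        return 0;
    }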

Since you also mentioned that you used the experimental DMA functions of the root port’s DMA engine, I wanted to know how exactly you used it. It has a one-shot mode and a linked-list mode; which one was used here?

Hi Vidyas,

Yes, you are right on the first two points:

  • We write to a large BAR and observe low performance
  • The RP can write using both the DMA and non-DMA (CPU) paths

I will test the device tree setting.

We used the function in pcie-tegra.c; not the linked-list mode, but just one large one-shot DMA (I don’t remember exactly how large anymore, at least > 64 kB and probably larger). The performance was exactly the same.

By the way: where could we check the negotiated TLP size?

In the ‘sudo lspci -vv’ output (look at the MaxPayload / MaxReadReq values reported under DevCap and DevCtl).
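If you prefer to read it programmatically rather than eyeballing lspci, here is a rough sketch (not NX-specific; the BDF below is a placeholder) that pulls the programmed MaxPayload / MaxReadReq values out of the PCI Express capability in config space:

    /* Reads the Device Control register of the PCI Express capability and
     * decodes Max_Payload_Size (bits 7:5) and Max_Read_Request_Size
     * (bits 14:12). Run as root: unprivileged reads of the config file
     * only return the first 64 bytes. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        static uint8_t cfg[4096];
        /* Example BDF; replace with the endpoint (or root port) of interest. */
        const char *path = "/sys/bus/pci/devices/0000:01:00.0/config";
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        ssize_t n = read(fd, cfg, sizeof(cfg));
        close(fd);
        if (n < 64) { fprintf(stderr, "short config read (%zd bytes)\n", n); return 1; }

        uint8_t off = cfg[0x34];                /* capabilities pointer */
        while (off && off + 1 < n) {
            uint8_t id = cfg[off], next = cfg[off + 1];
            if (id == 0x10) {                   /* PCI Express capability */
                uint16_t devctl = cfg[off + 8] | (cfg[off + 9] << 8);
                printf("MaxPayload %d bytes, MaxReadReq %d bytes\n",
                       128 << ((devctl >> 5) & 0x7),     /* bits 7:5   */
                       128 << ((devctl >> 12) & 0x7));   /* bits 14:12 */
                return 0;
            }
            off = next;
        }
        fprintf(stderr, "PCIe capability not found (need root for full config space?)\n");
        return 1;
    }

Note that this (like the DevCtl line in lspci) is only the negotiated ceiling; the payload actually carried by each TLP can still be much smaller, especially for CPU-initiated writes.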

I feel 64 kB is still small. As an experiment, can you try something like 512 MB?
Also, is the system loaded with anything other than the PCIe task?

I found our logs from the DMA test.

The verified throughputs are as follows (64 MB DMA to/from a 64 MB BAR):

  • MEM -> BAR : 844 MB/s
  • BAR -> MEM : 760 MB/s

In this case we were testing on an x2 link. Halve that and we are down to roughly the 400 MB/s we see on our x1 link.

Hi Vidyas,

I added the following to the device tree:
nvidia,update_fc_fixup;

But I still observe the low performance from RP -> EP. In the other direction (EP -> RP) the bandwidth is twice as high…

I will have a look at the negotiated link parameters.

I found some performance counters on our EP that can count 1) the total transferred size in bytes and 2) the number of TLPs. The average observed TLP payload size is 15.9 bytes, so I assume the NX is writing with a TLP payload size of 16 bytes.

Another root port we tested yielded an average payload size closer to 64 bytes.
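For what it’s worth, a rough back-of-the-envelope check (my own assumption of roughly 24 bytes of framing/header/LCRC overhead per Gen3 memory-write TLP, ignoring DLLP and flow-control traffic) is consistent with the numbers in this thread:

$$
\underbrace{\frac{8\,\mathrm{Gb/s} \times 128/130}{8\,\mathrm{bit/byte}}}_{\text{Gen3 x1 raw}} \approx 985\ \mathrm{MB/s},
\qquad
\underbrace{\frac{16}{16 + 24}}_{\text{payload efficiency}} \approx 0.40,
\qquad
985 \times 0.40 \approx 394\ \mathrm{MB/s}.
$$

With ~64-byte payloads the same estimate gives 64 / (64 + 24) ≈ 0.73, i.e. roughly 700 MB/s, which is much closer to what the other root port delivers.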