Let me see if I understand the whole picture here correctly.
So, we have an endpoint system connected to the root port and the endpoint has a BAR exposed to the host system.
From the root port side, we are trying to write some data to the EP’s BAR, and that’s when we observe the lower-than-expected performance, right?
Here, the writes from the RP can be initiated either by the CPU or by the DMA engine embedded inside the root port itself. It looks like both methods were tried and there was no difference in performance, right?
First things first…
Since the controller is being operated in x1 configuration, please add “nvidia,update_fc_fixup;” to the DT entry of the respective controller.
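For reference, the change would look something like the fragment below (the node name/unit address here is illustrative; apply it to your board’s actual controller node):

```
pcie@141a0000 {
	/* ... existing properties of your controller node ... */
	nvidia,update_fc_fixup;
};
```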
If the CPU is being used to initiate the downstream write traffic to the EP BAR, then please note that there is a system limitation that caps the write payload size at 64 bytes. This could be one of the reasons why the performance is low. Also, in a typical PCIe system, data transfers are initiated by the DMA engine present in the endpoint, and the host CPU is not used for data-path transfers to/from the EP’s BAR.
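To make the CPU-initiated path concrete, here is a minimal kernel-space sketch (the function name and BAR index are illustrative, not taken from your setup):

```c
/* Sketch of CPU-initiated writes to an EP BAR from a kernel driver.
 * Each CPU store ends up as a small MWr TLP on the link, which is why
 * this path cannot sustain the large payloads a DMA engine can issue.
 */
#include <linux/pci.h>
#include <linux/io.h>

static int write_to_ep_bar(struct pci_dev *pdev, const void *buf, size_t len)
{
	void __iomem *bar = pci_iomap(pdev, 0 /* BAR0, illustrative */, len);

	if (!bar)
		return -ENOMEM;

	/* memcpy_toio() issues CPU stores; the controller splits these into
	 * write TLPs no larger than the system limit (64 bytes here),
	 * regardless of how big 'len' is.
	 */
	memcpy_toio(bar, buf, len);

	pci_iounmap(pdev, bar);
	return 0;
}
```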
Since you also mentioned that you used the experimental DMA functions of the root port’s DMA engine, I wanted to know how exactly you used it. It has a one-shot mode and a linked-list mode; which one is used here?
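For context, this is roughly what a one-shot transfer would look like if the RP’s DMA engine is exposed through the standard Linux dmaengine framework (this is an assumption on my side about how you are driving it; the channel name and completion handling below are illustrative). Linked-list mode, by contrast, services a chain of such descriptors in one go.

```c
/* Sketch: one-shot memcpy-style transfer via the dmaengine API,
 * assuming the RP's DMA engine is registered as a dmaengine provider.
 */
#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>

static int one_shot_copy(struct device *dev, dma_addr_t dst, dma_addr_t src,
			 size_t len)
{
	struct dma_chan *chan;
	struct dma_async_tx_descriptor *desc;
	dma_cookie_t cookie;

	chan = dma_request_chan(dev, "tx");	/* channel name is illustrative */
	if (IS_ERR(chan))
		return PTR_ERR(chan);

	/* One descriptor == one-shot transfer of 'len' bytes. */
	desc = dmaengine_prep_dma_memcpy(chan, dst, src, len, DMA_PREP_INTERRUPT);
	if (!desc) {
		dma_release_channel(chan);
		return -EINVAL;
	}

	cookie = dmaengine_submit(desc);
	dma_async_issue_pending(chan);
	dma_sync_wait(chan, cookie);	/* poll for completion; sketch only */

	dma_release_channel(chan);
	return 0;
}
```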