Xavier 10G PCIe switch bandwidth is low, [ksoftirqd/0] load is too high

We use the JetPack 4.6 filesystem. The Xavier PCIE_C4 controller connects to a Marvell 2-lane Gen3 PCIe 10G switch. The link seems to work fine, but when iperf3 was used to test the bandwidth, it only reached 1 Gbps, and the [ksoftirqd/0] load is close to 100%.

The pcie device tree is set to:
pcie@14160000 {
	status = "okay";

	nvidia,pex-wake = <&tegra_main_gpio TEGRA194_MAIN_GPIO(L, 2)>;
	vddio-pex-ctl-supply = <&p2888_spmic_sd3>;
	nvidia,disable-aspm-states = <0xf>;
	nvidia,max-speed = <3>;
	num-lanes = <2>;

	phys = <&p2u_8>,

	phy-names = "pcie-p2u-0", "pcie-p2u-1";
};

Could you please give me some advice on how to solve this problem? Thank you!

Have you tried setting the system to maximum performance to see if that improves it?
See NVIDIA Jetson Linux Driver Package Software Features : Clock Frequency and Power Management | NVIDIA Docs

Yes, I ran sudo nvpmodel -m 0; sudo jetson_clocks to set the system to maximum performance.
It slightly increases the measured bandwidth, but the ksoftirqd/0 occupancy is still high.

I am curious: what do you see from the following before you test, and then again after the test has been running for a short time?

egrep '(CPU|qos|ether|^IPI)' /proc/interrupts
# Note the following shows "ksoftirqd/number", where "number" is 0-based core (1 ksoft per core):
ps -eo pid,tid,pri,class,pcpu,cmd | egrep '(ksoft|COMMAND|UID|PID|CMD)' | egrep -v grep

One of the weaknesses of Jetsons is that many hardware IRQs must be on CPU0, which could cause IRQ starvation. The “/proc/interrupts” file is strictly about hardware IRQ. I wouldn’t think ksoftirqd would have this issue since it can run on any CPU, but those IRQs of course must be handed off from the hardware IRQ. If the hardware IRQ is running too fast, then ksoftirqd would actually be starved (not what you described, but it would be interesting to see if the hardware IRQ producing the software IRQ is saturated, or if instead a non-saturated hardware IRQ produces a saturated software IRQ).
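To see whether the network software IRQ work itself is piling up on CPU0, the per-CPU softirq counters can be watched alongside the hardware IRQ counts (a small sketch; `/proc/softirqs` is standard on Linux):

```shell
# Per-CPU softirq counters: if NET_RX/NET_TX grow only in the CPU0
# column during the iperf3 run, the software IRQ work is stuck on the
# same core as the hardware IRQ that feeds it.
grep -E '(CPU|NET_RX|NET_TX)' /proc/softirqs
```

Running this before and during the test, and comparing the deltas per column, shows whether the softirq load is concentrated on one core.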

Before the iperf3 test:

During the iperf3 test:

During iperf3 testing, I saw the ksoftirqd load slowly increase on CPU0. When I stopped the test, it came back down slowly.

Please update max-speed to 4 instead of 3; otherwise you will fall back to Gen1.

  nvidia,max-speed = <4>;
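After changing the property and reflashing, the negotiated speed can be confirmed from the running system (a hedged sketch; the device address 0004:01:00.0 is taken from the lspci output in this thread, so adjust it for your topology):

```shell
# On the Jetson (requires pciutils):
#   sudo lspci -s 0004:01:00.0 -vv | grep -E 'LnkCap:|LnkSta:'
# LnkCap shows what the link supports; LnkSta shows what actually trained.
# Parsing a healthy sample line for the speed field:
sta='LnkSta: Speed 8GT/s, Width x2'
echo "$sta" | grep -o '8GT/s'
```

"Speed 8GT/s, Width x2" in LnkSta means a Gen3 x2 link was negotiated.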

I tried, but nothing improved. In fact, I had originally set max-speed to 4.

Some observations regarding the IRQ info and work load…

I saw only TX and general IRQ for hardware, I did not see any RX interrupt in hardware. It would be useful to see logs after the test had been going for some time, but note that only CPU0 is used for hardware IRQ. Software IRQ (which is ksoftirqd) could in theory migrate, but it is all on CPU0. Typically the scheduler, if naive, will try to keep the software IRQ on the same core as a means of avoiding cache misses, but if the software IRQ adds too much of a load to the CPU0, and starts starving hardware IRQ servicing, then it is probably better off migrating to a new core and living with the cache miss. I’m not certain with this hardware the best way to test migrating ksoftirq/0 to another core (e.g., ksoftirqd/7), but it would be an interesting test.

I do wonder though why the ksoftirq/0 is so high. Network servicing is normally a significant part of workload, and it might just be the fact that iperf is purposely trying to load the system down as a test, but I’d think it would perform better even under those circumstances. I’d really like to see what happens if “ksoftirqd/0” becomes “ksoftirqd/7”.
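One way to effectively move the network softirq work off CPU0 is Receive Packet Steering (RPS), which hands receive-side protocol processing to other cores' softirq contexts. A hedged sketch, assuming the interface is named eth0 (substitute the Marvell NIC's actual name):

```shell
# RPS takes a hex bitmap of CPUs allowed to do receive processing.
# Mask for CPUs 1-7, leaving CPU0 free for the hardware IRQ:
mask=$(printf '%x' $(( (1 << 8) - 2 )))
echo "$mask"   # fe
# Apply it to the NIC's receive queue (needs root, run on the target):
#   echo $mask | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
```

If ksoftirqd load then appears on CPUs 1-7 and throughput rises, the CPU0 bottleneck theory is confirmed.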

Could you share your PCIe link status with max-speed = 4?

sudo lspci -vvv:

0004:01:00.0 Ethernet controller: Marvell Technology Group Ltd. Device 0f13 (rev 01)
Subsystem: Marvell Technology Group Ltd. Device abcd
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 39
Region 0: Memory at 1740000000 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at 1740100000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/32 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <1us, L1 <32us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=2 offset=00000000
PBA: BAR=2 offset=00001000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [158 v1] #19
Capabilities: [168 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
L1SubCtl2: T_PwrOn=40us
Capabilities: [178 v1] #22
Capabilities: [184 v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
Capabilities: [284 v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
Kernel driver in use: oak

The link status is as expected. Would you mind sharing your iperf3 command line?
What MTU size are you testing with? Could you test with 64K and share the results here?

When the size of MTU is set to 1500:

When the size of MTU is set to 9000:

For bandwidth tests, UDP is recommended; the bottleneck would be in the Ethernet stack.
You can try MTU size = 64K with UDP to see the improvement, but overall, PCIe Gen3 with 2 lanes tops out at about 16 Gbps theoretically. You can also monitor the IRQ traffic with MTU 64K to see the difference.
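The ~16 Gbps figure can be sanity-checked: PCIe Gen3 runs at 8 GT/s per lane with 128b/130b encoding, so two lanes give just under 16 Gbps of raw link bandwidth before protocol overhead. A sketch of the arithmetic plus a UDP iperf3 invocation (interface name and server IP are placeholders):

```shell
# Gen3 x2 raw bandwidth in Gbps (scaled by 100 for two decimal places):
echo $(( 8 * 2 * 128 * 100 / 130 ))   # -> 1575, i.e. ~15.75 Gbps
# A UDP run at a large datagram size (start "iperf3 -s" on the peer first):
#   sudo ip link set dev eth0 mtu 9000
#   iperf3 -c 192.168.1.1 -u -b 10G -l 63K -t 30
```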

For further discussion, please update the thread with the test results.

I tried MTU size = 64K; the test result is the same as MTU size = 9K.
When the size of MTU is set to 65536:

With a UDP test at MTU size = 64K, the [ksoftirqd/0] load is not high:

With a TCP test at MTU size = 64K, the [ksoftirqd/0] load is very high:

So I think the UDP test bandwidth goes up because it does not increase the [ksoftirqd/0] load. Is there a way to solve the Ethernet stack bottleneck for TCP?

That is determined by the Ethernet protocol and the stack; it is not comparable to PCIe, which just moves raw data.

Since you can already get 8 Gbps+ performance, you are close to the peak. If you still want to improve the performance, I would suggest consulting the vendor about jumbo frame support.
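Besides jumbo frames, it may be worth checking whether receive offloads and interrupt coalescing are enabled, since GRO in particular cuts the per-packet work that TCP pushes through ksoftirqd. A hedged sketch; "eth0" is a placeholder, and support depends on the Marvell "oak" driver:

```shell
# Inspect and enable offloads on the target NIC:
#   ethtool -k eth0 | grep -E 'generic-receive-offload|tcp-segmentation-offload'
#   sudo ethtool -K eth0 gro on
#   sudo ethtool -C eth0 rx-usecs 100   # coalesce RX IRQs, if the driver allows
# ethtool -k prints one "feature: on/off" line per offload; filtering for
# disabled features on a sample of that output:
sample='generic-receive-offload: on
tcp-segmentation-offload: off'
echo "$sample" | grep ': off'
```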

The perf test is meant to demonstrate the capability; this level of performance may not always be reproducible in your real use case.