PCIe IRQ latency: unbounded values

I’m using the Jetson AGX as a root port and a Xilinx dev board as a PCIe endpoint.
The PCIe link is well established and data is exchanged correctly.
For my application, about 20 kB of data must be read from the FPGA DDR to the Jetson every millisecond.

I configured a user IRQ, sent by the endpoint over PCIe, to inform the Jetson that data are available.
I’m using MSI-X interrupts.
By default, the IRQ affinity is set to CPU core 3 of the Jetson AGX.

Here is a view of the IRQs associated with my PCIe driver:

```
cat /proc/interrupts | grep xdma
820:      0  0  0  0  0  0  0  0   PCI-MSI 0 Edge   xdma
821:      0  0  0  0  0  0  0  0   PCI-MSI 1 Edge   xdma
822: 144097  0  0  0  0  0  0  0   PCI-MSI 2 Edge   xdma
```

IRQ #822 is the one raised when a user interrupt is sent over PCIe from the FPGA.
IRQ #821 fires when reading data from the FPGA DDR.

I’m using a PREEMPT_RT-patched kernel and I assigned the IRQ threads to CPU core #3 (using the “taskset” command).
I also set these threads’ priority to 80 (versus 50 by default for IRQ threads).
Lastly, I isolated CPU core #3 by adding isolcpus=3 to the APPEND line in /boot/extlinux/extlinux.conf.
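
For completeness, here is a minimal user-space sketch of the same pinning and priority change done programmatically (roughly what taskset plus a real-time priority change do); the PID argument is a placeholder for the xdma IRQ thread PID found with ps:

```c
/* Minimal sketch: pin a thread (e.g. an xdma IRQ thread whose PID was found
 * with ps) to CPU core #3 and raise it to SCHED_FIFO priority 80. The PID
 * passed on the command line is a placeholder; 0 means "this thread". */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : 0;
    cpu_set_t set;
    struct sched_param sp = { .sched_priority = 80 };

    CPU_ZERO(&set);
    CPU_SET(3, &set);                            /* allow only CPU core #3 */
    if (sched_setaffinity(pid, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    if (sched_setscheduler(pid, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");            /* needs root / CAP_SYS_NICE */

    return 0;
}
```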

Thus, I expect CPU core #3 to be dedicated solely to handling these PCIe interrupts (user and read).
However, I’m experiencing some jitter, and the most problematic part is the very high latencies.
Indeed, while the average value is about 100 µs, the maximum observed value reaches several milliseconds.
→ See the following latency histogram, which represents the transfer time of my data from the FPGA to the Jetson (about 20 kB every millisecond):

Such latencies are unacceptable for my application and I absolutely must bound them.

Do you have any suggestions that could help bound the transfer time?

Are you able to successfully get interrupts serviced only by CPU-3?

Can you try after setting nvpmodel -m 0 and running jetson_clocks so that the CPUs are at max clock?

Yes, the associated interrupts are serviced only by CPU #3.

With the “ps” command I’ve checked that only the interrupts related to my PCIe application are serviced on CPU 3 and that no other process is assigned to this CPU.

I forgot to mention it, but the tests are already run after setting nvpmodel -m 0 and running the jetson_clocks script.

Please try the steps below and share whether there is any improvement.

  1. Disable cpuidle states by writing ‘1’ to the ‘/sys/devices/system/cpu/cpu*/cpuidle/state*/disable’ sysfs nodes, or disable CPU idle (“CONFIG_CPU_IDLE”) in the kernel defconfig (see the sketch after this list).
  2. Pass “nohz=off” as a kernel boot argument to disable the tickless kernel/dyntick-idle mode.
  3. Set CONFIG_HZ_1000=y to get a near-realtime timer interrupt frequency.
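
For step 1, here is a small sketch (run as root) that walks the standard cpuidle sysfs nodes and writes ‘1’ to each ‘disable’ file, equivalent to doing it with echo in a shell loop:

```c
/* Sketch for step 1: disable every cpuidle state by writing '1' to each
 * per-state "disable" sysfs node (the same paths as in the echo command).
 * Must be run as root. */
#include <glob.h>
#include <stdio.h>

int main(void)
{
    glob_t g;
    size_t i;

    if (glob("/sys/devices/system/cpu/cpu*/cpuidle/state*/disable",
             0, NULL, &g) != 0)
        return 1;                        /* no cpuidle nodes found */

    for (i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "w");
        if (!f) {
            perror(g.gl_pathv[i]);
            continue;
        }
        fputs("1", f);                   /* 1 = disable this idle state */
        fclose(f);
    }

    globfree(&g);
    return 0;
}
```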

Hi,

I’ve tried the things you mentioned.
My kernel was already built with the CONFIG_HZ_1000=y option. I nevertheless recompiled it with CONFIG_CPU_IDLE disabled.

I also passed “nohz=off”.

Unfortunately, the results are very similar: the average latency value is approximately the same and high-latency occurrences are still observed.

I don’t understand how such latencies can occur, since these IRQs and their handling are restricted to CPU #3, which is isolated from the scheduler.

I’m wondering if the way I measure latencies could introduce such jitter.
For information, the latencies are simply measured in a thread (also executing on CPU #3) with calls to clock_gettime(CLOCK_MONOTONIC, …) placed just before and just after the pread() call (responsible for reading data over PCIe via DMA).
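
For reference, here is a minimal sketch of that measurement; the XDMA device node name and the 20 kB transfer size are assumptions for illustration:

```c
/* Minimal sketch of the measurement: timestamp just before and just after
 * the pread() that performs the DMA read from FPGA DDR. The device node
 * name and the transfer size are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char buf[20 * 1024];                            /* ~20 kB per transfer */
    int fd = open("/dev/xdma0_c2h_0", O_RDONLY);    /* assumed device node */
    struct timespec t0, t1;

    if (fd < 0) {
        perror("open");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pread(fd, buf, sizeof(buf), 0);                 /* DMA read over PCIe */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long us = (t1.tv_sec - t0.tv_sec) * 1000000L +
              (t1.tv_nsec - t0.tv_nsec) / 1000L;
    printf("transfer latency: %ld us\n", us);

    close(fd);
    return 0;
}
```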

Any idea?
Help would be appreciated…

Could you try the tests below:

  1. CLOCK_MONOTONIC → CLOCK_MONOTONIC_RAW
  2. Set the node below and check if any improvement is observed:
    “echo 0x8 > /sys/kernel/debug/tegra_mce/rt_safe_mask”
  3. Try using the ‘perf’ tool (Tutorial - Perf Wiki) to measure counters like ‘cpu-cycles’ instead of clock time (see the sketch after this list).
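
For item 3, here is a sketch of counting ‘cpu-cycles’ around the pread() with perf_event_open() instead of reading the clock; the device node and transfer size are the same placeholders as above:

```c
/* Sketch for item 3: count CPU cycles spent in the pread() using
 * perf_event_open(), as an alternative to wall-clock timestamps.
 * Device node and transfer size are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    struct perf_event_attr attr;
    char buf[20 * 1024];
    uint64_t cycles = 0;
    int dev = open("/dev/xdma0_c2h_0", O_RDONLY);   /* assumed device node */

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;                              /* start disabled */

    /* pid = 0 (this thread), cpu = -1 (any CPU), no group, no flags */
    int pfd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (pfd < 0 || dev < 0) {
        perror("setup");
        return 1;
    }

    ioctl(pfd, PERF_EVENT_IOC_RESET, 0);
    ioctl(pfd, PERF_EVENT_IOC_ENABLE, 0);
    pread(dev, buf, sizeof(buf), 0);                /* DMA read over PCIe */
    ioctl(pfd, PERF_EVENT_IOC_DISABLE, 0);
    read(pfd, &cycles, sizeof(cycles));

    printf("pread took %llu cpu-cycles\n", (unsigned long long)cycles);
    return 0;
}
```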

If there are still no hints, then please share ftrace logs so we can check.

The high latency values I observed are not linked to the way I measured them.

Indeed, I recompiled the Xilinx driver in “polling” mode, so that a kernel thread constantly checks whether the transfer is done.
In the previous version, the FPGA sent an MSI-X interrupt at the end of each transfer.

The results are far better: with this “polling” mode, the latency values stay within 165 µs.
(I modified the driver so that the kernel thread runs only on CPU #3; a rough sketch is shown below.)
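
For illustration, here is a heavily simplified sketch of that approach (not the actual Xilinx xdma code); transfer_done() and complete_transfer() are hypothetical placeholders for the driver’s status check and completion handling:

```c
/* Simplified sketch (not the real xdma driver code): a polling kernel
 * thread, bound to CPU 3, that checks transfer completion in a loop instead
 * of waiting for the MSI-X interrupt. transfer_done() and complete_transfer()
 * are hypothetical placeholders. */
#include <linux/delay.h>
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static struct task_struct *poll_task;

static int xfer_poll_fn(void *engine)
{
    while (!kthread_should_stop()) {
        if (transfer_done(engine))        /* hypothetical status-register check */
            complete_transfer(engine);    /* hypothetical completion handling */
        usleep_range(5, 10);              /* brief sleep so the core is not 100% busy */
    }
    return 0;
}

static int start_polling(void *engine)
{
    poll_task = kthread_create(xfer_poll_fn, engine, "xdma_poll");
    if (IS_ERR(poll_task))
        return PTR_ERR(poll_task);

    kthread_bind(poll_task, 3);           /* run only on the isolated CPU #3 */
    wake_up_process(poll_task);
    return 0;
}
```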

I will work with this option for now.

(However, I still do not understand such latencies in “interrupt” mode, since IRQ handling is pinned to CPU #3, which is isolated from the scheduler.)

By the way, thanks for your advice.
