Critical issue with PCI-E device

Our PCI-E frame grabbers do not work properly on the Jetson TX1 platform but they do run on another AArch64 platform (X-Gene based).

After a short while, the acquisition rate drops dramatically and the Linux system becomes unresponsive.

When the problem occurs, ioread*() calls on the grabber’s mapped registers take an excessive amount of time (more than 1 ms) to complete. Because these reads occur frequently, including from the ISR, they make the whole system unresponsive.
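
For anyone who wants to reproduce the measurement, here is a minimal userspace sketch of how the worst-case latency of a register read could be timed. It is an illustration only, not our driver code: `now_ns()` and `worst_read_latency_ns()` are hypothetical helper names, and `reg` is assumed to point at a 32-bit register that has already been mapped (for example via mmap() of a BAR). Inside a kernel driver the same idea would presumably use ktime_get_ns() around the ioread32() call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Read the register `count` times and return the longest single-read
 * latency in nanoseconds. `reg` is an assumption: any already-mapped
 * 32-bit location. The last value read is stored in *last_value when
 * that pointer is not NULL.
 */
uint64_t worst_read_latency_ns(const volatile uint32_t *reg, size_t count,
                               uint32_t *last_value)
{
    uint64_t worst = 0;

    assert(reg != NULL);
    for (size_t i = 0; i < count; i++) {
        uint64_t t0 = now_ns();
        uint32_t value = *reg;          /* the timed read */
        uint64_t dt = now_ns() - t0;

        if (last_value != NULL)
            *last_value = value;
        if (dt > worst)
            worst = dt;
    }
    return worst;
}
```

A result above 1000000 ns (1 ms) would correspond to the broken state described above. The figure includes the cost of the clock calls themselves, so it is only an upper bound on the read latency.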

The PCI-E protocol analyzer we used shows that, once the system is in the broken state:

  • read requests from the Jetson reach our grabber, which responds as expected without delay (less than 5 µs)
  • the delay between the response sent by the grabber and the moment the driver on the Jetson "sees" ioread*() return is very long: ~1.3 ms
  • there are no more DLLP UpdateFC packets coming from the Jetson TX1 for a very long period of time (before the issue occurs, DLLP UpdateFC packets arrive at regular intervals)

Here are some facts and trials:

  • we have multiple Jetson TX1, they all behave the same way
  • the system never recovers from the slowed-down behavior; the Jetson has to be rebooted to return to normal
  • we have tried with L4T r24.1, r24.2 and r24.2.1
  • forcing PCI-E to Gen 1 makes the bug appear sooner: could the problem be related to data transfers reaching the bandwidth limit?
  • changing the EMC clock rate seems to improve the situation but does not solve it: the acquisitions run normally for a longer period of time
  • running jetson_clocks.sh does not solve it either
  • on r24.2.1 we removed the line iommus = <&smmu TEGRA_SWGROUP_AFI> in arch/arm64/boot/dts/tegra210-jetson-cv-base-p2597-2180-a00.dts; this did not solve the problem but slightly reduced the bandwidth
  • as suggested in https://devtalk.nvidia.com/default/topic/979635/jetson-tx1/ethernet-speed-increases-when-micro-usb-2-0-connector-is-connected/2 we tried without success to set /sys/kernel/debug/clock/floor.rate to /sys/kernel/debug/clock/floor.sclk/max
  • adding extra PCI kernel debugging does not reveal any problem
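
To put the Gen 1 observation in perspective, here is a rough sketch of the theoretical link bandwidth per generation. The line rates and encodings are the standard PCI-E ones (2.5 GT/s and 5 GT/s with 8b/10b, 8 GT/s with 128b/130b); the function names are hypothetical and the lane count is an assumption to be replaced by the negotiated width of the actual link. Usable payload throughput is lower than these figures because of TLP and DLLP overhead.

```c
#include <assert.h>

/*
 * Theoretical bandwidth of one PCI-E lane, one direction, in MB/s
 * (1 MB = 10^6 bytes), after line-encoding overhead only.
 * Returns 0.0 for generations this sketch does not cover.
 */
double pcie_lane_mbytes_per_s(int gen)
{
    switch (gen) {
    case 1:  return 2500.0 * 8.0 / 10.0 / 8.0;     /* 2.5 GT/s, 8b/10b   */
    case 2:  return 5000.0 * 8.0 / 10.0 / 8.0;     /* 5 GT/s, 8b/10b     */
    case 3:  return 8000.0 * 128.0 / 130.0 / 8.0;  /* 8 GT/s, 128b/130b  */
    default: return 0.0;
    }
}

/* Theoretical bandwidth of a whole link with `lanes` lanes. */
double pcie_link_mbytes_per_s(int gen, int lanes)
{
    assert(lanes > 0);
    return pcie_lane_mbytes_per_s(gen) * lanes;
}
```

For example, an assumed x4 link gives pcie_link_mbytes_per_s(1, 4) = 1000 MB/s at Gen 1 against 2000 MB/s at Gen 2, so a stream that fits comfortably at Gen 2 can sit much closer to the ceiling at Gen 1.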

Finally, we have noticed that the Tegra Linux kernel config sets CONFIG_PSTORE_RTRACE=y. This defines pstore_register_rtrace(), which is used inside the __raw_write*() and __raw_read*() functions, and these in turn are used inside the ioread*() and iowrite*() macros. This suggests that you are investigating issues with ioreads/iowrites on the Tegra platform. Is that correct?

Do you have any advice to help us solve this critical issue?
Thank you.

A couple of follow-up questions:

  1. What exactly is this ‘slowed down’ behavior? Do you mean that your application runs slowly while the rest of the system (the shell, etc.) stays responsive, or does the entire system become unresponsive?
  2. Did you try disabling the CONFIG_PSTORE_RTRACE config option?
  3. Are there any error prints in the log? (Running ‘dmesg -n 8’ raises the log level printed to the console.) If so, can you please attach the log?
  4. Does your device support any ASPM states? If so, can you please try adding ‘pcie_aspm=off’ to the kernel command line (to disable ASPM completely)?

We don’t have any outstanding issues with register reads/writes at this point. Given that you are using the ioread()/iowrite() APIs, is your device exposing an IO resource to the system? (Most devices expose a MEM resource.)
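
One way to check this from userspace is to look at the device's `resource` file under /sys/bus/pci/devices/, where each line holds the start address, end address and flags of one region in hexadecimal. The sketch below classifies one such line using the kernel's IORESOURCE_IO (0x100) and IORESOURCE_MEM (0x200) flag bits. The enum and function names are made up for illustration, and the sample lines in the usage note are not taken from a real device.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical classification of one line of a sysfs "resource" file. */
enum bar_kind { BAR_UNUSED, BAR_IO, BAR_MEM, BAR_OTHER };

/*
 * Parse "start end flags" (three hex numbers, with or without 0x prefix)
 * and report whether the region is an I/O port or a memory resource.
 */
enum bar_kind classify_resource_line(const char *line)
{
    unsigned long long start, end, flags;

    assert(line != NULL);
    if (sscanf(line, "%llx %llx %llx", &start, &end, &flags) != 3)
        return BAR_OTHER;       /* not in the expected format */
    if (flags == 0)
        return BAR_UNUSED;      /* region not implemented */
    if (flags & 0x100ull)       /* IORESOURCE_IO */
        return BAR_IO;
    if (flags & 0x200ull)       /* IORESOURCE_MEM */
        return BAR_MEM;
    return BAR_OTHER;
}
```

For instance, a made-up line such as "0x00000000f0000000 0x00000000f0ffffff 0x0000000000040200" is reported as BAR_MEM, while one whose flags end in 0x101 is reported as BAR_IO.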