Our PCI-E frame grabbers do not work properly on the Jetson TX1 platform, although they do run on another AArch64 platform (X-Gene based).
After a short while the acquisition rate drops dramatically and the Linux system becomes unresponsive.
When the problem occurs, ioread*() calls on the grabber's mapped registers take an excessive amount of time to complete (more than 1 ms). Because these reads occur frequently, including from the ISR, they make the system unresponsive.
The PCI-E protocol analyzer we used shows that, once the system is in the broken state:
- read requests from the Jetson reach our grabber, which responds as expected without delay (less than 5 µs)
- the delay between the grabber sending its response and the driver on the Jetson "seeing" the value returned by ioread*() is very long: ~1.3 ms
- the Jetson TX1 stops sending DLLP UpdateFC packets for a very long period of time (before the issue occurs, DLLP UpdateFC packets arrive at regular intervals)
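For anyone who wants to check the read latency without our driver, here is a minimal userspace sketch in C. It is not our driver code: it assumes the BAR is exposed through the sysfs resourceN file of the device, and the helper names (elapsed_ns, map_bar, timed_read32_ns) are made up for this illustration. The path and mapping length passed to map_bar() are assumptions that depend on the device.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Nanoseconds elapsed between two CLOCK_MONOTONIC timestamps. */
static int64_t elapsed_ns(const struct timespec *start,
                          const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (int64_t)(end->tv_nsec - start->tv_nsec);
}

/*
 * Map `len` bytes of a PCI BAR read-only through its sysfs resource file,
 * e.g. /sys/bus/pci/devices/<domain:bus:dev.fn>/resource0 (hypothetical
 * path, adjust to the actual device). Returns NULL on failure.
 */
static const volatile uint32_t *map_bar(const char *resource_path, size_t len)
{
    int fd = open(resource_path, O_RDONLY | O_SYNC);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */

    return p == MAP_FAILED ? NULL : (const volatile uint32_t *)p;
}

/*
 * Perform one 32-bit read of `*reg`, store the value in `*value` and
 * return how long the read took, in nanoseconds.
 */
static int64_t timed_read32_ns(const volatile uint32_t *reg, uint32_t *value)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    *value = *reg;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return elapsed_ns(&t0, &t1);
}
```

Calling timed_read32_ns() in a loop on a register that has no read side effects, and logging every result above some threshold (for example 1 ms), should show when the slow state begins. The measured time includes the clock_gettime() overhead, so it is only meaningful for delays well above a microsecond.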
Here are some facts and trials:
- we have multiple Jetson TX1 boards and they all behave the same way
- the system never recovers from the slowed-down state; rebooting the Jetson is needed to return to normal
- we have tried with L4T r24.1, r24.2 and r24.2.1
- forcing PCI-E to gen 1 triggers the bug more quickly: could the problem be related to data transfers reaching the bandwidth limit?
- changing the EMC clock rate seems to improve the situation but does not solve it: the acquisitions run normally for a longer period of time
- running jetson_clocks.sh does not solve it either
- on r24.2.1 we removed the line iommus = <&smmu TEGRA_SWGROUP_AFI> from arch/arm64/boot/dts/tegra210-jetson-cv-base-p2597-2180-a00.dts; this did not solve the problem and slightly reduced the bandwidth
- as suggested in https://devtalk.nvidia.com/default/topic/979635/jetson-tx1/ethernet-speed-increases-when-micro-usb-2-0-connector-is-connected/2 we tried, without success, to set /sys/kernel/debug/clock/floor.rate to /sys/kernel/debug/clock/floor.sclk/max
- adding extra PCI kernel debugging does not reveal any problem
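Regarding the gen 1 bandwidth question above, the theoretical link limit can be estimated with a small calculation. This is only a sketch: the function name is made up for illustration, the lane count is a parameter because it depends on the actual link, and the result is an upper bound that ignores TLP/DLLP protocol overhead. It relies on the standard PCI-E figures of 2.5 GT/s per lane for gen 1 and 5 GT/s per lane for gen 2, both with 8b/10b encoding.

```c
/*
 * Theoretical PCI-E line rate in MB/s for one direction, after 8b/10b
 * encoding and before any packet overhead.
 * Returns 0 for generations not handled here (gen 3 and later use a
 * different encoding).
 */
static unsigned pcie_line_rate_mb_per_s(unsigned gen, unsigned lanes)
{
    unsigned mega_transfers_per_s;

    switch (gen) {
    case 1:
        mega_transfers_per_s = 2500; /* 2.5 GT/s per lane */
        break;
    case 2:
        mega_transfers_per_s = 5000; /* 5 GT/s per lane */
        break;
    default:
        return 0;
    }

    /* 8b/10b: 8 data bits per 10 transferred bits, then 8 bits per byte. */
    return mega_transfers_per_s * 8 / 10 / 8 * lanes;
}
```

For example, pcie_line_rate_mb_per_s(1, 4) gives 1000 MB/s and pcie_line_rate_mb_per_s(2, 4) gives 2000 MB/s. Comparing the grabber's sustained acquisition rate against the gen 1 figure for the negotiated lane count would indicate whether the link is close to saturation in that configuration.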
Finally, we noticed that the Tegra Linux kernel config sets CONFIG_PSTORE_RTRACE=y, which defines pstore_register_rtrace(). This function is used inside the __raw_write*() and __raw_read*() functions, which are in turn used by the ioread*() and iowrite*() macros. This suggests that you are investigating issues with ioreads/iowrites on the Tegra platform. Is that correct?
Do you have any advice to help us solve this critical issue?