Jetson AGX Orin AER: Uncorrected (Fatal). How to recover without reboot?

Hi!

Our setup:

The video source, Jetson AGX Orin, and monitor are all grounded.

Issue:

We randomly see kernel errors like these:

  • First error type:
    pcieport 0005:00:00.0: AER: Corrected error received: 0005:00:00.0

  • Second error type:
    pcieport 0005:00:00.0: AER: Uncorrected (Fatal) error received: 0005:00:00.0
    pcieport 0005:00:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
    blackmagic-io 0005:01:00.0: AER: can't recover (no error_detected callback)

Errors of the first type have no effect on the capture card; it continues to work.
After errors of the second type the capture card stops capturing, and only a reboot helps.
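
Since only the second error type needs intervention, a monitoring script has to tell the severities apart. Here is a minimal sketch, assuming the kernel message wording matches the lines quoted above. The function name aer_severity is made up for illustration and is not part of any existing tool.

```shell
#!/bin/sh
# Classify one kernel log line by AER severity.
# Prints one of: fatal, nonfatal, corrected, none.
# The patterns are copied from the log lines in this post (assumption:
# other kernel versions may word these messages differently).
aer_severity() {
    case "$1" in
        *"AER: Uncorrected (Fatal) error received"*)     echo fatal ;;
        *"AER: Uncorrected (Non-Fatal) error received"*) echo nonfatal ;;
        *"AER: Corrected error received"*)               echo corrected ;;
        *)                                               echo none ;;
    esac
}

# Hypothetical usage: follow the kernel log and report only fatal errors.
# dmesg --follow | while IFS= read -r line; do
#     [ "$(aer_severity "$line")" = fatal ] && echo "fatal AER error: $line"
# done
```

A watchdog built on this could ignore corrected errors and start a recovery attempt only on the fatal case.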

We need a way to recover the capture card without rebooting.

We observed some cases that lead to these errors:

  1. The frequency of these errors depends on the PCIe riser. With some risers they occur several times per second; with others, once every several days.
  2. With the SDI cable connected, errors can occur randomly after the video source is switched on.
  3. With the SDI cable connected to the Decklink but not to the video source, errors can occur when we merely bring the cable close to the video source, without plugging it in. This is more likely after walking across carpet first.
  4. Our setup has a metal enclosure. The Decklink is screwed to it via a metal mounting plate that fits over the SDI connector array. Touching the enclosure by hand can trigger these errors.

What we tried:

  1. Capture card driver modification

We have the source code for the capture card driver. The original code had no pci_error_handlers, so we implemented them as follows:

static pci_ers_result_t mydevice_error_detected(struct pci_dev* dev, pci_channel_state_t state) {
	printk(KERN_INFO "mydevice PCIe error detected, state = %d\n", state);
	if (state == pci_channel_io_perm_failure) {
		return PCI_ERS_RESULT_DISCONNECT;
	}
	if (state == pci_channel_io_frozen) {
		return PCI_ERS_RESULT_NEED_RESET;
	}
	return PCI_ERS_RESULT_CAN_RECOVER;
}

static pci_ers_result_t mydevice_slot_reset(struct pci_dev *pdev) {
	bmio_device_t* dev = (bmio_device_t*)pci_get_drvdata(pdev);
	printk(KERN_INFO "mydevice PCIe slot reset\n");
	return PCI_ERS_RESULT_RECOVERED; /* must be a pci_ers_result_t value; 0 is not one */
}

static void mydevice_error_resume(struct pci_dev *pdev) {
	bmio_device_t* dev = (bmio_device_t*)pci_get_drvdata(pdev);
	printk(KERN_INFO "mydevice PCIe resume\n");
}

static struct pci_error_handlers mydevice_error_handlers = {
   .error_detected = mydevice_error_detected,
   .slot_reset = mydevice_slot_reset,
   .resume = mydevice_error_resume,
};

But after an error of the second type the capture card still did not recover, and we see logs like these:

[ 230.772720] pcieport 0005:00:00.0: AER: Corrected error received: 0005:00:00.0
[ 230.772735] pcieport 0005:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 230.788044] pcieport 0005:00:00.0: device [10de:229a] error status/mask=00000001/0000e000
[ 230.796635] pcieport 0005:00:00.0: [ 0] RxErr
[ 230.802939] pcieport 0005:00:00.0: AER: Uncorrected (Non-Fatal) error received: 0005:00:00.0
[ 230.802948] pcieport 0005:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 230.814453] pcieport 0005:00:00.0: device [10de:229a] error status/mask=00004000/00400000
[ 230.823061] pcieport 0005:00:00.0: [14] CmpltTO (First)
[ 230.830049] mydevice PCIe error detected, state = 1
[ 230.830087] mydevice PCIe resume
[ 230.830102] pcieport 0005:00:00.0: AER: device recovery successful
[ 230.830105] pcieport 0005:00:00.0: AER: Uncorrected (Fatal) error received: 0005:00:00.0
[ 230.830112] pcieport 0005:00:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[ 230.841102] pcieport 0005:00:00.0: device [10de:229a] error status/mask=00000020/00400000
[ 230.849692] pcieport 0005:00:00.0: [ 5] SDES
[ 230.855971] mydevice PCIe error detected, state = 2
[ 231.879697] pcieport 0005:00:00.0: AER: Root Port link has been reset
[ 231.879709] mydevice PCIe resume
[ 231.879755] pcieport 0005:00:00.0: AER: device recovery successful
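
When reading these logs, note that the first value in each "error status/mask" line is a bitmask, and the bracketed number on the following line is the index of the set bit. A small sketch for decoding such a word follows. The helper name aer_status_bits is hypothetical.

```shell
#!/bin/sh
# Print the indices of the bits set in a 32-bit hex word, lowest first.
# Argument: eight hex digits as printed by the kernel, e.g. 00004000.
aer_status_bits() {
    status=$((0x$1))
    bit=0
    out=""
    while [ "$bit" -lt 32 ]; do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            out="${out:+$out }$bit"   # space-separated list of bit indices
        fi
        bit=$((bit + 1))
    done
    echo "$out"
}
```

For example, aer_status_bits 00004000 prints 14, matching the "[14] CmpltTO" line, and aer_status_bits 00000020 prints 5, matching "[ 5] SDES".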

  2. Recovering the capture card with commands

We tried the script below. It recovers the capture card, but sometimes the system hangs completely.

cd /sys/bus/platform/drivers/tegra194-pcie
echo 141a0000.pcie | sudo tee unbind

cd /sys/bus/platform/drivers/reg-fixed-voltage
echo fixed-regulators:regulator@114 | sudo tee unbind

echo 349 | sudo tee /sys/class/gpio/export
echo 0 | sudo tee /sys/class/gpio/PA.01/value
echo 1 | sudo tee /sys/class/gpio/PA.01/value

echo fixed-regulators:regulator@114 | sudo tee bind

cd /sys/bus/platform/drivers/tegra194-pcie
echo 141a0000.pcie | sudo tee bind
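
For comparison, a less invasive sequence is the generic PCI sysfs remove and rescan, which leaves the tegra194-pcie controller bound. This is only a sketch and has not been tested on this setup. It does not power-cycle the slot, so whether it is enough to bring the card back after a Fatal error on Orin is an open assumption. The function name and the SYSFS_ROOT and SETTLE_SECONDS variables are made up for illustration. The address 0005:01:00.0 is the capture card as it appears in the logs above.

```shell
#!/bin/sh
# Remove one PCI function from the kernel, then rescan the bus so it is
# enumerated and probed again. Must run as root on real hardware.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
pci_remove_rescan() {
    bdf=$1
    root=${SYSFS_ROOT:-/sys}
    dev="$root/bus/pci/devices/$bdf"

    if [ ! -e "$dev/remove" ]; then
        echo "no such PCI device: $bdf" >&2
        return 1
    fi

    echo 1 > "$dev/remove" || return 1   # unbinds the driver and drops the device
    sleep "${SETTLE_SECONDS:-1}"         # assumed settle time, not a measured value
    echo 1 > "$root/bus/pci/rescan"      # re-enumerates; the driver probes again
}

# Example (as root): pci_remove_rescan 0005:01:00.0
```

If the link itself stays down after the error, the rescan will not find the device, and a fuller reset such as the script above would still be needed.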

Questions:

  1. According to this link https://lwn.net/Articles/162550/

“If slot_reset() is not supported, link_reset() can be called instead on a slot reset.”

So we think that slot_reset is not supported on Jetson Linux. How can slot reset be implemented? Can you provide a kernel patch for this?

  2. Is there a way to perform a PCIe reset from the command line?

I can’t answer this directly, but something that might be related is that hot plug is not enabled by default on Jetson PCIe. I don’t know how to enable it on Orin, but someone else may be able to explain how, so that you can try it.

The reality though is that your physical layer has inadequate signal quality. You’re going to have continuous issues trying to constantly fix this.

Thank you for the fast reply! We also found that if we replace the Jetson with an x86 machine and leave the rest of the setup the same, the error does not occur.

RF signal quality and noise depend on the length of traces, bends in the path, overall impedance, and so on. It isn’t a case of a part being either broken or working; the combination of components has to stay within impedance and trace length limits. About the only way to know for certain is to use a PCIe analyzer (which is extremely expensive). About all we know is that the physical layer is complaining.