PCIe driver fails after moving to latest L4T

While communicating with a custom PCIe device that we develop, we are facing a couple of issues.
The device was working correctly on an earlier version of L4T.
The device bus-masters data transfers to and from the L4T system memory.
The system memory is allocated with a normal malloc in user space and then mapped using pci_map_sg.

Issue #1: AER (PCIe Advanced Error Reporting) now reports Data Link Layer errors.
Issue #2: We get an "Unhandled context fault" from the IOMMU.

Version of L4T with the issue:

NVIDIA Jetson TX2
L4T 32.2.1 [ JetPack UNKNOWN ]
Board: t186ref
Ubuntu 18.04.2 LTS
Kernel Version: 4.9.140-tegra

Relevant part of the dmesg log that shows the AER messages and also the context fault:

[ +0.006369] pcieport 0000:00:01.0: AER: Multiple Corrected error received: id=0020
[ +0.000107] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0008(Transmitter ID)
[ +0.010532] pcieport 0000:00:01.0: device [10de:10e5] error status/mask=00001100/00002000
[ +0.008379] pcieport 0000:00:01.0: [ 8] RELAY_NUM Rollover
[ +0.006115] pcieport 0000:00:01.0: [12] Replay Timer Timeout
[ +0.006207] camera_ipu 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0100(Transmitter ID)
[ +0.011257] camera_ipu 0000:01:00.0: device [1e53:9024] error status/mask=00001100/0000e000
[ +0.008672] camera_ipu 0000:01:00.0: [ 8] RELAY_NUM Rollover
[ +0.006605] camera_ipu 0000:01:00.0: [12] Replay Timer Timeout
[ +0.006368] pcieport 0000:00:01.0: AER: Multiple Corrected error received: id=0020
[ +0.000203] camera_ipu 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0100(Transmitter ID)
[ +0.010806] camera_ipu 0000:01:00.0: device [1e53:9024] error status/mask=00001100/0000e000
[ +0.008589] camera_ipu 0000:01:00.0: [ 8] RELAY_NUM Rollover
[ +0.006333] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x17ffffc00, fsynr=0x200013, cb=21, sid=17(0x11 - AFI), pgd=22ea9e003, pud=22ea9e003, pmd=23be24003, pte=0
[ +0.000147] mc-err: vpr base=0:0, size=0, ctrl=1, override:(e01a8341, 1dc10c1, 2a800000, 2)
[ +0.000010] mc-err: (255) csw_afiw: MC request violates VPR requirements
[ +0.000006] mc-err: status = 0x00337031; addr = 0x3ffffffc0
[ +0.000005] mc-err: secure: yes, access-type: write
[ +0.000013] mc-err: unknown mcerr fault, int_status=0x00000000, ch_int_status=0x00000200, hubc_int_status=0x00000000
[ +0.000011] mc-err: unknown mcerr fault, int_status=0x00000000, ch_int_status=0x00000200, hubc_int_status=0x00000000
[ +0.000013] mc-err: unknown mcerr fault, int_status=0x00000000, ch_int_status=0x00000200, hubc_int_status=0x00000000
[ +0.000043] mc-err: Too many MC errors; throttling prints

The PCIe driver was working correctly in the following L4T version:

NVIDIA Jetson TX2
L4T 28.2.1 [ JetPack 3.3 or 3.2.1 ]
Board: t186ref
Ubuntu 16.04 LTS
Kernel Version: 4.4.38

Since bus_to_virt() is not supported in the 4.9 kernel, we have modified the driver as follows.
This is the only driver change. Could this be the reason for the unhandled context fault?

    /* before (4.4 kernel): */
    retval = remap_pfn_range(vma, mmap_start,
                             PFN_DOWN(virt_to_phys(bus_to_virt(
                                     mem_tmp->buf_list.pa_buffers[i].paddr))) +
                             mmap_pgoff,
                             mem_tmp->buf_list.pa_buffers[i].bytes,
                             vma->vm_page_prot);

    /* after (4.9 kernel, bus_to_virt() removed): */
    retval = remap_pfn_range(vma, mmap_start,
                             PFN_DOWN(mem_tmp->buf_list.pa_buffers[i].paddr) +
                             mmap_pgoff,
                             mem_tmp->buf_list.pa_buffers[i].bytes,
                             vma->vm_page_prot);

Can you please check what we are possibly doing wrong?

The latest versions of JetPack have the SMMU/IOMMU enabled for PCIe, which means the bus address and the physical address are different. So, please use the DMA APIs (refer to the kernel documentation for more info) and modify your driver accordingly to work with the latest releases. Any upstreamed driver should give enough info on how to allocate buffers and map them so that they are accessible to a PCIe endpoint device.
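For reference, a minimal sketch of the coherent-allocation path described in the kernel's DMA-API-HOWTO; the function name and the 64-bit mask are placeholders, and pdev is assumed to be the struct pci_dev handed to the driver's .probe():

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>

    /* Sketch: allocate a buffer the endpoint can bus-master into.
     * With the SMMU enabled for PCIe, *bus_addr is an IOVA, not the
     * physical address of the returned CPU pointer. */
    static void *alloc_dma_buffer(struct pci_dev *pdev, size_t size,
                                  dma_addr_t *bus_addr)
    {
            /* declare what the device can address */
            if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
                    return NULL;

            /* the CPU uses the returned pointer; *bus_addr is what the
             * endpoint's DMA engine must be programmed with */
            return dma_alloc_coherent(&pdev->dev, size, bus_addr,
                                      GFP_KERNEL);
    }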

Thank you for the response.
We have some follow-up questions on this.
We are not currently using the remap_pfn_range() call mentioned in the previous message.
We also found that pci_map_sg() directly calls dma_map_sg().

Our requirement is to allocate memory in user space, for example for a video frame.
We then want that memory to be moved to/from the PCIe device by the DMA engine in the PCIe device.
So our aim is to make this memory accessible via the IOMMU.
We are using streaming DMA with a scatter-gather list.

Our function call sequence is as follows:

    /* allocates the sg table (return value should be checked) */
    rv = sg_alloc_table(sgt, pages_nr, GFP_KERNEL);

    /* pins the user pages */
    rv = get_user_pages_fast((unsigned long)buf, pages_nr, 1 /* write */,
                    pages);

    sg = sgt->sgl;
    for (i = 0; i < pages_nr; i++, sg = sg_next(sg)) {
            unsigned int offset = offset_in_page(buf);
            unsigned int nbytes = min_t(unsigned int, PAGE_SIZE -
                                             offset, len);

            flush_dcache_page(pages[i]);
            sg_set_page(sg, pages[i], nbytes, offset);

            buf += nbytes;
            len -= nbytes;
    }

    /* directly calls dma_map_sg() internally; note: map from sgt->sgl,
     * not from the 'sg' cursor left at the end of the list above */
    nents = pci_map_sg(pdev, sgt->sgl, sgt->orig_nents, dir);

    /* iterate only over the 'nents' entries actually mapped, since
     * dma_map_sg() may coalesce entries */
    for (i = 0, sg = sgt->sgl; i < nents; i++, sg = sg_next(sg)) {
            paddr[i] = sg_dma_address(sg);
            bytes[i] = sg_dma_len(sg);
    }
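
For completeness, a sketch of the matching teardown path under the same assumptions (the same sgt, pages, pages_nr, and dir variables as above):

    /* unmap the streaming mapping first, then release the pinned pages */
    pci_unmap_sg(pdev, sgt->sgl, sgt->orig_nents, dir);

    for (i = 0; i < pages_nr; i++) {
            if (dir == PCI_DMA_FROMDEVICE)
                    set_page_dirty_lock(pages[i]); /* device wrote to it */
            put_page(pages[i]);
    }
    sg_free_table(sgt);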
  1. Will paddr be an IOMMU-accessible (i.e., IOVA) address?
  2. Why does our IOVA fall outside the expected range (0x8000_0000 ~ 0xFFF0_0000), which was working correctly for us in L4T 28.2.1?
  3. We see some differences in the dtsi related to the PCIe IOMMU configuration compared to the 28.2.1 version. Can you please give us some idea of what this change is?

I hope you are using ‘paddr’ only after making sure that its corresponding sg_dma_len() is a non-zero value.
Also, how is the ‘pdev’ that is passed to the pci_map_sg() API obtained? Is it the one passed to the API registered for .probe() in ‘struct pci_driver’, or is it obtained through the pci_get_device() API? (BTW, the latter shouldn’t be used and doesn’t work.)
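
For illustration, a sketch of the pattern being recommended here: keep the struct pci_dev handed to .probe() and use it for all mapping calls. The driver and callback names are placeholders; the vendor/device ID is taken from the dmesg log above.

    static struct pci_dev *g_pdev;  /* the pdev to pass to pci_map_sg() */

    static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
    {
            int rv = pci_enable_device(pdev);

            if (rv)
                    return rv;
            pci_set_master(pdev);   /* allow the endpoint to bus-master */
            g_pdev = pdev;          /* this, not pci_get_device(), is the handle to keep */
            return 0;
    }

    static const struct pci_device_id my_ids[] = {
            { PCI_DEVICE(0x1e53, 0x9024) },  /* vendor/device from the log above */
            { 0, }
    };

    static struct pci_driver my_driver = {
            .name     = "camera_ipu",
            .id_table = my_ids,
            .probe    = my_probe,
    };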

Hi Vidyas,

We can confirm that the value returned by sg_dma_len() is non-zero and that the pdev passed to pci_map_sg() is the same as the one passed to the API registered for .probe().

Based on this do you have any suggestions for us? Would it be possible to look at our questions above?

This needs further debugging as to why the mapped address is not coming from the SMMU pool region (of PCIe).
Do you have any test case by which this issue can be reproduced with a generic PCIe endpoint card, for example an NVMe card or a USB 3.0 add-on card?

Hi Vidyas,

Unfortunately we do not have any of these generic cards with us. Do you have any suggestions for dumping some additional logs? If possible, could you share a patch with additional prints, and we will then share the logs with you.

I have the same errors popping up, although I use the all-new Google Coral TPU M.2 Accelerator along with the Jetson Nano. Great combo! See https://blog.raccoons.be/coral-tpu-jetson-nano-performance

Solution:

Source: the Coral TPU getting-started guide, https://coral.withgoogle.com/docs/m2/get-started/


If your device includes U-Boot, see the previous HIB error for an example of how to modify the kernel commands. For certain other devices, you might instead add pcie_aspm=off to an APPEND line in your system's /boot/extlinux/extlinux.conf file:

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      INITRD /boot/initrd
      APPEND ${cbootargs} quiet pcie_aspm=off gasket.dma_bit_mask=32 swiotlb=65536