PCIe driver fails after moving to latest L4T

While communicating with a custom PCIe device that we develop, we are facing a couple of issues.
The device was working correctly on an earlier version of L4T.
The device bus-masters data transfers to and from the L4T system memory.
The system memory is allocated with a normal malloc in user space and then mapped using pci_map_sg.

Issue #1: AER (PCIe Advanced Error Reporting) now reports Data Link Layer errors.
Issue #2: We get an "Unhandled context fault" from the IOMMU.

Version of L4T with the issue:

NVIDIA Jetson TX2
L4T 32.2.1 [ JetPack UNKNOWN ]
Board: t186ref
Ubuntu 18.04.2 LTS
Kernel Version: 4.9.140-tegra

Relevant part of the dmesg log that shows the AER messages and also the context fault:

[ +0.006369] pcieport 0000:00:01.0: AER: Multiple Corrected error received: id=0020
[ +0.000107] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0008(Transmitter ID)
[ +0.010532] pcieport 0000:00:01.0: device [10de:10e5] error status/mask=00001100/00002000
[ +0.008379] pcieport 0000:00:01.0: [ 8] RELAY_NUM Rollover
[ +0.006115] pcieport 0000:00:01.0: [12] Replay Timer Timeout
[ +0.006207] camera_ipu 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0100(Transmitter ID)
[ +0.011257] camera_ipu 0000:01:00.0: device [1e53:9024] error status/mask=00001100/0000e000
[ +0.008672] camera_ipu 0000:01:00.0: [ 8] RELAY_NUM Rollover
[ +0.006605] camera_ipu 0000:01:00.0: [12] Replay Timer Timeout
[ +0.006368] pcieport 0000:00:01.0: AER: Multiple Corrected error received: id=0020
[ +0.000203] camera_ipu 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0100(Transmitter ID)
[ +0.010806] camera_ipu 0000:01:00.0: device [1e53:9024] error status/mask=00001100/0000e000
[ +0.008589] camera_ipu 0000:01:00.0: [ 8] RELAY_NUM Rollover
[ +0.006333] arm-smmu 12000000.iommu: Unhandled context fault: iova=0x17ffffc00, fsynr=0x200013, cb=21, sid=17(0x11 - AFI), pgd=22ea9e003, pud=22ea9e003, pmd=23be24003, pte=0
[ +0.000147] mc-err: vpr base=0:0, size=0, ctrl=1, override:(e01a8341, 1dc10c1, 2a800000, 2)
[ +0.000010] mc-err: (255) csw_afiw: MC request violates VPR requirements
[ +0.000006] mc-err: status = 0x00337031; addr = 0x3ffffffc0
[ +0.000005] mc-err: secure: yes, access-type: write
[ +0.000013] mc-err: unknown mcerr fault, int_status=0x00000000, ch_int_status=0x00000200, hubc_int_status=0x00000000
[ +0.000011] mc-err: unknown mcerr fault, int_status=0x00000000, ch_int_status=0x00000200, hubc_int_status=0x00000000
[ +0.000013] mc-err: unknown mcerr fault, int_status=0x00000000, ch_int_status=0x00000200, hubc_int_status=0x00000000
[ +0.000043] mc-err: Too many MC errors; throttling prints

The PCIe driver was working correctly in the following L4T version:

NVIDIA Jetson TX2
L4T 28.2.1 [ JetPack 3.3 or 3.2.1 ]
Board: t186ref
Ubuntu 16.04 LTS
Kernel Version: 4.4.38

Since bus_to_virt() is not supported in the 4.9 kernel, we have modified the driver as follows.
This is the only driver change. Could this be the reason for the unhandled context fault?

    /* before (4.4 kernel): */
    retval = remap_pfn_range(vma, mmap_start,
                             PFN_DOWN(virt_to_phys(bus_to_virt(
                                     mem_tmp->buf_list.pa_buffers[i].paddr))) +
                             mmap_pgoff,
                             mem_tmp->buf_list.pa_buffers[i].bytes,
                             vma->vm_page_prot);

    /* after (4.9 kernel, bus_to_virt() removed): */
    retval = remap_pfn_range(vma, mmap_start,
                             PFN_DOWN(mem_tmp->buf_list.pa_buffers[i].paddr) +
                             mmap_pgoff,
                             mem_tmp->buf_list.pa_buffers[i].bytes,
                             vma->vm_page_prot);

Can you please check what we are possibly doing wrong?

The latest versions of JetPack have the SMMU/IOMMU enabled for PCIe, which means the bus address and the physical address are different. So, please use the DMA APIs (refer to the kernel documentation for more info) and modify your driver accordingly to work with the latest releases. Any upstreamed driver should give enough info on how to allocate buffers and map them so that they are accessible to a PCIe endpoint device.
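For reference, a minimal sketch of the coherent-allocation path described in the kernel's DMA-API-HOWTO; the function name and the 64-bit mask are placeholders, and pdev is assumed to be the struct pci_dev handed to the driver's .probe():

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>

    /* Sketch: allocate a buffer the endpoint can bus-master into.
     * With the SMMU enabled for PCIe, *bus_addr is an IOVA, not the
     * physical address of the returned CPU pointer. */
    static void *alloc_dma_buffer(struct pci_dev *pdev, size_t size,
                                  dma_addr_t *bus_addr)
    {
            /* declare what the device can address */
            if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
                    return NULL;

            /* the CPU uses the returned pointer; *bus_addr is what the
             * endpoint's DMA engine must be programmed with */
            return dma_alloc_coherent(&pdev->dev, size, bus_addr,
                                      GFP_KERNEL);
    }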

Thank you for the response.
We have some follow-up questions on this.
We are not currently using the remap_pfn_range() call mentioned in the previous message.
We also found that pci_map_sg() directly calls dma_map_sg().

Our requirement is to allocate memory in user space, for example for a video frame.
We then want that memory to be moved to/from the PCIe device by the DMA engine in the PCIe device.
So our aim is to make this memory accessible via the IOMMU.
We are using streaming DMA with a scatter-gather list.

Our function call sequence is as follows:

    /* allocates the sg table (return value should be checked) */
    rv = sg_alloc_table(sgt, pages_nr, GFP_KERNEL);

    /* pins the user pages */
    rv = get_user_pages_fast((unsigned long)buf, pages_nr, 1 /* write */,
                    pages);

    sg = sgt->sgl;
    for (i = 0; i < pages_nr; i++, sg = sg_next(sg)) {
            unsigned int offset = offset_in_page(buf);
            unsigned int nbytes = min_t(unsigned int, PAGE_SIZE -
                                             offset, len);

            flush_dcache_page(pages[i]);
            sg_set_page(sg, pages[i], nbytes, offset);

            buf += nbytes;
            len -= nbytes;
    }

    /* directly calls dma_map_sg() internally; note: map from sgt->sgl,
     * not from the 'sg' cursor left at the end of the list above */
    nents = pci_map_sg(pdev, sgt->sgl, sgt->orig_nents, dir);

    /* iterate only over the 'nents' entries actually mapped, since
     * dma_map_sg() may coalesce entries */
    for (i = 0, sg = sgt->sgl; i < nents; i++, sg = sg_next(sg)) {
            paddr[i] = sg_dma_address(sg);
            bytes[i] = sg_dma_len(sg);
    }
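
For completeness, a sketch of the matching teardown path under the same assumptions (the same sgt, pages, pages_nr, and dir variables as above):

    /* unmap the streaming mapping first, then release the pinned pages */
    pci_unmap_sg(pdev, sgt->sgl, sgt->orig_nents, dir);

    for (i = 0; i < pages_nr; i++) {
            if (dir == PCI_DMA_FROMDEVICE)
                    set_page_dirty_lock(pages[i]); /* device wrote to it */
            put_page(pages[i]);
    }
    sg_free_table(sgt);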
  1. Will paddr be an IOMMU-accessible (i.e., IOVA) address?
  2. Why does our IOVA fall outside the expected range (0x8000_0000 ~ 0xFFF0_0000), which was working correctly for us in L4T 28.2.1?
  3. We see some differences in the dtsi related to the PCIe IOMMU configuration compared to the 28.2.1 version. Can you please give us some idea of what this change is?

I hope you are using ‘paddr’ only after making sure that its corresponding sg_dma_len() is a non-zero value.
Also, how is the ‘pdev’ that is passed to the pci_map_sg() API obtained? Is it the one passed to the API registered for .probe() in ‘struct pci_driver’, or is it obtained through the pci_get_device() API? (BTW, the latter shouldn’t be used and doesn’t work.)
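
For illustration, a sketch of the pattern being recommended here: keep the struct pci_dev handed to .probe() and use it for all mapping calls. The driver and callback names are placeholders; the vendor/device ID is taken from the dmesg log above.

    static struct pci_dev *g_pdev;  /* the pdev to pass to pci_map_sg() */

    static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
    {
            int rv = pci_enable_device(pdev);

            if (rv)
                    return rv;
            pci_set_master(pdev);   /* allow the endpoint to bus-master */
            g_pdev = pdev;          /* this, not pci_get_device(), is the handle to keep */
            return 0;
    }

    static const struct pci_device_id my_ids[] = {
            { PCI_DEVICE(0x1e53, 0x9024) },  /* vendor/device from the log above */
            { 0, }
    };

    static struct pci_driver my_driver = {
            .name     = "camera_ipu",
            .id_table = my_ids,
            .probe    = my_probe,
    };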

Hi Vidyas,

We can confirm that the value returned by sg_dma_len() is non-zero and that the pdev passed to pci_map_sg() is the same as the one passed to the API registered for .probe().

Based on this do you have any suggestions for us? Would it be possible to look at our questions above?

This needs further debugging as to why the mapped address is not coming from the SMMU pool region (of PCIe).
Do you have any test case by which this issue can be reproduced with a generic PCIe endpoint card, for example an NVMe card or a USB 3.0 add-on card?

Hi Vidyas,

Unfortunately we do not have any of these generic cards with us. Do you have any suggestions for dumping some additional logs? If possible, could you share a patch with additional prints, and we will then share the logs with you.

I have the same errors popping up, although I use the all-new Google Coral TPU M.2 Accelerator along with the Jetson Nano. Great combo! See https://blog.raccoons.be/coral-tpu-jetson-nano-performance

Solution:

Source: the Coral TPU getting-started guide, https://coral.withgoogle.com/docs/m2/get-started/


If your device includes U-Boot, see the previous HIB error for an example of how to modify the kernel commands. For certain other devices, you might instead add pcie_aspm=off to an APPEND line in your system's /boot/extlinux/extlinux.conf file:

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      INITRD /boot/initrd
      APPEND ${cbootargs} quiet pcie_aspm=off gasket.dma_bit_mask=32 swiotlb=65536