Tegra TX2 kernel crash

I already checked; it is there in our code base. CLKEN_OVERRIDE is set.

Earlier, the driver we are using ran on Xgene. We started using it on Tegra, and I found in the code an ARI fix written for Xgene that is not required for Tegra and is causing this error. After removing that piece of code, I no longer see the error.

Do you think that the crash is related to this error?

Well, it depends on what kind of fix is present for Xgene, because kernel 4.4 itself already supports ARI configuration if the endpoint supports ARI. If the fix is specific to the root port (i.e. Xgene) and we are applying it to Tegra without checking what the root port actually is, then it is possible that it is causing this issue. Another interesting point is that the issue occurs some time after the use case starts running rather than immediately. I would expect an ARI problem to show up immediately. Can you please provide more details about the ARI fix whose removal is bringing stability here?

This is the piece of code which was under XGENE.

static int add_fake_pci_devices(idt_g2_ntb_local_ep *local_ep)
{
	struct pci_dev *pdev = local_ep->pdev;
	struct pci_dev *fake_pdev;
	struct pci_bus *ntb_bus = pdev->bus;
	int ntb_bus_num;
	int ntb_dev_num;
	int ntb_func_num;
	int fake_devfn;
	int i = 0;
	int entry_idx[] = {8, 16, 24, 32, 40, 48, 56};

	for (i = 0; i < MAX_PEERS - 1; i++) {
		fake_pdev = pci_alloc_dev(ntb_bus);
		if (!fake_pdev) {
			LOG_ERR("Failed to create fake pci dev\n");
			return -1;
		}
		local_ep->fake_pdev[i] = fake_pdev;

		ntb_bus_num = PCI_BUS_NUM(pdev->devfn);
		ntb_dev_num = PCI_SLOT(pdev->devfn);
		ntb_func_num = PCI_FUNC(pdev->devfn);
		/* fake device has
		 * bus_num: same as NTB device
		 * dev: b10 + 3 upper bits of entry number
		 * func: 3 lower bits of entry number
		 */
		fake_devfn = (0x2 << 6) | (entry_idx[i] & 0x3f);
		fake_devfn = PCI_DEVID(ntb_bus_num, fake_devfn);

		fake_pdev->devfn = fake_devfn;
		fake_pdev->vendor = pdev->vendor;
		fake_pdev->device = pdev->device;

		pci_set_of_node(fake_pdev);

		/* copy real data from real ntb device */
		fake_pdev->sysdata = fake_pdev->bus->sysdata;
		fake_pdev->dev.parent = fake_pdev->bus->bridge;
		fake_pdev->dev.bus = &pci_bus_type;
		fake_pdev->hdr_type = pdev->hdr_type;
		fake_pdev->multifunction = pdev->multifunction;
		fake_pdev->error_state = pci_channel_io_normal;
		/* skip capability setup */
		pci_dev_assign_slot(fake_pdev);
		fake_pdev->dma_mask = 0xffffffff;
		dev_set_name(&fake_pdev->dev, "%04x:%02x:%02x.%d",
			     pci_domain_nr(fake_pdev->bus),
			     fake_pdev->bus->number,
			     PCI_SLOT(fake_pdev->devfn),
			     PCI_FUNC(fake_pdev->devfn));
		fake_pdev->revision = pdev->revision;
		fake_pdev->class = pdev->class;
		dev_dbg(&fake_pdev->dev, " [%04x:%04x] type %02x class %#08x\n",
				fake_pdev->vendor, fake_pdev->device,
				fake_pdev->hdr_type, fake_pdev->class);
		fake_pdev->cfg_size = pdev->cfg_size;
		fake_pdev->current_state = pdev->current_state;
		fake_pdev->subsystem_vendor = pdev->subsystem_vendor;
		fake_pdev->subsystem_device = pdev->subsystem_device;

		pci_device_add(fake_pdev, ntb_bus);
	}
	return 0;
}

And this is the part that enables ARI for Xgene:

pci_write_config_byte(pdev, XGENE_ARI_REG, 0x20);
pci_read_config_byte(pdev, XGENE_ARI_REG, &val8);
LOG_INFO("Read back ARI: 0x%02x\n", val8);

Based on my understanding, it looks like this code is adding fake PCIe devices (in SW) to the PCIe sub-system. I’m not clear on what this has got to do with ARI, though. Are we doing it so that the endpoint gets a specific B:D:F assigned by the PCIe sub-system? If so, doing it this way would be very nasty.
Anyway, now that things are working fine with the removal of the ARI fix, can this thread be considered for closure?

There are 2 functions called under this version check:

#if ((LINUX_VERSION_CODE != KERNEL_VERSION(3, 10, 96)) &&
(LINUX_VERSION_CODE != KERNEL_VERSION(4, 4, 15)))

One is add_fake_pci_devices() and the other is a function for Xgene that enables ARI. I am also still trying to understand why this fake device is enabled. I have commented out both. Now I have to do a long run of at least 36 hours to check for the crash. After that I will update you so this can be closed.

Yesterday it crashed after 3.45 hours. Do we still need to debug why?

I think so. Let us stick to the plan in #19 i.e. finding out what exactly is causing the PCIe errors to pop up in the log.

The error we were seeing no longer appears after we removed those two functions. Please find attached the kernel crash log.
kernelcrash-28thmay.txt (93.5 KB)

Thanks for the log.
It now looks like the error is not related to PCIe. Did you happen to observe this crash without running PCIe (data transfers over NT)? Also, if the source or destination of the data transfers is eMMC by any chance, can you please change it to something else, such as a SATA disk or a USB pen drive, and use that as the source/destination?

I suspect SW is somehow corrupting the kernel data structures.
Can you run the test with KASAN enabled?

Hi BBasu,

In the config file, is it enough to enable the two flags below, or do I have to enable anything else along with them?

CONFIG_HAVE_ARCH_KASAN=y
CONFIG_KASAN=Y

Just these two
CONFIG_KASAN=y
CONFIG_FRAME_WARN=0

Do you have any update after using KASAN?