Tegra Tx2 kernel crash

kalpana.jeevraj · May 27, 2019, 10:56am

I already checked; it's there in our code base. CLKEN_OVERRIDE is set.

kalpana.jeevraj · May 27, 2019, 12:35pm

The driver we are using originally ran on Xgene. When we started using it on Tegra, I found a fix in the code that was written for ARI on Xgene and is not required on Tegra; that fix is what produces this error. After removing that piece of code, I no longer see the error.

Do you think the crash is related to this error?

vidyas · May 27, 2019, 2:00pm

Well, it depends on what kind of fix is present for Xgene, because Kernel-4.4 itself does support ARI configuration if the endpoint supports ARI. If the fix is specific to the root port (i.e. Xgene) and we are applying it on Tegra without checking what the root port is, then it is possible that it is causing this issue. Another interesting point is that the issue occurs some time after the use case starts running rather than immediately; I would expect an ARI problem to show up immediately. Can you please provide more details about the ARI fix whose removal brings stability here?

kalpana.jeevraj · May 28, 2019, 6:09am

This is the piece of code that was under XGENE.

static int add_fake_pci_devices(idt_g2_ntb_local_ep *local_ep)
{
	struct pci_dev *pdev = local_ep->pdev;
	struct pci_dev *fake_pdev;
	struct pci_bus *ntb_bus = pdev->bus;
	int ntb_bus_num;
	int ntb_dev_num;
	int ntb_func_num;
	int fake_devfn;
	int i = 0;
	int entry_idx[] = {8, 16, 24, 32, 40, 48, 56};

	for (i = 0; i < MAX_PEERS - 1; i++) {
		fake_pdev = pci_alloc_dev(ntb_bus);
		if (!fake_pdev) {
			LOG_ERR("Failed to create fake pci dev\n");
			return -1;
		}
		local_ep->fake_pdev[i] = fake_pdev;

		ntb_bus_num = PCI_BUS_NUM(pdev->devfn);
		ntb_dev_num = PCI_SLOT(pdev->devfn);
		ntb_func_num = PCI_FUNC(pdev->devfn);
		/* fake device has
		 * bus_num: same as NTB device
		 * dev: b10 + 3 upper bits of entry number
		 * func: 3 lower bits of entry number
		 */
		fake_devfn = (0x2 << 6) | (entry_idx[i] & 0x3f);
		fake_devfn = PCI_DEVID(ntb_bus_num, fake_devfn);

		fake_pdev->devfn = fake_devfn;
		fake_pdev->vendor = pdev->vendor;
		fake_pdev->device = pdev->device;

		pci_set_of_node(fake_pdev);

		/* copy real data from real ntb device */
		fake_pdev->sysdata = fake_pdev->bus->sysdata;
		fake_pdev->dev.parent = fake_pdev->bus->bridge;
		fake_pdev->dev.bus = &pci_bus_type;
		fake_pdev->hdr_type = pdev->hdr_type;
		fake_pdev->multifunction = pdev->multifunction;
		fake_pdev->error_state = pci_channel_io_normal;
		/* skip capability setup */
		pci_dev_assign_slot(fake_pdev);
		fake_pdev->dma_mask = 0xffffffff;
		dev_set_name(&fake_pdev->dev, "%04x:%02x:%02x.%d",
			     pci_domain_nr(fake_pdev->bus),
			     fake_pdev->bus->number,
			     PCI_SLOT(fake_pdev->devfn),
			     PCI_FUNC(fake_pdev->devfn));
		fake_pdev->revision = pdev->revision;
		fake_pdev->class = pdev->class;
		dev_dbg(&fake_pdev->dev, " [%04x:%04x] type %02x class %#08x\n",
				fake_pdev->vendor, fake_pdev->device,
				fake_pdev->hdr_type, fake_pdev->class);
		fake_pdev->cfg_size = pdev->cfg_size;
		fake_pdev->current_state = pdev->current_state;
		fake_pdev->subsystem_vendor = pdev->subsystem_vendor;
		fake_pdev->subsystem_device = pdev->subsystem_device;

		pci_device_add(fake_pdev, ntb_bus);
	}
	return 0;
}

pci_write_config_byte(pdev, XGENE_ARI_REG, 0x20);
pci_read_config_byte(pdev, XGENE_ARI_REG, &val8);
LOG_INFO("Read back ARI: 0x%02x\n", val8);
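To make the B:D:F arithmetic in add_fake_pci_devices() easier to follow, here is a minimal user-space sketch of the same bit manipulation. The lowercase helpers mirror the kernel's PCI_SLOT, PCI_FUNC, PCI_BUS_NUM and PCI_DEVID macros as I understand them from include/linux/pci.h; fake_devfn_for_entry() is a hypothetical name introduced only for this illustration and is not part of the driver.

```c
#include <assert.h>

/* Lowercase stand-ins for the kernel macros in include/linux/pci.h. */
static unsigned pci_slot(unsigned devfn)    { return (devfn >> 3) & 0x1f; }
static unsigned pci_func(unsigned devfn)    { return devfn & 0x07; }
static unsigned pci_bus_num(unsigned devid) { return (devid >> 8) & 0xff; }
static unsigned pci_devid(unsigned bus, unsigned devfn)
{
	return ((bus & 0xff) << 8) | devfn;
}

/*
 * Hypothetical helper: the same expression the posted loop uses.
 * Bits 7:6 are forced to b10, bits 5:0 carry the entry number.
 */
static unsigned fake_devfn_for_entry(unsigned entry)
{
	return (0x2u << 6) | (entry & 0x3f);
}
```

With the entry_idx values from the post (8, 16, ... 56), fake_devfn_for_entry() yields 0x88, 0x90, ... 0xb8, which decode to slots 17 through 23, each at function 0, because every entry is a multiple of 8. Note also that pci_bus_num() applied to an 8-bit devfn always returns 0, so if pdev->devfn only holds the device/function byte (as it normally does), the posted code would compute ntb_bus_num as 0 regardless of the bus the NTB device actually sits on. That may or may not be intended.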

vidyas · May 28, 2019, 6:36am

Based on my understanding, this code adds fake PCIe devices (in SW) to the PCIe sub-system. I’m not clear on what this has to do with ARI, though. Are we doing it so that the endpoint gets a specific B:D:F assigned by the PCIe sub-system? If so, doing it this way would be very nasty.
Anyway, now that things are working fine with the removal of the ARI fix, can this thread be considered for closure?

kalpana.jeevraj · May 28, 2019, 7:11am

There are 2 functions called under this version check:

#if ((LINUX_VERSION_CODE != KERNEL_VERSION(3, 10, 96)) &&
(LINUX_VERSION_CODE != KERNEL_VERSION(4, 4, 15)))

One was add_fake_pci_devices(), and the other is a function for Xgene to enable ARI. I am also trying to understand why these fake devices are enabled. I commented out both. Now I have to do a long run of at least 36 hours to check for the crash. After that I will update you so this can be closed.
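For reference, the version check above compiles the two functions in on every kernel except exactly 3.10.96 and 4.4.15. Below is a small user-space sketch of that logic, assuming the classic KERNEL_VERSION(a, b, c) encoding of (a << 16) + (b << 8) + c used by kernels of this era. The function names are hypothetical, and 4.4.38 in the note that follows is only an assumed example of a Tegra 4.4-series kernel, not a version stated in this thread.

```c
#include <assert.h>

/* Stand-in for the kernel's KERNEL_VERSION(a, b, c) macro (classic form). */
static unsigned kernel_version(unsigned a, unsigned b, unsigned c)
{
	return (a << 16) + (b << 8) + c;
}

/*
 * Hypothetical helper mirroring the #if condition from the driver:
 * returns 1 when the guarded code would be compiled in.
 */
static int guarded_code_is_built(unsigned version_code)
{
	return version_code != kernel_version(3, 10, 96) &&
	       version_code != kernel_version(4, 4, 15);
}
```

Under that encoding, guarded_code_is_built(kernel_version(4, 4, 38)) returns 1, so a 4.4.38 build would include both add_fake_pci_devices() and the Xgene ARI write, while builds for exactly 4.4.15 or 3.10.96 would exclude them.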

kalpana.jeevraj · May 29, 2019, 9:26am

Yesterday it crashed after 3.45 hours. We still need to debug why.

vidyas · May 29, 2019, 11:21am

I think so. Let us stick to the plan in #19 i.e. finding out what exactly is causing the PCIe errors to pop up in the log.

kalpana.jeevraj · May 29, 2019, 1:49pm

The error we were seeing no longer appears after we removed those 2 functions. PFA the kernel crash log.
kernelcrash-28thmay.txt (93.5 KB)

vidyas · May 30, 2019, 10:14am

Thanks for the log.
It now looks like the error is not related to PCIe. Did you happen to observe this crash without running PCIe (data transfers over NT)? Also, if the source or destination of the data transfers is eMMC by any chance, can you please change it to something else, for example by connecting a SATA disk or a USB pen drive and using that as the source/destination?

Bibek · June 3, 2019, 5:46am

I suspect SW is somehow corrupting the kernel data structures.
Can you run the test with KASAN enabled?

kalpana_sethi · June 3, 2019, 5:57am

Hi BBasu,

If I enable the two flags below in the config file, is that enough, or do I have to enable anything else along with them?

CONFIG_HAVE_ARCH_KASAN=y
CONFIG_KASAN=y

Bibek · June 3, 2019, 6:40am

Just these two
CONFIG_KASAN=y
CONFIG_FRAME_WARN=0

vidyas · June 11, 2019, 6:46am

Do you have any update after using KASAN?