Tegra TX2 kernel crash

Hi,

We are running a TX2 with PCIe transfers overnight, and after about 10 hours we see a kernel crash, pasted below.
1) Is this a known issue when MMC is enabled but not in use?
2) Do we need to disable it?
3) What happens if we keep it enabled?

[ 306.146722] EXT4-fs (mmcblk0p1): error count since last fsck: 2
[ 306.152718] EXT4-fs (mmcblk0p1): initial error at time 1557581302: ext4_journal_check_start:56
[ 306.161383] EXT4-fs (mmcblk0p1): last error at time 1557581302: ext4_journal_check_start:56
[31585.232190] ------------[ cut here ]------------
[31585.236810] Kernel BUG at ffffffc00032ce70 [verbose debug info unavailable]
[31585.243761] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[31585.249238] Modules linked in: shikra_ntb_client(O) shikra_ntb(O) fuse ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack nf_nat br_netfilter overlay pci_tegra bluedroid_pm
[31585.270074] CPU: 3 PID: 235 Comm: mmcqd/2 Tainted: G O 4.4.38-shikra-3.00-00 #1
[31585.278497] Hardware name: quill (DT)
[31585.282152] task: ffffffc1ea00f080 ti: ffffffc1ea1f4000 task.ti: ffffffc1ea1f4000
[31585.289627] PC is at blk_rq_map_sg+0x2b8/0x3f8
[31585.294062] LR is at blk_rq_map_sg+0x1bc/0x3f8
[31585.298497] pc : [] lr : [] pstate: 00000045
[31585.305879] sp : ffffffc1ea1f7b90
[31585.309199] x29: ffffffc1ea1f7b90 x28: ffffffc1b03b6d00
[31585.314535] x27: 00000000000000f3 x26: 0000000000001000
[31585.319854] x25: ffffffc1eb62ce60 x24: 0000000000000073
[31585.325176] x23: ffffffc1ea1f0a10 x22: ffffffc1ea1e8000
[31585.330496] x21: 00000000000000f3 x20: 0000000000000000
[31585.335818] x19: 000000000000d000 x18: 000000000eb4616b
[31585.341140] x17: 000000000003e9b0 x16: 0000000000000000
[31585.346463] x15: 000000000034b336 x14: 0000000000000400
[31585.351791] x13: 00000000002f0236 x12: 0000000000000000
[31585.357119] x11: 00000001736c2000 x10: ffffffbdc57809c1
[31585.362442] x9 : 0000000000000000 x8 : ffffffc070242cf8
[31585.367761] x7 : 0000000000000000 x6 : 0000000000000000
[31585.373086] x5 : 00000001736c1000 x4 : 00000000001736c1
[31585.378428] x3 : 0000000000000000 x2 : ffffffc1eb62c000
[31585.383748] x1 : ffffffbdc5f03cc0 x0 : 0000000000000000

[31585.390557] Process mmcqd/2 (pid: 235, stack limit = 0xffffffc1ea1f4020)
[31585.397243] Call trace:
[31585.399687] [] blk_rq_map_sg+0x2b8/0x3f8
[31585.405168] [] mmc_queue_map_sg+0xdc/0xe8
[31585.410742] [] mmc_blk_rw_rq_prep+0x208/0x3b8
[31585.416660] [] mmc_blk_issue_rw_rq+0x320/0x968
[31585.422661] [] mmc_blk_issue_rq+0x1d8/0x508
[31585.428402] [] mmc_queue_thread+0xcc/0x1a0
[31585.434056] [] kthread+0xdc/0xf0
[31585.438840] [] ret_from_fork+0x10/0x40
[31585.444535] ---[ end trace d45b6c39b7171353 ]---
[31585.450679] ------------[ cut here ]------------
[31585.455292] WARNING: at ffffffc0000a86b0 [verbose debug info unavailable]
[31585.462069] Modules linked in: shikra_ntb_client(O) shikra_ntb(O) fuse ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack nf_nat br_netfilter overlay pci_tegra bluedroid_pm

[31585.484451] CPU: 3 PID: 235 Comm: mmcqd/2 Tainted: G D O 4.4.38-shikra-3.00-00 #1
[31585.492875] Hardware name: quill (DT)
[31585.496535] task: ffffffc1ea00f080 ti: ffffffc1ea1f4000 task.ti: ffffffc1ea1f4000
[31585.504015] PC is at __local_bh_enable_ip+0x68/0xc0
[31585.508892] LR is at _raw_spin_unlock_bh+0x20/0x28
[31585.513677] pc : [] lr : [] pstate: 400003c5
[31585.521061] sp : ffffffc1ea1f7840
[31585.524369] x29: ffffffc1ea1f7840 x28: ffffffc1ea1f4000
[31585.529703] x27: 00000000000000f3 x26: 0000000000001000
[31585.535034] x25: ffffffc1ea00f080 x24: 00000000000003c0
[31585.540364] x23: 0000000000000000 x22: ffffffc001286000
[31585.545696] x21: ffffffc1ea00f080 x20: ffffffc0012a53c0
[31585.551027] x19: ffffffc00148d760 x18: 0000000000000000
[31585.556357] x17: 000000000003e9b0 x16: 0000000000000000
[31585.561689] x15: 0000000000000000 x14: ffffffc079ccde99
[31585.567018] x13: ffffffc079ccde95 x12: 0000000000000000
[31585.572347] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
[31585.577676] x9 : fefefefefefefeff x8 : 7f7f7f7f7f7f7f7f
[31585.583007] x7 : 09432c3838303634 x6 : ffffffc1ea00f8a0
[31585.588339] x5 : 000000000000003c x4 : 0000000000000000
[31585.593666] x3 : 0000000000000000 x2 : 0000000000000000
[31585.598996] x1 : 0000000000000201 x0 : ffffffc001409a6e

[31585.606246] ---[ end trace d45b6c39b7171354 ]---
[31585.610856] Call trace:
[31585.613303] [] __local_bh_enable_ip+0x68/0xc0
[31585.619215] [] _raw_spin_unlock_bh+0x20/0x28
[31585.625040] [] cgroup_exit+0x54/0xe0
[31585.630170] [] do_exit+0x2c4/0x9e0
[31585.635131] [] bug_handler.part.1+0x0/0x80
[31585.640783] [] bug_handler.part.1+0x40/0x80
[31585.646523] [] bug_handler+0x24/0x30
[31585.651654] [] brk_handler+0x88/0xd0
[31585.656784] [] do_debug_exception+0x3c/0xa0
[31585.662522] [] el1_dbg+0x18/0x74
[31585.667309] [] mmc_queue_map_sg+0xdc/0xe8
[31585.672873] [] mmc_blk_rw_rq_prep+0x208/0x3b8
[31585.678783] [] mmc_blk_issue_rw_rq+0x320/0x968
[31585.684779] [] mmc_blk_issue_rq+0x1d8/0x508
[31585.690518] [] mmc_queue_thread+0xcc/0x1a0
[31585.696171] [] kthread+0xdc/0xf0
[31585.700957] [] ret_from_fork+0x10/0x40

The error is from a corrupt filesystem. At some point the system was not correctly shut down; that could be due to a crash, a loss of power, or simply powering off by holding the power button.

Normally a journal replay will keep the filesystem from being corrupt, but this comes at the cost of removing some of the filesystem content. If that content was important, then the crashes you see could be due to this.

The non-rootfs content is purely binary data and won’t have this issue. It is also possible the system will fsck the eMMC ext4 filesystems even if they are not used. If some extreme corruption were encountered, I could see this resulting in a kernel panic, since it is a bad idea to boot a system with corruption. So that leaves two possibilities: either the fsck resulted in a panic, or the content removed via journal replay resulted in a panic. There is actually a third option: the fsck has nothing to do with it, and there is an eMMC hardware error or failure.

An important detail to know is if this was flashed originally to run purely eMMC, and then what modifications were made to get whatever your current boot configuration is. Was the kernel modified? Is any other carrier board used? Was a device tree changed? If the default boot device is no longer eMMC, how was it accomplished?

  1. An important detail to know is if this was flashed originally to run purely eMMC, and then what modifications were made to get whatever your current boot configuration is

Ans: Sorry, there was a misunderstanding; the eMMC is in fact used. We are using the Jetson TX2 module with the 32GB eMMC that ships with the module. The rootfs from the JetPack 3.3 installation sits on this eMMC, and no boot configuration has been modified from what JetPack 3.3 provides.

  1. Was the kernel modified?
    Yes, the L4T 28.2.1 kernel was modified. We added a few patches to support PCIe; I have sent you the patches in a private message.

  2. Is any other carrier board used?
    The Jetson module is connected to one of the ports of the IDT switch.

  3. Was a device tree changed?
    Yes, the device tree was changed (disabled a few unused devices, disabled the SMMU for the PCIe controller, reserved 512MB of RAM).

  4. If the default boot device is no longer eMMC, how was it accomplished?
    The default boot device is still eMMC; it has not been modified.

Someone from NVIDIA will have to answer about the PCIe customization, but at the end of this post is a re-quote of the device tree edits so others can see the device tree part of the patching. What I will add is how to repair the filesystem so you can examine logs and perhaps boot again (and if the error makes the system unbootable again, then you will have the backup to quickly flash and not have to repeat the backup). I am assuming you have put work into creating this filesystem and want to repair it rather than starting from scratch.

With both a filesystem error and a non-filesystem error this will be hard to debug. It isn’t known whether the crash caused the filesystem corruption or whether the corruption is behind the kernel messages. Is it correct to say this worked for some time (you said after 10 hours of copy operations…presumably over PCIe) before it failed? If so, then the reboot issue is likely from the filesystem corruption and not from the bug which caused the corruption. I am going to suggest repairing the filesystem first.

To repair the filesystem you will need to clone the rootfs and then repair the clone on loopback on the host PC. The repaired clone can then be saved as a reference copy and reflashed to the Jetson at any time to restore it. Since you also have a custom device tree, you will need to make sure it is still in place during any flash which restores the repaired rootfs (unless you have already done this and verified that only the rootfs is put back, assume you need your device tree in place during the restore).

If you were to flash only the device tree (such as if you’ve restored the rootfs and found it didn’t put the right device tree in place), then make sure this flash.sh patch is in place:
https://devtalk.nvidia.com/default/topic/1036286/jetson-tx1/flashing-just-dtb-on-28-2-and-tx1/post/5264465/#5264465

For an R28.2.1 clone, these are the steps:

# Put Jetson in recovery mode, connect micro-B USB.
# Verify on the host that the Jetson is seen:
lsusb -d 0955:7c18
sudo ./flash.sh -r -k APP -G backup.img jetson-tx2 mmcblk0p1

Once you have cloned you will have both a non-editable sparse image, “backup.img”, and a full raw image which can be flashed, edited, repaired, and so on. This will be the file “backup.img.raw”. It is this larger raw image we need in order to repair the filesystem, and the smaller sparse image can be discarded (you can restore via either the raw or the sparse image so long as it is named “system.img” and placed in the “Linux_for_Tegra/bootloader/” directory, but the larger raw image takes longer to flash).

Unfortunately backup.img.raw is the same size as the entire partition, so your host is going to be working with a roughly 28GB file (very slow to copy unless you have an SSD), and you may want both an unmodified reference copy and the copy you are working on…once you know the working copy has been successfully repaired, you can get rid of the other copy. You will then reflash the Jetson using the repaired copy. During that flash you will want to make sure your edited device tree is in place, although you can give arguments to flash just the rootfs and not modify the dtb (in theory you wouldn’t need to worry about the device tree if flashing only the rootfs, but I won’t guarantee something won’t go wrong and inadvertently put the original back in place).
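
Before repairing, it can be worth keeping an untouched reference copy and sanity-checking the raw image; the file names below are just examples:

# Keep an untouched reference copy (this needs roughly another 28GB of free space).
cp backup.img.raw backup.img.raw.orig
# Optional sanity check: confirm the image really contains an ext4 filesystem.
file backup.img.raw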

To perform a filesystem repair on backup.img.raw you will first cover it with loopback. All of this must be done as root, and so start by using “sudo -s” to drop into a root shell. Start as follows:

losetup --find --show ./backup.img.raw

Unless you are using other loop devices it will use “/dev/loop0”. I will assume this, but adjust if it is different for you.
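
If you are not sure which loop device was assigned, you can list the active ones (purely a convenience check):

# List active loop devices and the files backing them.
losetup -a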

Now fsck:

fsck.ext4 /dev/loop0

One of three things can happen. The first is that there was no error and fsck will just complete; in your case this won’t happen. The second possibility is that repairs will be made and it is done…this is what I hope for. The third possibility, and one which might occur since the system panicked after the journal replay, is that you’ll have to manually go through various steps. Those steps will remove part of the filesystem and put the lost components in the “lost+found/” subdirectory (this is a subdirectory of the image, not of the host PC). In this latter case you can try to put the clone back in place and boot, but you might find the missing components are an issue. I will assume you can put the image back in place. If you have issues with fsck you can always ask more questions.
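
If fsck claims the filesystem is already clean but you still suspect damage, you can force a full check; the "-f" option is standard e2fsprogs behavior:

# Force a full check even if the filesystem is marked clean.
fsck.ext4 -f /dev/loop0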

So now you can mount and examine the repaired loopback device (still using sudo):

mount /dev/loop0 /mnt
cd /mnt
# Examine or view the filesystem...then remove it from loopback.
ls
cd
umount /dev/loop0
losetup -d /dev/loop0

Now place a copy in the flash location:

cp backup.img.raw /where/ever/it/is/Linux_for_Tegra/bootloader/system.img

Now flash the Jetson to restore the repaired clone. The “-r” option tells flash.sh to reuse the existing “bootloader/system.img” rather than generating a new one:

# Put Jetson in recovery mode, connect micro-B USB.
# Verify on the host that the Jetson is seen:
lsusb -d 0955:7c18
sudo ./flash.sh -r -k APP jetson-tx2 mmcblk0p1

Reboot, see if it finishes boot. If boot works, then reboot it again at least once before starting work…see if the filesystem corrupts again or remains repaired. If repaired, then you are good to go.

I suggest that once it boots again you run it with a serial console during all testing, and set your serial console to log. I like gtkterm, in part due to its logging. If your serial USB cable shows up on the host as “/dev/ttyUSB0”, then gtkterm would be launched like this:

gtkterm -b 8 -t 1 -s 115200 -p /dev/ttyUSB0

Run your copy test again until it fails, and post the last part (or any relevant portion) of the log. Prior to starting, run “sudo lspci -vvv” to get a verbose listing. If gtkterm is logging, the results will be recorded in the log. If you get a failure and the system can still respond on the serial console, then run “sudo lspci -vvv” again, and run “dmesg” to get the dmesg logs. All of this will be recorded in the serial console session if you set up logging.
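
If you also want standalone files in addition to the gtkterm log, something along these lines works (the file names are just examples):

# Before starting the test:
sudo lspci -vvv 2>&1 | tee lspci_before.txt
# After a failure, if the console still responds:
sudo lspci -vvv 2>&1 | tee lspci_after.txt
dmesg | tee dmesg_after.txt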

When you get information to post, you might add what the kernel modifications are, and also whether you have any details on the copy test which might make this more repeatable. Details of what the Jetson is acting as an endpoint for (or vice versa) would help.


Note: If you place logs or details inside the code icon in the upper right (looks like “</>”), then scrollbars are added and formatting is preserved.

If you hover your mouse over the upper-right quote icon of one of your existing posts, other icons will show up. The paper clip icon allows attaching some file types to an existing post, e.g., a log file with a “.txt” extension.


Device Tree Edits (this is from a private message…the dts is truncated there so it is truncated here as well…file attachments allow more content if you want to attach a tree named with “.txt” extension):
…forum doesn’t like my repaste, attaching your dts instead…
dts.txt (51.7 KB)

If our system is booting properly after reset, then I think we do not need to repair the filesystem.
I see this as a runtime crash; after a reboot it works fine.

Hi Kalpana,

Answers inline:

  1. Do we need to disable it?
    ==> Since the root filesystem is on eMMC, you cannot disable it. Even though your test case is not using it directly, it is being used in the background for various kernel work (a quick way to confirm this is shown in the sketch after this list).

  2. What happens if we keep it enabled?
    ==> The crash you have posted is not a known issue. Can you share the complete log?
    I am also inclined to suspect a HW stability issue. Is the power supply you are using stable?
    Is the crash happening with the same signature always?
    Can you run ./jetson_clocks.sh from your home path and then carry on with your test?
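
As an aside, a quick way to confirm that the root filesystem really lives on the eMMC is with standard tools (the devices named below are what you would typically expect on a stock TX2):

# Show which block device backs "/"; findmnt and lsblk are standard util-linux tools.
findmnt /
# On a stock TX2 this normally reports /dev/mmcblk0p1 (the eMMC) as the source.
lsblk /dev/mmcblk0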

thanks
Bibek

The crash you have posted is not a known issue. Can you share the complete log?
Ans => We don’t have the full log. The next time it crashes we will capture a full log; it takes many hours to reproduce.

I am also inclined to suspect a HW stability issue. Is the power supply you are using stable?
Ans => Yes, the power supply is stable.

Is the crash happening with the same signature always?
Ans => Yes, the crash always has the same signature. It has happened 3 times so far, and it takes many hours to reproduce.

Can you run ./jetson_clocks.sh from your home path and then carry on with your test?
Ans => We are doing this already.

  1. Are you using a custom eMMC?
  2. If your eMMC part supports background operations, please try disabling that and check if there is any improvement (a way to verify whether the part advertises BKOPS support at all is sketched after the patch):
diff --git a/drivers/mmc/host/sdhci-tegra.c b/drivers/mmc/host/sdhci-tegra.c
index fdf27fa..91994ba 100644
--- a/drivers/mmc/host/sdhci-tegra.c
+++ b/drivers/mmc/host/sdhci-tegra.c
@@ -962,7 +962,7 @@ static int tegra_sdhci_pltfm_init(struct sdhci_host *host,
 	if (plat->is_8bit)
 		host->mmc->caps |= MMC_CAP_8_BIT_DATA;
 	host->mmc->caps |= MMC_CAP_SDIO_IRQ;
-	host->mmc->caps |= MMC_CAP_BKOPS;
+	//host->mmc->caps |= MMC_CAP_BKOPS;
 
 	host->mmc->pm_caps = MMC_PM_KEEP_POWER | MMC_PM_IGNORE_PM_NOTIFY;
 	if (plat->mmc_data.built_in) {
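
If you want to check whether the eMMC part actually advertises background-operations support before rebuilding the kernel, one convenient option (assuming the mmc-utils package is available; this is only a sketch) is to read the EXT_CSD:

# mmc-utils decodes the EXT_CSD fields, including BKOPS_SUPPORT and BKOPS_EN.
sudo apt-get install mmc-utils
sudo mmc extcsd read /dev/mmcblk0 | grep -i bkops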

We are using the built-in eMMC that is part of the Jetson TX2 module.

We got a crash last night. Please find the attached log:
KernelCrash_after_23_Hours_dmesg_with_time_log.txt (98.8 KB)

Can you tell me which IOs are connected to the target?
And which drivers have you enabled on top of the default kernel?
Also, can you apply the change I mentioned in the earlier comment and run the test?

  1. Can you tell me which IOs are connected to the target?
    We have connected a PCIe IDT switch, an Ethernet switch, and a few other I2C slave devices.

2) And which drivers have you enabled on top of the default kernel?

On top of the default kernel we have added a driver for the IDT switch, to do PCIe transfers through the switch's NTB functionality.
We removed the drivers we are not using. The device tree is attached in the previous conversation.

3) Also, can you apply the change I mentioned in the earlier comment and run the test?

OK, I will, and I will leave it for overnight testing.
However, we have seen in another devtalk thread that the background check happens every 24 hours, whereas our crash log shows the system crashing about 11 hours after booting. Also, the crash log prints "Tainted: G O", which indicates an out-of-tree module, and the eMMC driver is not an out-of-tree module. If we are correct, then I think commenting out the background check will not help.
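
As a side note, the out-of-tree taint can be confirmed directly with standard tools; the following is only an illustrative sketch:

# Bit 12 (value 4096, flag 'O') in the taint mask means an out-of-tree module is loaded.
cat /proc/sys/kernel/tainted
# List loaded modules that do not declare themselves as in-tree.
for m in $(lsmod | awk 'NR>1 {print $1}'); do
  [ "$(modinfo -F intree $m 2>/dev/null)" = "Y" ] || echo "$m (out-of-tree)"
done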

  1. In the L4T 28.2.1 public release for Jetson TX2, in drivers/mmc/host/sdhci-tegra.c, we don’t see the lines you have mentioned at line number 962. Are you referring to the same L4T version?

In dmesg we see that it is an HS400 MMC card of 32GB, and the MAN_BKOPS_EN bit is not set:

mmc0: SDHCI controller on 3400000.sdhci [3400000.sdhci] using ADMA 64-bit with 64 bit addr
[    4.719023] mmc1: SDHCI controller on 3440000.sdhci [3440000.sdhci] using ADMA 64-bit with 64 bit addr
[    4.763001] mmc2: SDHCI controller on 3460000.sdhci [3460000.sdhci] using ADMA 64-bit with 64 bit addr
[    4.763243] Waiting for root device /dev/mmcblk0p1...
[    4.937019] mmc2: MAN_BKOPS_EN bit is not set
[    4.949233] mmc2: Skipping tuning since strobe enabled
[    4.961369] mmc2: periodic cache flush enabled
[    4.965822] mmc2: new HS400 MMC card at address 0001
[    4.971124] mmcblk0: mmc2:0001 032G34 29.1 GiB 
[    4.975891] mmcblk0boot0: mmc2:0001 032G34 partition 1 4.00 MiB
[    4.982015] mmcblk0boot1: mmc2:0001 032G34 partition 2 4.00 MiB
[    4.988121] mmcblk0rpmb: mmc2:0001 032G34 partition 3 4.00 MiB
[    4.996280]  mmcblk0: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13 p14 p15 p16 p17 p18 p19 p20 p21 p22 p23 p24 p25 p26 p27 p28 p29
[    5.198666] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode. Opts: (null)
[    5.810891] EXT4-fs (mmcblk0p1): re-mounted. Opts: (null)

I see the following errors in the log, and they seem to be occurring during boot itself. Can you please confirm that?
If yes, do you get these errors even if you disable the NTB driver for the switch?

tegra-pcie 10003000.pcie-controller: PCIE: Response decoding error, signature: 10018835

I see that this error appears only after we load the NTB driver. Currently we load the module using “insmod”, as we are in the development phase.

What about the timing of the error? I mean, does it come as soon as the driver is insmod’ed, or just before the crash?

It happens immediately after we load the driver, during initialization of the switch only. Just before the crash I see no such errors.

I think we need to understand why those errors are coming. Can you please root-cause which code in the driver is causing these errors? Is the switch driver available in the released kernel code, or is it a proprietary driver?
Also, what is the version of the release being used here?

We are using our proprietary driver, on L4T 28.2.1.

We made a few changes to disable the SMMU for PCIe and a few other devices not in use. See dts.txt in comment #4.

I will debug, find out which piece of code is causing this issue, and let you know.
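
One possible way to narrow this down (assuming CONFIG_DYNAMIC_DEBUG is enabled in the kernel) is to turn on the Tegra PCIe host driver's debug prints at runtime, for example:

# Make sure debugfs is mounted, then enable dev_dbg()/pr_debug() output in pci-tegra.c.
sudo mount -t debugfs none /sys/kernel/debug 2>/dev/null
echo 'file pci-tegra.c +p' | sudo tee /sys/kernel/debug/dynamic_debug/control
# Reproduce the "Response decoding error" and check dmesg for the surrounding context.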

Meanwhile, in your code base, can you please check whether the pci-tegra.c file has the following code?

1552         /* Finally enable PCIe */
1553         val = afi_readl(pcie, AFI_CONFIGURATION);
1554         val |=  (AFI_CONFIGURATION_EN_FPCI |
1555                         AFI_CONFIGURATION_CLKEN_OVERRIDE);
1556         afi_writel(pcie, val, AFI_CONFIGURATION);

Particularly the “AFI_CONFIGURATION_CLKEN_OVERRIDE” part.
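
A quick way to check is to grep the driver source; the path below is where pci-tegra.c typically lives in the L4T R28.x kernel tree, so adjust it if your tree differs:

# Run from the top of the kernel source tree.
grep -n "AFI_CONFIGURATION_CLKEN_OVERRIDE" drivers/pci/host/pci-tegra.c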