We are trying to isolate sporadic problems with the NVMe drives in our Orin NX systems. For the most part, the drives work well enough, but we have had a few end up with severe filesystem errors that required a manual fsck.
Today I started experimenting with different stress tests in an attempt to isolate the problem. If I run `stress-ng --fpunch 0`, the R35.6.0 kernel (5.10.216) crashes almost instantly, producing tracebacks like the following:
[88432.538795] WARNING: CPU: 5 PID: 56020 at fs/ext4/inode.c:3335 ext4_journalled_invalidatepage+0x4c/0x60
[88432.539111] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink xt_addrtype br_netfilter ip6table_nat ip6table_filter ip6_tables iptable_nat nf_nat iptable_filter ip_tables x_tables leds_max20096 userspace_alert tegra_bpmp_thermal spi_tegra114 r8168 nv_imx264 max96793 max96792 lifmd_lvds2mipi_1 imx283 framos_common pwm_fan nvgpu nvmap leds_aquablue_uv ina3221 i2c_mux_aquablue gpio_aquablue nfnetlink
[88432.540221] CPU: 5 PID: 56020 Comm: stress-ng-fpunc Not tainted 5.10.216-l4t-35.6.0+g2978e57c0197 #1
[88432.540468] Hardware name: NVIDIA Aquabyte Boutan (Orin NX on Tokyo)/Jetson, BIOS v35.6.0 09/17/2024
[88432.540720] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[88432.541148] pc : ext4_journalled_invalidatepage+0x4c/0x60
[88432.541960] lr : ext4_journalled_invalidatepage+0x38/0x60
[88432.542780] sp : ffff800017cc3980
[88432.543281] x29: ffff800017cc3980 x28: 00000000004002ff
[88432.544102] x27: 0000000000401000 x26: 0000000000000400
[88432.547415] x25: ffff13225df7f478 x24: 0000000000001000
[88432.552929] x23: 0000000000000401 x22: 0000000000000000
[88432.558443] x21: 0000000000000c00 x20: 0000000000000400
[88432.563954] x19: fffffe4c89371d80 x18: 0000000000000000
[88432.569379] x17: 0000000000000000 x16: 0000000000000000
[88432.574892] x15: 0000000000000000 x14: 0000000000000000
[88432.580403] x13: 0000000000000000 x12: 0000000000000002
[88432.585916] x11: ffff1322dc61c248 x10: ffffffffffffffc0
[88432.591430] x9 : 0000000000000000 x8 : ffff132255c77000
[88432.596853] x7 : 0000000000000000 x6 : 000000000000003f
[88432.602279] x5 : 0000000000000040 x4 : 0000000000000000
[88432.607703] x3 : ffff132240e0c9c0 x2 : 0000000000000000
[88432.613041] x1 : 0000000000000000 x0 : 00000000fffffff0
[88432.618379] Call trace:
[88432.620832] ext4_journalled_invalidatepage+0x4c/0x60
[88432.625906] truncate_inode_pages_range+0x654/0x6c0
[88432.630716] truncate_pagecache_range+0x5c/0xa0
[88432.635268] ext4_punch_hole+0x44c/0x4e0
[88432.639206] ext4_fallocate+0x300/0x1044
[88432.642969] vfs_fallocate+0x110/0x260
[88432.646732] ovl_fallocate+0x12c/0x190
[88432.650668] vfs_fallocate+0x110/0x260
[88432.654429] ksys_fallocate+0x5c/0xa4
[88432.657931] __arm64_sys_fallocate+0x2c/0x3c
[88432.662219] el0_svc_common.constprop.0+0x80/0x1c0
[88432.667030] do_el0_svc+0x38/0xb0
[88432.670445] el0_svc+0xc/0x1c
[88432.673417] el0_sync_handler+0x100/0x10c
[88432.677617] el0_sync+0x16c/0x180
[88432.680856] ---[ end trace 5bea412ec3e3018a ]---
[88432.686924] ------------[ cut here ]------------
Our board then reboots shortly thereafter. All of the other `stress-ng --class filesystem` tests appear to run without issues. This may not be the source of our filesystem corruption, but it seems worth chasing further. One detail that stands out: the call trace runs through ovl_fallocate and ext4_journalled_invalidatepage, so the file being punched appears to live on an overlayfs backed by ext4 with data journalling in effect, which may narrow the search. A minimal reproducer sketch follows.
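To take stress-ng out of the picture, it may be worth trying a single punch-hole fallocate() call, since that is the syscall at the bottom of the trace. Below is a minimal sketch; the path /mnt/test/punch-test is an assumption, so point it first at the overlay mount and then directly at the underlying ext4 mount to see which layer is implicated:

```c
/* Minimal punch-hole sketch -- tries to hit the same ext4_punch_hole
 * path that stress-ng --fpunch exercises.
 * Build: gcc -O2 -o punch punch.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/test/punch-test"; /* hypothetical path: use your NVMe mount */
    char buf[4096];
    memset(buf, 0xa5, sizeof(buf));

    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Write 16 pages so there is real data to punch out. */
    for (int i = 0; i < 16; i++) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("write");
            return 1;
        }
    }
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    /* Punch an 8-page hole in the middle of the file;
     * PUNCH_HOLE must be combined with KEEP_SIZE. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  4096, 8 * 4096) < 0) {
        perror("fallocate");
        return 1;
    }

    close(fd);
    unlink(path);
    return 0;
}
```

If this bare fallocate() call reproduces the warning on other boards, that would point at the ext4/overlayfs code in the 5.10.216 L4T kernel rather than at our particular hardware.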
Any suggestions for further testing here? Can others reproduce this issue on different hardware, to eliminate the possibility that there’s something wrong with our kit?