Kernel oops when running stress-ng fpunch test

We are trying to isolate sporadic problems with the NVMe drive in our Orin NX system. For the most part, the drives work well enough, but a few have ended up with severe filesystem errors that require a manual fsck.

Today, I started playing around with different stress tests in an attempt to isolate the problems. If I run stress-ng --fpunch 0, the R35.6.0 kernel (5.10.216) will crash almost instantly, giving tracebacks that look like the following:

[88432.538795] WARNING: CPU: 5 PID: 56020 at /fs/ext4/inode.c:3335 ext4_journalled_invalidatepage+0x4c/0x60
[88432.539111] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink xt_addrtype br_netfilter ip6table_nat ip6table_filter ip6_tables iptable_nat nf_nat iptable_filter ip_tables x_tables leds_max20096 userspace_alert tegra_bpmp_thermal spi_tegra114 r8168 nv_imx264 max96793 max96792 lifmd_lvds2mipi_1 imx283 framos_common pwm_fan nvgpu nvmap leds_aquablue_uv ina3221 i2c_mux_aquablue gpio_aquablue nfnetlink
[88432.540221] CPU: 5 PID: 56020 Comm: stress-ng-fpunc Not tainted 5.10.216-l4t-35.6.0+g2978e57c0197 #1
[88432.540468] Hardware name: NVIDIA Aquabyte Boutan (Orin NX on Tokyo)/Jetson, BIOS v35.6.0 09/17/2024
[88432.540720] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[88432.541148] pc : ext4_journalled_invalidatepage+0x4c/0x60
[88432.541960] lr : ext4_journalled_invalidatepage+0x38/0x60
[88432.542780] sp : ffff800017cc3980
[88432.543281] x29: ffff800017cc3980 x28: 00000000004002ff 
[88432.544102] x27: 0000000000401000 x26: 0000000000000400 
[88432.547415] x25: ffff13225df7f478 x24: 0000000000001000 
[88432.552929] x23: 0000000000000401 x22: 0000000000000000 
[88432.558443] x21: 0000000000000c00 x20: 0000000000000400 
[88432.563954] x19: fffffe4c89371d80 x18: 0000000000000000 
[88432.569379] x17: 0000000000000000 x16: 0000000000000000 
[88432.574892] x15: 0000000000000000 x14: 0000000000000000 
[88432.580403] x13: 0000000000000000 x12: 0000000000000002 
[88432.585916] x11: ffff1322dc61c248 x10: ffffffffffffffc0 
[88432.591430] x9 : 0000000000000000 x8 : ffff132255c77000 
[88432.596853] x7 : 0000000000000000 x6 : 000000000000003f 
[88432.602279] x5 : 0000000000000040 x4 : 0000000000000000 
[88432.607703] x3 : ffff132240e0c9c0 x2 : 0000000000000000 
[88432.613041] x1 : 0000000000000000 x0 : 00000000fffffff0 
[88432.618379] Call trace:
[88432.620832]  ext4_journalled_invalidatepage+0x4c/0x60
[88432.625906]  truncate_inode_pages_range+0x654/0x6c0
[88432.630716]  truncate_pagecache_range+0x5c/0xa0
[88432.635268]  ext4_punch_hole+0x44c/0x4e0
[88432.639206]  ext4_fallocate+0x300/0x1044
[88432.642969]  vfs_fallocate+0x110/0x260
[88432.646732]  ovl_fallocate+0x12c/0x190
[88432.650668]  vfs_fallocate+0x110/0x260
[88432.654429]  ksys_fallocate+0x5c/0xa4
[88432.657931]  __arm64_sys_fallocate+0x2c/0x3c
[88432.662219]  el0_svc_common.constprop.0+0x80/0x1c0
[88432.667030]  do_el0_svc+0x38/0xb0
[88432.670445]  el0_svc+0xc/0x1c
[88432.673417]  el0_sync_handler+0x100/0x10c
[88432.677617]  el0_sync+0x16c/0x180
[88432.680856] ---[ end trace 5bea412ec3e3018a ]---
[88432.686924] ------------[ cut here ]------------

Our board then reboots shortly thereafter. All of the other stress-ng --class filesystem tests appear to run without issues. This may not be the source of our filesystem corruption issues, but it does deserve further investigation.

Any suggestions for further testing here? Can others reproduce this issue on different hardware, to eliminate the possibility that there’s something wrong with our kit?

Can this issue be reproduced on the devkit too?

I do not have a dev kit available for testing, only our custom board. That said, I am frustrated that “does it work on the devkit” is the standard response on this forum. These modules invariably end up in custom carrier boards, and Nvidia needs to provide support when we find potential bugs in your modules and software.

But to give you the benefit of the doubt: can you justify that canned response in this case, by providing a clear and detailed explanation about how our carrier board could be causing a problem with only this one filesystem stress test, when the other forty-eight tests pass without any issue? I sure can’t!

From where I’m sitting, this obviously looks to be a bug in the Nvidia-provided kernel, and it should manifest independently of any carrier board. In all likelihood, the problem manifests due to a race condition, because I have not been able to make this stress-ng test trigger it when using only one process. Command line invocations of fallocate -p work as expected.
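For anyone who wants to try this without stress-ng, here is a minimal sketch of the pattern I suspect matters: several processes writing to and punching holes in the same file at once. It is not how stress-ng implements its fpunch stressor. The file size, chunk size, process count, and the names `hammer` and `run` are all hypothetical choices for illustration, and the ctypes binding assumes a 64-bit Linux system where `off_t` is 64 bits. Be warned that on an affected kernel this may take the board down the same way the stress-ng test does.

```python
import ctypes
import ctypes.util
import os
import random

# Mode flags from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

FILE_SIZE = 4 * 1024 * 1024  # hypothetical test-file size
CHUNK = 4096                 # hypothetical size of each write and hole


def load_fallocate():
    """Bind fallocate(2) from libc. Assumes 64-bit Linux (64-bit off_t)."""
    libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6", use_errno=True)
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64]
    libc.fallocate.restype = ctypes.c_int
    return libc.fallocate


def hammer(path, iterations, seed):
    """Write a chunk at a random offset, then punch it out. Returns the failure count."""
    fallocate = load_fallocate()
    rng = random.Random(seed)
    failures = 0
    fd = os.open(path, os.O_RDWR)
    try:
        for _ in range(iterations):
            # Offsets are deliberately not page-aligned.
            offset = rng.randrange(0, FILE_SIZE - CHUNK)
            os.pwrite(fd, b"\xaa" * CHUNK, offset)
            mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
            if fallocate(fd, mode, offset, CHUNK) != 0:
                failures += 1
    finally:
        os.close(fd)
    return failures


def run(path, processes=4, iterations=200):
    """Fork `processes` children that all punch holes in the same file.

    Returns one exit code per child: 0 = every punch succeeded,
    1 = at least one fallocate call failed, 2 = unexpected error.
    """
    with open(path, "wb") as f:
        f.truncate(FILE_SIZE)
    pids = []
    for i in range(processes):
        pid = os.fork()
        if pid == 0:
            code = 2
            try:
                code = 1 if hammer(path, iterations, seed=i) else 0
            finally:
                os._exit(code)  # never return into the parent's code path
        pids.append(pid)
    return [os.WEXITSTATUS(os.waitpid(pid, 0)[1]) for pid in pids]
```

Calling `run()` with a path on the filesystem under test and `processes=1` versus a higher count would be one way to check the single-process observation above. I have not confirmed that this sketch triggers the warning.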

Altogether, this situation leads me to suspect that Nvidia does not bother to run stress tests on these systems. If that is true, my confidence in your products would be severely undermined.

Hi,

It is just SOP to ask whether you have a devkit to test on, as there are many kinds of users here.

This issue has not been reported before, so we need to check it locally. There won’t be a detailed explanation until we figure out what is going on.

$ stress-ng --fpunch 0
stress-ng: unrecognized option '--fpunch'

It looks like the default stress-ng does not have this option. Which version of stress-ng are you using?

$ stress-ng --version
stress-ng, version 0.17.05 (gcc 13.3.0, aarch64 Linux 5.10.216-l4t-35.6.0+ge740e62c52a2)

This does not seem to be an easy one to reproduce. We have run it many times, but the crash has not happened.
