Kernel oops when running stress-ng fpunch test

zach-aquabyte · October 29, 2024, 5:49pm

We are trying to isolate sporadic problems with the NVMe drive in our Orin NX system. For the most part, the drives are working well enough, but a few have ended up with severe filesystem errors that require a manual fsck.

Today, I started playing around with different stress tests in an attempt to isolate the problems. If I run stress-ng --fpunch 0, the R35.6.0 kernel (5.10.216) will crash almost instantly, giving tracebacks that look like the following:

[88432.538795] WARNING: CPU: 5 PID: 56020 at /fs/ext4/inode.c:3335 ext4_journalled_invalidatepage+0x4c/0x60
[88432.539111] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink xt_addrtype br_netfilter ip6table_nat ip6table_filter ip6_tables iptable_nat nf_nat iptable_filter ip_tables x_tables leds_max20096 userspace_alert tegra_bpmp_thermal spi_tegra114 r8168 nv_imx264 max96793 max96792 lifmd_lvds2mipi_1 imx283 framos_common pwm_fan nvgpu nvmap leds_aquablue_uv ina3221 i2c_mux_aquablue gpio_aquablue nfnetlink
[88432.540221] CPU: 5 PID: 56020 Comm: stress-ng-fpunc Not tainted 5.10.216-l4t-35.6.0+g2978e57c0197 #1
[88432.540468] Hardware name: NVIDIA Aquabyte Boutan (Orin NX on Tokyo)/Jetson, BIOS v35.6.0 09/17/2024
[88432.540720] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[88432.541148] pc : ext4_journalled_invalidatepage+0x4c/0x60
[88432.541960] lr : ext4_journalled_invalidatepage+0x38/0x60
[88432.542780] sp : ffff800017cc3980
[88432.543281] x29: ffff800017cc3980 x28: 00000000004002ff 
[88432.544102] x27: 0000000000401000 x26: 0000000000000400 
[88432.547415] x25: ffff13225df7f478 x24: 0000000000001000 
[88432.552929] x23: 0000000000000401 x22: 0000000000000000 
[88432.558443] x21: 0000000000000c00 x20: 0000000000000400 
[88432.563954] x19: fffffe4c89371d80 x18: 0000000000000000 
[88432.569379] x17: 0000000000000000 x16: 0000000000000000 
[88432.574892] x15: 0000000000000000 x14: 0000000000000000 
[88432.580403] x13: 0000000000000000 x12: 0000000000000002 
[88432.585916] x11: ffff1322dc61c248 x10: ffffffffffffffc0 
[88432.591430] x9 : 0000000000000000 x8 : ffff132255c77000 
[88432.596853] x7 : 0000000000000000 x6 : 000000000000003f 
[88432.602279] x5 : 0000000000000040 x4 : 0000000000000000 
[88432.607703] x3 : ffff132240e0c9c0 x2 : 0000000000000000 
[88432.613041] x1 : 0000000000000000 x0 : 00000000fffffff0 
[88432.618379] Call trace:
[88432.620832]  ext4_journalled_invalidatepage+0x4c/0x60
[88432.625906]  truncate_inode_pages_range+0x654/0x6c0
[88432.630716]  truncate_pagecache_range+0x5c/0xa0
[88432.635268]  ext4_punch_hole+0x44c/0x4e0
[88432.639206]  ext4_fallocate+0x300/0x1044
[88432.642969]  vfs_fallocate+0x110/0x260
[88432.646732]  ovl_fallocate+0x12c/0x190
[88432.650668]  vfs_fallocate+0x110/0x260
[88432.654429]  ksys_fallocate+0x5c/0xa4
[88432.657931]  __arm64_sys_fallocate+0x2c/0x3c
[88432.662219]  el0_svc_common.constprop.0+0x80/0x1c0
[88432.667030]  do_el0_svc+0x38/0xb0
[88432.670445]  el0_svc+0xc/0x1c
[88432.673417]  el0_sync_handler+0x100/0x10c
[88432.677617]  el0_sync+0x16c/0x180
[88432.680856] ---[ end trace 5bea412ec3e3018a ]---
[88432.686924] ------------[ cut here ]------------

Our board then reboots shortly thereafter. All of the other stress-ng --class filesystem tests appear to run without issues. This may not be the source of our filesystem corruption issues, but it does deserve further investigation.
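Two frames in the trace above hint at the storage stack involved: ovl_fallocate means the call went through overlayfs, and ext4_journalled_invalidatepage is the ext4 handler used for files in journalled-data mode. As a rough sketch (the function names are made up for illustration and do not come from any existing tool), the following Python lists the overlay and ext4 mounts found in /proc/mounts-style text and notes whether data=journal appears in the options. Absence of the option is not conclusive, since journalled data can also be enabled through superblock defaults or per file.

```python
def parse_mounts(text):
    """Parse /proc/mounts-style text into (mountpoint, fstype, options) tuples."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, fstype, options = fields[:4]
        mounts.append((mountpoint, fstype, options.split(",")))
    return mounts


def interesting_mounts(mounts):
    """Keep overlay and ext4 mounts.

    The third element is None for overlay mounts, and for ext4 mounts it is
    True only when data=journal is listed explicitly in the mount options.
    """
    result = []
    for mountpoint, fstype, options in mounts:
        if fstype == "overlay":
            result.append((mountpoint, fstype, None))
        elif fstype == "ext4":
            result.append((mountpoint, fstype, "data=journal" in options))
    return result


def report(path="/proc/mounts"):
    """Print the overlay/ext4 mounts of the running system (Linux only)."""
    with open(path) as f:
        for row in interesting_mounts(parse_mounts(f.read())):
            print(row)
```

Calling report() on the affected board would show which of these mounts sit underneath the stress-ng working directory.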

Any suggestions for further testing here? Can others reproduce this issue on different hardware, to eliminate the possibility that there’s something wrong with our kit?

WayneWWW · October 30, 2024, 3:14am

Can this issue be reproduced on the devkit too?

zach-aquabyte · October 30, 2024, 6:24pm

I do not have a dev kit available for testing, only our custom board. That said, I am frustrated that “does it work on the devkit” is the standard response on this forum. These modules invariably end up in custom carrier boards, and Nvidia needs to provide support when we find potential bugs in your modules and software.

But to give you the benefit of the doubt: can you justify that canned response in this case, by providing a clear and detailed explanation about how our carrier board could be causing a problem with only this one filesystem stress test, when the other forty-eight tests pass without any issue? I sure can’t!

From where I’m sitting, this obviously looks to be a bug in the Nvidia-provided kernel, and it should manifest independently of any carrier board. In all likelihood, the problem manifests due to a race condition, because I have not been able to make this stress-ng test trigger it when using only one process. Command line invocations of fallocate -p work as expected.
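To chase the suspected race outside of stress-ng, a minimal sketch of a multi-process hole puncher might look like the following. It is not the stress-ng fpunch implementation; the helper names (punch_hole, worker, run) and all sizes and counts are arbitrary choices for illustration. It assumes 64-bit Linux and calls fallocate(2) through ctypes with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, the same mode that fallocate -p uses.

```python
import ctypes
import ctypes.util
import multiprocessing
import os
import random
import tempfile

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

# Assumption: 64-bit Linux, where off_t is a 64-bit integer.
_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def punch_hole(fd, offset, length):
    """Deallocate [offset, offset + length) without changing the file size."""
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if _libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def worker(path, file_size, iterations, seed):
    """Repeatedly write a small random range, then punch it out again."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDWR)
    try:
        for _ in range(iterations):
            # Offsets and lengths are deliberately not page-aligned.
            offset = rng.randrange(0, file_size - 1)
            length = rng.randrange(1, min(8192, file_size - offset) + 1)
            os.pwrite(fd, b"\xaa" * length, offset)
            punch_hole(fd, offset, length)
    finally:
        os.close(fd)


def run(directory, processes=4, file_size=1 << 20, iterations=1000):
    """Run several workers against one shared file; return their exit codes."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.ftruncate(fd, file_size)
        os.close(fd)
        # "fork" is requested explicitly so the sketch needs no __main__ guard.
        ctx = multiprocessing.get_context("fork")
        procs = [ctx.Process(target=worker,
                             args=(path, file_size, iterations, seed))
                 for seed in range(processes)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return [p.exitcode for p in procs]
    finally:
        os.unlink(path)
```

Calling run() with a directory on the NVMe filesystem and processes=1, then with a larger process count, would show whether concurrency alone makes the difference. On an affected system this may trigger the same warning and reboot, so it should only be run where a crash is acceptable.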

Altogether, this situation leads me to suspect that Nvidia does not bother to run stress tests on these systems. If that is true, my confidence in your products would be severely undermined.

WayneWWW · November 1, 2024, 6:35am

Hi,

Asking whether you have a devkit to test on is just our SOP, since there are many kinds of users here.

This issue has not been reported before, so we need to check it locally. There won’t be a detailed explanation until we figure out what is going on.

WayneWWW · November 6, 2024, 2:42am

$ stress-ng --fpunch 0
stress-ng: unrecognized option '--fpunch'

It looks like the default stress-ng does not have this option. Which version of stress-ng are you using?

zach-aquabyte · November 6, 2024, 6:40pm

$ stress-ng --version
stress-ng, version 0.17.05 (gcc 13.3.0, aarch64 Linux 5.10.216-l4t-35.6.0+ge740e62c52a2)

WayneWWW · November 8, 2024, 3:01am

This does not seem to be an easy one to reproduce. We have run it many times, but the crash didn’t happen.

system · December 18, 2024, 12:45am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
