Kernel crashes on specific Orin NX modules

Hello!

We have built a couple of devices based on Jetson Orin NX 8GB and a custom carrier board.
We use BSP v36.4.0 with Linux Kernel 6.6.23.

Unfortunately, on specific Jetson modules we observe system freezes caused by kernel crashes which we cannot explain.
Here are some observations we’ve collected so far. I would be glad if this turned out to be a known issue and you could point us to a solution or a workaround.

  • The kernel crash message is always “Unable to handle kernel paging request at virtual address” and “address between user and kernel address ranges”
    See attached snippet for an example.
    The errors occur at seemingly random time points, usually around 1-3 minutes into Linux uptime, and the message contains PIDs of seemingly random userspace programs.
    crash_message.txt (5.0 KB)

  • The sporadic crashes occur reproducibly on some Jetson modules (after a random number of reboots), and not at all on others.
    We have also confirmed the dependency on the specific modules by cross-exchanging the modules and carriers in scenarios like this:
    Jetson “A”, carrier “A” → crashes occur
    Jetson “B”, carrier “B” → No crashes
    Jetson “A”, carrier “B” → crashes occur
    Jetson “B”, carrier “A” → No crashes

    We have lists of serial numbers for at least 4 Jetson modules that are prone to crashes, and another 4 that run without such issues. I gave those lists to our Nvidia contact person who suggested that I write this forum post.

  • We see a correlation between the crashes and the silkscreen printing on the Jetson PCBs. Maybe the correlation is just a coincidence, but the modules that are prone to crashes all have an “A13” printed on them (e.g. “180-13767-DAAA-A13”), and all of those with an “A23” (e.g. “180-13767-DAAA-A23”) run without issues. However, there are also modules labelled “A13” that do not show the crashing issue.

  • We have seen PCN210361 announcing the change of DRAM in the BOM. According to that document, BSP 36.1 from JetPack 6.0 should be sufficient for compatibility with all board versions, and we use JetPack 6.1 with BSP v36.4.0.

  • We have been using modules of both marking variants described in PCN210341: those without the DRAM part encoded in the 2D barcode, and those with this information. In our observations, only modules that did not have the DRAM information in the barcode were affected.

Thanks in advance for any hint that could lead us to a solution!

1 Like

For R36.3.0, the official kernel version is 5.15.120, isn’t it?

Yes, and for R36.4.0 it is 5.15.148.
We made use of the “Bring your own kernel” option though and applied the necessary patches, in accordance with this reference: Bring Your Own Kernel — NVIDIA Jetson Linux Developer Guide 1 documentation.

2 Likes

Just chiming in here to say I’ve got the exact same problem (some boots are ok, some are randomly crashy, always starting with allocation fails downstream of nvgpu), and also on an Orin NX 8GB module with both kernels 6.6 and 6.12 on top of 36.4.0. I don’t see this with an Orin AGX 64GB or with NVIDIA’s 5.15 kernel.

I’ve got two NX modules on my desk: one is an A13 (the one that crashes), the other is an A23, which is a 16 GB module. I’m set up only for the 8 GB module SKUs, so I haven’t been able to test the 16 GB module yet.

1 Like

Update: I also wasn’t able to reproduce this issue with the A23 (16 GB) Orin module. It’s only a problem on the 8 GB Orin module.

1 Like

Hello!

We have some more insights now, which I’d like to share.

  • The crashes also occur on some Orin 8GB modules with an “A23” in the silkscreen label.
    Still, the crashes seem to occur only on a subset of the modules.

  • I have built a kernel 6.6 with some memory-related debugging options, including the Kernel Address Sanitizer (KASAN).
    Now, during each startup on an affected Orin module, a bug report is printed by KASAN; see the attached kasan_bugreport.txt
    kasan_bugreport.txt (4.8 KB)

This bug report is not printed on an Orin module that is not prone to the kernel crashes reported in the original post.

[   17.210205] BUG: KASAN: slab-out-of-bounds in nvgpu_gr_config_init+0x1468/0x1df8 [nvgpu]
[   17.211220] Write of size 8 at addr ffff000090c5bf40 by task nvpmodel/934
[   17.211414] 
[   17.211462] CPU: 0 PID: 934 Comm: nvpmodel Tainted: G           O       6.6.23-<our_kernel_branch> #1
[   17.212944] Hardware name: <Our machine> with Orin NX 8 GB/Jetson, BIOS v36.4.0 10/01/2024
[   17.217289] Call trace:
...
[   17.241437]  nvgpu_gr_config_init+0x1468/0x1df8 [nvgpu]
[   17.246687]  gr_init_support_impl+0x120/0x6a8 [nvgpu]
[   17.251850]  nvgpu_gr_init_support+0x148/0x348 [nvgpu]
...
[   17.489325] The buggy address belongs to the object at ffff000090c5bf38
[   17.489325]  which belongs to the cache kmalloc-8 of size 8
[   17.501050] The buggy address is located 0 bytes to the right of
[   17.501050]  allocated 8-byte region [ffff000090c5bf38, ffff000090c5bf40)
...

I have tried to track the origin of the reported issue, and it looks like it occurs in this line
https://nv-tegra.nvidia.com/r/gitweb?p=linux-nvgpu.git;a=blob;f=drivers/gpu/nvgpu/common/gr/gr_config.c;h=c86bba610c6b1aa57da81824a0128a104e3cd0ae;hb=8a0a5345705e069e398a79dbcba96c5db54a37f1#l699
(apparently same code in BSP 36.4.0 and 36.4.4)

There, array index “1” is write-accessed, but with debug logging I found that the array was allocated to hold only one element about 10 lines above, so this is apparently an out-of-bounds array access.

I can imagine that if the nvgpu driver writes data where it is not supposed to, heap corruption occurs, which could theoretically cause the crashes reported above.

I wonder why this only happens on some specific Orin modules, and not on others.


Edit 2025-04-15: Fixed the link to the OOT module source code.

2 Likes

Great find. I patched the allocation of gpc_tpc_physical_id_map to use the size gpc_count + 1, which fixes the issue in question on the NX 8GB module and works on other systems as well.

On the NX 8GB, we observe that where gpu_instance_id/gpc_local_id is 0, the physical_id is 1, while gpc_count is only 1. On the AGX 64GB, instance/local 0/0 maps to physical 0, and 0/1 maps to 1.

So it’s unclear precisely what range the physical ids are expected to fall in, but the assumption that all physical ids stay below gpc_count doesn’t seem to be correct. On the AGX 64GB, gpc_count is 2 and the ids are 0 and 1, so the assumption holds there, which explains why we didn’t see the crash; that is not the case on the NX 8GB. I haven’t been able to check the NX 16GB module.

2 Likes

We have also tried running a Linux 5.15 system on the same NX 8GB hardware, with the custom carrier.

The crashes indeed do not appear in this scenario. Did I understand your first posting right, @kurt.kiefer, that this was also your observation?

However, KASAN also reports the “slab-out-of-bounds” bug on the 5.15 kernel from the official BSP 36.4.0.

We suppose the difference between 6.6 and 5.15 could lie in the order of initialization and allocations.
With 6.6, some other part of the kernel might already have claimed the memory region that the nvgpu driver later corrupts, and the crash occurs when that part tries to interpret the corrupted contents as a virtual memory address.
On 5.15, the order might be different, so the affected memory area is not claimed by any other driver; but that cannot just be assumed as a given, in my opinion.

That is also our experience; we don’t see the crash on 5.15. This is despite the fact that it does not allocate enough space for the physical id mapping structure either. It is therefore still a bug on that kernel that should be fixed; it only works by chance.

1 Like

0001-gpu-nvgpu-correct-size-of-gpc_tpc_physical_id_map.patch.txt (1.1 KB)

FYI here’s the patch I’m using to work around this issue for now, it seems to be working correctly.

3 Likes

Hi,
Do you observe the issue when using the default JetPack 6.1 (or 6.2)? The supported developer kit configuration is an Orin NX module on an Orin Nano carrier board. We would like to know whether the Orin NX modules work in this setup, or whether the issue is specific to using kernel 6.6.

As @kurt.kiefer correctly pointed out, it works on 5.15 only by chance. There clearly is a bug in the nvgpu module that corrupts the heap, as demonstrated in Kernel crashes on specific Orin NX modules - #8 by alexander.knaub. From there on, the behavior of the kernel is basically undefined. It could be that changing a tiny thing in the 5.15 kernel makes the crashes start to occur on 5.15 as well. Or the crash already occurs there with lower frequency and no one has observed it so far.

In my opinion there is no point in performing experiments on different systems with different kernel versions when the bug has already been identified. The workaround proposed by @kurt.kiefer seems to be working on the NX 8GB module but it would be great if NVIDIA provided a proper fix ASAP or at least confirmed that that workaround can be used until a proper fix exists.

2 Likes

Hi,

Thanks for reporting this.

Could you help to check if the below changes can fix the issue?
We are going to merge this change to our internal branch so the future release should contain the fix by default.

diff --git a/drivers/gpu/nvgpu/common/gr/gr_config.c b/drivers/gpu/nvgpu/common/gr/gr_config.c
index c86bba6..42cf687 100644
--- a/drivers/gpu/nvgpu/common/gr/gr_config.c
+++ b/drivers/gpu/nvgpu/common/gr/gr_config.c
@@ -686,8 +667,9 @@
 	 * logical id.
 	 */
 	config->gpc_tpc_physical_id_map = nvgpu_kzalloc(g,
-			nvgpu_safe_mult_u32((size_t)config->gpc_count,
-				sizeof(u32 *)));
+			nvgpu_safe_mult_u32(
+				nvgpu_safe_cast_u64_to_u32((size_t)config->max_gpc_count),
+				(u32)sizeof(u32 *)));
 	if (config->gpc_tpc_physical_id_map == NULL) {
 		nvgpu_err(g, "alloc gpc_tpc_physical_id_map failed");
 		goto clean_up_gpc_rop_config;
@@ -696,6 +678,11 @@
 	for (gpc_index = 0; gpc_index < config->gpc_count; gpc_index++) {
 		gpc_phys_id = nvgpu_grmgr_get_gr_gpc_phys_id(g,
 				cur_gr_instance, (u32)gpc_index);
+		if (gpc_phys_id >= config->max_gpc_count) {
+			nvgpu_err(g, "gpc_phys_id: %u is >= max_gpc_count: %u",
+					gpc_phys_id, config->max_gpc_count);
+			goto clean_up_gpc_tpc_physical_id_map_alloc_fail;
+		}
 		config->gpc_tpc_physical_id_map[gpc_phys_id] =
 			nvgpu_kzalloc(g, nvgpu_safe_mult_u32(
 				config->max_tpc_per_gpc_count, sizeof(u32)));

Thanks.

4 Likes

This is working for me. Thanks for getting a fix in for this.

Hi,

Thanks for the confirmation.
This will be available in the upcoming release.

2 Likes