Kernel crashes on specific Orin NX modules

Hello!

We have built a couple of devices based on Jetson Orin NX 8GB and a custom carrier board.
We use BSP v36.4.0 with Linux Kernel 6.6.23.

Unfortunately, on specific Jetson modules we observe system freezes caused by kernel crashes which we cannot explain.
Here are some observations we’ve collected so far. I would be glad if this turned out to be a known issue and you could point us to a solution or a workaround.

  • The kernel crash message is always “Unable to handle kernel paging request at virtual address” and “address between user and kernel address ranges”
    See attached snippet for an example.
    The errors occur at seemingly random time points, usually around 1-3 minutes into Linux uptime, and the message contains PIDs of seemingly random userspace programs.
    crash_message.txt (5.0 KB)

  • The sporadic crashes occur reproducibly on some Jetson modules (after a random number of reboots), and not at all on others.
    We have also confirmed the dependency on the specific modules by cross-exchanging the modules and carriers in scenarios like this:
    Jetson “A”, carrier “A” → crashes occur
    Jetson “B”, carrier “B” → No crashes
    Jetson “A”, carrier “B” → crashes occur
    Jetson “B”, carrier “A” → No crashes

    We have lists of serial numbers for at least 4 Jetson modules that are prone to crashes, and another 4 that run without such issues. I gave those lists to our Nvidia contact person who suggested that I write this forum post.

  • We see a correlation between the crashes and the silkscreen printing on the Jetson PCBs. Maybe the correlation is just a coincidence, but the modules that are prone to crashes all have an “A13” printed on them (e.g. “180-13767-DAAA-A13”), and all of those with an “A23” (e.g. “180-13767-DAAA-A23”) run without issues. However, there are also modules labelled “A13” that do not show the crashing issue.

  • We have seen PCN210361 announcing the change of DRAM in the BOM. According to that document, BSP 36.1 from JetPack 6.0 should be sufficient for compatibility with all board versions, and we use JetPack 6.1 with BSP v36.4.0.

  • We have been using modules of both marking variants described in PCN210341: those without the DRAM part encoded in the 2D barcode, and those with this information. In our observations, only modules that did not have the DRAM information in the barcode were affected.

Thanks in advance for any hint that could lead us to a solution!

1 Like

For R36.3.0, the official kernel version is 5.15.120, isn’t it?

Yes, and for R36.4.0 it is 5.15.148.
We made use of the “Bring your own kernel” option though and applied the necessary patches, in accordance with this reference: Bring Your Own Kernel — NVIDIA Jetson Linux Developer Guide 1 documentation.

2 Likes

Just chiming in here to say I’ve got the exact same problem (some boots are ok, some are randomly crashy, always starting with allocation fails downstream of nvgpu), and also on an Orin NX 8GB module with both kernels 6.6 and 6.12 on top of 36.4.0. I don’t see this with an Orin AGX 64GB or with NVIDIA’s 5.15 kernel.

I’ve got two NX modules on my desk: one is an A13 (the one that crashes), the other is an A23, which is a 16 GB module. I’m set up only for the 8 GB module SKUs, so I haven’t been able to test the 16 GB module yet.

1 Like

Update: I also wasn’t able to reproduce this issue with the A23 (16 GB) Orin module. It’s only a problem on the 8 GB Orin module.

1 Like

Hello!

We have some more insights now, which I’d like to share.

  • The crashes also occur on some Orin 8GB modules with an “A23” in the silkscreen label.
    Still, the crashes seem to occur only on a subset of the modules.

  • I have built a kernel 6.6 with some memory-related debugging options, including the Kernel Address Sanitizer (KASAN).
    Now, during each startup on an affected Orin module, a bug report is printed by KASAN; see the attached kasan_bugreport.txt
    kasan_bugreport.txt (4.8 KB)

This bug report is not printed on an Orin module that is not prone to the kernel crashes reported in the original post.

[   17.210205] BUG: KASAN: slab-out-of-bounds in nvgpu_gr_config_init+0x1468/0x1df8 [nvgpu]
[   17.211220] Write of size 8 at addr ffff000090c5bf40 by task nvpmodel/934
[   17.211414] 
[   17.211462] CPU: 0 PID: 934 Comm: nvpmodel Tainted: G           O       6.6.23-<our_kernel_branch> #1
[   17.212944] Hardware name: <Our machine> with Orin NX 8 GB/Jetson, BIOS v36.4.0 10/01/2024
[   17.217289] Call trace:
...
[   17.241437]  nvgpu_gr_config_init+0x1468/0x1df8 [nvgpu]
[   17.246687]  gr_init_support_impl+0x120/0x6a8 [nvgpu]
[   17.251850]  nvgpu_gr_init_support+0x148/0x348 [nvgpu]
...
[   17.489325] The buggy address belongs to the object at ffff000090c5bf38
[   17.489325]  which belongs to the cache kmalloc-8 of size 8
[   17.501050] The buggy address is located 0 bytes to the right of
[   17.501050]  allocated 8-byte region [ffff000090c5bf38, ffff000090c5bf40)
...

I have tried to track the origin of the reported issue, and it looks like it occurs in this line
https://nv-tegra.nvidia.com/r/gitweb?p=linux-nvgpu.git;a=blob;f=drivers/gpu/nvgpu/common/gr/gr_config.c;h=c86bba610c6b1aa57da81824a0128a104e3cd0ae;hb=8a0a5345705e069e398a79dbcba96c5db54a37f1#l699
(apparently same code in BSP 36.4.0 and 36.4.4)

There, array index “1” is write-accessed, but with debug logging I found that the array was allocated to hold only one element about 10 lines above, so this is apparently an out-of-bounds array access.

I can imagine that if the nvgpu driver writes data where it is not supposed to, heap corruption occurs, which could theoretically cause the crashes reported above.

I wonder why this only happens on some specific Orin modules, and not on others.


Edit 2025-04-15: Fixed the link to the OOT module source code.

2 Likes

Great find. I patched the allocation of gpc_tpc_physical_id_map to use the size gpc_count + 1, which fixes the issue in question on the NX 8GB module and works on other systems as well.

On the NX 8GB, we observe that where gpu_instance_id/gpc_local_id is 0, the physical_id is 1, while gpc_count is only 1. On the AGX 64GB, instance/local 0/0 maps to physical 0, and 0/1 maps to 1.

So it’s unclear precisely what range the physical ids are expected to fall in, but the assumption that all physical ids stay below gpc_count doesn’t seem to be correct. On the AGX 64GB, gpc_count is 2 and the ids are 0 and 1, so the assumption holds there, which explains why we didn’t see the crash; that is not the case on the NX 8GB. I haven’t been able to check the NX 16GB module.

2 Likes

We have also tried running a Linux 5.15 system on the same NX 8GB hardware, with the custom carrier.

The crashes indeed do not appear in this scenario. Did I understand your first posting right, @kurt.kiefer, that this was also your observation?

However, KASAN also reports the “slab-out-of-bounds” bug on the 5.15 kernel from the official BSP 36.4.0.

We suppose the difference between 6.6 and 5.15 could lie in the order of initialization and allocations.
With 6.6, some other part of the kernel might already have claimed the memory region that the nvgpu driver later corrupts, and the crash occurs when that part tries to interpret the corrupted contents as a virtual memory address.
On 5.15, the order might be different, so the affected memory area is not claimed by any other driver; but that cannot just be assumed as a given, in my opinion.

That is also our experience; we don’t see the crash on 5.15. This is despite the fact that it does not allocate enough space for the physical id mapping structure either. It is therefore still a bug on that kernel that should be fixed; it only works by chance.

1 Like

0001-gpu-nvgpu-correct-size-of-gpc_tpc_physical_id_map.patch.txt (1.1 KB)

FYI here’s the patch I’m using to work around this issue for now, it seems to be working correctly.

3 Likes

Hi,
Do you observe the issue when using the default JetPack 6.1 (or 6.2)? The supported developer kit configuration is an Orin NX module on an Orin Nano carrier board. We would like to know whether the Orin NX modules work in this setup, or whether the issue is specific to using kernel 6.6.

As @kurt.kiefer correctly pointed out, it works on 5.15 only by chance. There clearly is a bug in the nvgpu module that corrupts the heap, as demonstrated in Kernel crashes on specific Orin NX modules - #8 by alexander.knaub. From there on, the behavior of the kernel is basically undefined. It could be that changing a tiny thing in the 5.15 kernel makes the crashes start to occur on 5.15 as well. Or the crash already occurs there with lower frequency and no one has observed it so far.

In my opinion there is no point in performing experiments on different systems with different kernel versions when the bug has already been identified. The workaround proposed by @kurt.kiefer seems to be working on the NX 8GB module but it would be great if NVIDIA provided a proper fix ASAP or at least confirmed that that workaround can be used until a proper fix exists.

2 Likes

Hi,

Thanks for reporting this.

Could you help to check if the below changes can fix the issue?
We are going to merge this change to our internal branch so the future release should contain the fix by default.

diff --git a/drivers/gpu/nvgpu/common/gr/gr_config.c b/drivers/gpu/nvgpu/common/gr/gr_config.c
index c86bba6..42cf687 100644
--- a/drivers/gpu/nvgpu/common/gr/gr_config.c
+++ b/drivers/gpu/nvgpu/common/gr/gr_config.c
@@ -686,8 +667,9 @@
 	 * logical id.
 	 */
 	config->gpc_tpc_physical_id_map = nvgpu_kzalloc(g,
-			nvgpu_safe_mult_u32((size_t)config->gpc_count,
-				sizeof(u32 *)));
+			nvgpu_safe_mult_u32(
+				nvgpu_safe_cast_u64_to_u32((size_t)config->max_gpc_count),
+				(u32)sizeof(u32 *)));
 	if (config->gpc_tpc_physical_id_map == NULL) {
 		nvgpu_err(g, "alloc gpc_tpc_physical_id_map failed");
 		goto clean_up_gpc_rop_config;
@@ -696,6 +678,11 @@
 	for (gpc_index = 0; gpc_index < config->gpc_count; gpc_index++) {
 		gpc_phys_id = nvgpu_grmgr_get_gr_gpc_phys_id(g,
 				cur_gr_instance, (u32)gpc_index);
+		if (gpc_phys_id >= config->max_gpc_count) {
+			nvgpu_err(g, "gpc_phys_id: %u is >= max_gpc_count: %u",
+					gpc_phys_id, config->max_gpc_count);
+			goto clean_up_gpc_tpc_physical_id_map_alloc_fail;
+		}
 		config->gpc_tpc_physical_id_map[gpc_phys_id] =
 			nvgpu_kzalloc(g, nvgpu_safe_mult_u32(
 				config->max_tpc_per_gpc_count, sizeof(u32)));

Thanks.

4 Likes

This is working for me. Thanks for getting a fix in for this.

Hi,

Thanks for the confirmation.
This will be available in the upcoming release.

2 Likes