Kernel crashes on specific Orin NX modules

Hello!

We have some more insights now, which I’d like to share.

  • The crashes also occur on some Orin 8GB modules with an “A23” in the silkscreen label.
    However, they still seem to occur only on a subset of the modules.

  • I have built a 6.6 kernel with some memory-related debugging options enabled, including the Kernel Address Sanitizer (KASAN).
    Now, during each startup on an affected Orin module, KASAN prints a bug report; see the attached kasan_bugreport.txt
    kasan_bugreport.txt (4.8 KB)

This bug report is not printed on an Orin module that is not prone to the kernel crashes reported in the original post.

[   17.210205] BUG: KASAN: slab-out-of-bounds in nvgpu_gr_config_init+0x1468/0x1df8 [nvgpu]
[   17.211220] Write of size 8 at addr ffff000090c5bf40 by task nvpmodel/934
[   17.211414] 
[   17.211462] CPU: 0 PID: 934 Comm: nvpmodel Tainted: G           O       6.6.23-<our_kernel_branch> #1
[   17.212944] Hardware name: <Our machine> with Orin NX 8 GB/Jetson, BIOS v36.4.0 10/01/2024
[   17.217289] Call trace:
...
[   17.241437]  nvgpu_gr_config_init+0x1468/0x1df8 [nvgpu]
[   17.246687]  gr_init_support_impl+0x120/0x6a8 [nvgpu]
[   17.251850]  nvgpu_gr_init_support+0x148/0x348 [nvgpu]
...
[   17.489325] The buggy address belongs to the object at ffff000090c5bf38
[   17.489325]  which belongs to the cache kmalloc-8 of size 8
[   17.501050] The buggy address is located 0 bytes to the right of
[   17.501050]  allocated 8-byte region [ffff000090c5bf38, ffff000090c5bf40)
...

I have tried to track down the origin of the reported issue, and it looks like it occurs in this line:
https://nv-tegra.nvidia.com/r/gitweb?p=linux-nvgpu.git;a=blob;f=drivers/gpu/nvgpu/common/gr/gr_config.c;h=c86bba610c6b1aa57da81824a0128a104e3cd0ae;hb=8a0a5345705e069e398a79dbcba96c5db54a37f1#l699
(apparently the same code in BSP 36.4.0 and 36.4.4)

There, array index “1” is written, but with debug logging I found out that the array was allocated to hold only one element, about 10 lines above. So this appears to be an out-of-bounds array access.
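
The numbers in the KASAN report are consistent with that reading. As a quick sanity check, here is a small Python sketch that maps the faulting address back to an array index. The helper name `element_index` is made up for illustration, and the 8-byte element size is an assumption based on the "Write of size 8" line, not something taken from the nvgpu source.

```python
def element_index(object_start: int, access_addr: int, elem_size: int) -> int:
    """Return the array index that an access at access_addr corresponds to."""
    offset = access_addr - object_start
    if offset % elem_size != 0:
        raise ValueError("access is not aligned to an element boundary")
    return offset // elem_size


# Values copied from the KASAN report.
OBJECT_START = 0xFFFF000090C5BF38  # start of the allocated 8-byte region
OBJECT_SIZE = 8                    # object from the kmalloc-8 cache
ACCESS_ADDR = 0xFFFF000090C5BF40   # address of the faulting write
ELEM_SIZE = 8                      # assumption: one element is as wide as the write

capacity = OBJECT_SIZE // ELEM_SIZE
index = element_index(OBJECT_START, ACCESS_ADDR, ELEM_SIZE)
print(f"capacity={capacity} element(s), written index={index}")
```

Under these assumptions the allocation holds exactly one element (valid index 0 only), and the faulting write lands on index 1, which is what "0 bytes to the right of" the region means.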

I can imagine that if the nvgpu driver writes data to where it’s not supposed to, heap corruption occurs, which could in theory cause the crashes reported in the original post.
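
To illustrate what I mean by heap corruption, here is a toy model in Python. It is only a sketch and not how the kernel slab allocator really works: it assumes 8-byte objects packed back to back with no red zones, and all names (`SLOT_SIZE`, `write_u64`, `read_u64`) are hypothetical.

```python
import struct

SLOT_SIZE = 8  # assumption: 8-byte objects packed back to back, no red zones


def write_u64(heap: bytearray, slot: int, index: int, value: int) -> None:
    """Write a 64-bit value to element `index` of the array that lives in `slot`.

    Like the C code, this does not check `index` against the slot size.
    """
    offset = slot * SLOT_SIZE + index * 8
    struct.pack_into("<Q", heap, offset, value)


def read_u64(heap: bytearray, slot: int) -> int:
    """Read the 64-bit value stored at the start of `slot`."""
    return struct.unpack_from("<Q", heap, slot * SLOT_SIZE)[0]


heap = bytearray(4 * SLOT_SIZE)

# Slot 1 belongs to some unrelated kernel object.
write_u64(heap, 1, 0, 0x1111111111111111)

# Slot 0 is the one-element array.
write_u64(heap, 0, 0, 0xAAAA)  # index 0: in bounds
write_u64(heap, 0, 1, 0xBBBB)  # index 1: out of bounds, lands in slot 1

neighbour = read_u64(heap, 1)
print(hex(neighbour))  # the unrelated object now holds 0xbbbb
```

In this model the owner of slot 1 later reads a value it never wrote. If that value were a pointer or a length in a real kernel object, a crash at an unrelated place would be a plausible result.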

I wonder why this happens only on some specific Orin modules and not on others.


Edit 2025-04-15: Fixed the link to the OOT module source code.