Hello!
We have some more insights now, which I’d like to share.
- The crashes also occur on some Orin 8GB modules with an “A23” in the silkscreen label.
  Still, the crashes seem to occur only on a subset of the modules.
- I have built a 6.6 kernel with some memory-related debugging options, including the Kernel Address Sanitizer (KASAN).
  Now, during each startup on an affected Orin module, KASAN prints a bug report; see the attached kasan_bugreport.txt.
kasan_bugreport.txt (4.8 KB)
This bug report is not printed on an Orin module that is not prone to the kernel crashes reported in the original post.
BUG: KASAN: slab-out-of-bounds in nvgpu_gr_config_init+0x1468/0x1df8 [nvgpu]
[ 17.210205] BUG: KASAN: slab-out-of-bounds in nvgpu_gr_config_init+0x1468/0x1df8 [nvgpu]
[ 17.211220] Write of size 8 at addr ffff000090c5bf40 by task nvpmodel/934
[ 17.211414]
[ 17.211462] CPU: 0 PID: 934 Comm: nvpmodel Tainted: G O 6.6.23-<our_kernel_branch> #1
[ 17.212944] Hardware name: <Our machine> with Orin NX 8 GB/Jetson, BIOS v36.4.0 10/01/2024
[ 17.217289] Call trace:
...
[ 17.241437] nvgpu_gr_config_init+0x1468/0x1df8 [nvgpu]
[ 17.246687] gr_init_support_impl+0x120/0x6a8 [nvgpu]
[ 17.251850] nvgpu_gr_init_support+0x148/0x348 [nvgpu]
...
[ 17.489325] The buggy address belongs to the object at ffff000090c5bf38
[ 17.489325] which belongs to the cache kmalloc-8 of size 8
[ 17.501050] The buggy address is located 0 bytes to the right of
[ 17.501050] allocated 8-byte region [ffff000090c5bf38, ffff000090c5bf40)
...
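To make the addresses in the report easier to read, here is a small C sketch of the arithmetic. The constants are copied from the log lines above; the helper names are made up for illustration, and the 8-byte element size is an assumption (it matches the write size and the kmalloc-8 cache, but I have not confirmed the element type in the driver).

```c
#include <stdint.h>

/* Values copied from the KASAN report above. */
#define OBJ_START   0xffff000090c5bf38ULL  /* start of the allocated object  */
#define OBJ_SIZE    8ULL                   /* "allocated 8-byte region"      */
#define ACCESS_ADDR 0xffff000090c5bf40ULL  /* "Write of size 8 at addr ..."  */
#define ACCESS_SIZE 8ULL

/* Byte offset of an access from the start of the object.
 * Assumes addr >= obj_start. */
static inline uint64_t access_offset(uint64_t obj_start, uint64_t addr)
{
    return addr - obj_start;
}

/* Element index for a given byte offset.
 * elem_size is an assumption (8 bytes here), not taken from the driver. */
static inline uint64_t element_index(uint64_t offset, uint64_t elem_size)
{
    return offset / elem_size;
}

/* Returns 1 if [addr, addr + len) lies fully inside the object, else 0. */
static inline int access_in_bounds(uint64_t obj_start, uint64_t obj_size,
                                   uint64_t addr, uint64_t len)
{
    if (addr < obj_start)
        return 0;
    uint64_t off = addr - obj_start;
    return off <= obj_size && len <= obj_size - off;
}
```

With the values from the report, the offset is 8 bytes, which is exactly the end of the 8-byte object ("0 bytes to the right"). If the elements are 8 bytes wide, that corresponds to index 1 of an array that only has room for index 0.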
I have tried to track down the origin of the reported issue, and it looks like it occurs in this line:
https://nv-tegra.nvidia.com/r/gitweb?p=linux-nvgpu.git;a=blob;f=drivers/gpu/nvgpu/common/gr/gr_config.c;h=c86bba610c6b1aa57da81824a0128a104e3cd0ae;hb=8a0a5345705e069e398a79dbcba96c5db54a37f1#l699
(apparently same code in BSP 36.4.0 and 36.4.4)
There, the array is written at index 1, but debug logging showed that the array was allocated about 10 lines above to hold only one element, so this appears to be an out-of-bounds array write.
I can imagine that if the nvgpu driver writes data where it is not supposed to, heap corruption occurs, which could in theory cause the crashes reported above.
I wonder why this happens only on some specific Orin modules and not on others.
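For illustration only, here is a minimal userspace C sketch of the pattern and of the kind of bounds check that would catch it. This is not the nvgpu code: `struct u64_table` and the `table_*` helpers are hypothetical names, and I am assuming 8-byte elements.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical table type; not a structure from the nvgpu sources. */
struct u64_table {
    size_t    count;  /* number of elements actually allocated */
    uint64_t *data;   /* zero-initialised storage for `count` elements */
};

/* Allocate a table with `count` elements. Returns 0 on success, -1 on failure. */
static int table_alloc(struct u64_table *t, size_t count)
{
    t->data = calloc(count, sizeof(*t->data));
    if (t->data == NULL) {
        t->count = 0;
        return -1;
    }
    t->count = count;
    return 0;
}

/* Checked write: refuses any index outside the allocation.
 * Returns 0 on success, -1 if idx is out of bounds. */
static int table_set(struct u64_table *t, size_t idx, uint64_t val)
{
    if (idx >= t->count)
        return -1;  /* an unchecked write here would corrupt the heap */
    t->data[idx] = val;
    return 0;
}

/* Release the storage and reset the table. */
static void table_free(struct u64_table *t)
{
    free(t->data);
    t->data = NULL;
    t->count = 0;
}
```

With a table allocated for one element, `table_set(&t, 0, ...)` succeeds and `table_set(&t, 1, ...)` is rejected. Without such a check, the write at index 1 lands in whatever slab object happens to follow the allocation.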
Edit 2025-04-15: Fixed the link to the OOT module source code.