UBSAN: array-index-out-of-bounds in /var/lib/dkms/nvidia/535.129.03/build/nvidia-uvm/uvm_pmm_gpu.c:829:45

Hello.

I recently upgraded from Ubuntu 23.04 to 23.10. On Ubuntu 23.10 I’m running kernel 6.5.0-10-generic and have installed the NVIDIA driver version 535.129.03. (My GPU is an RTX 2080 Ti; my CPU is an Intel i9.)

This is not exactly the same as this bug:

because the error in that Ubuntu bug report is for a different kernel module, but the underlying cause is probably the same.

Possibly the code for the nvidia-uvm module was written for kernel versions earlier than 6.5. When Ubuntu moved to Linux kernel 6.5, changes to UBSAN in that kernel started flagging code in modules such as nvidia-uvm, which now need patches to be compatible with Linux 6.5. Either NVIDIA has not yet provided a version of nvidia-uvm that is compatible with Linux 6.5, or Ubuntu has not applied an updated version from NVIDIA that is.

I didn’t see this error on Ubuntu 23.04, probably because it does not use kernel 6.5 by default, whereas 23.10 does.

I see a lot of these errors when I run the command “dmesg”, and audio/video streams won’t play.

Log:

[ 15.029102] UBSAN: array-index-out-of-bounds in /var/lib/dkms/nvidia/535.129.03/build/nvidia-uvm/uvm_pmm_gpu.c:829:45

[ 15.031655] index 0 is out of range for type 'uvm_gpu_chunk_t *[*]'
[ 15.034248] CPU: 9 PID: 2571 Comm: ffdetect Tainted: P OE 6.5.0-10-generic #10-Ubuntu
[ 15.034249] Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS PRO/Z390 AORUS PRO-CF, BIOS F12g GA9 06/08/2020
[ 15.034250] Call Trace:
[ 15.034251] <TASK>
[ 15.034251] dump_stack_lvl+0x48/0x70
[ 15.034255] dump_stack+0x10/0x20
[ 15.034257] __ubsan_handle_out_of_bounds+0xc6/0x110
[ 15.034259] merge_gpu_chunk+0x57/0x1d0 [nvidia_uvm]
[ 15.034293] free_chunk_with_merges+0x13d/0x180 [nvidia_uvm]
[ 15.034325] free_chunk+0xa4/0xd0 [nvidia_uvm]
[ 15.034355] uvm_pmm_gpu_free+0xbf/0xf0 [nvidia_uvm]
[ 15.034386] phys_mem_deallocate+0x33/0xd0 [nvidia_uvm]
[ 15.034422] uvm_page_tree_put_ptes_async+0x4d5/0x580 [nvidia_uvm]
[ 15.034459] uvm_page_table_range_vec_deinit+0x3e/0xd0 [nvidia_uvm]
[ 15.034494] uvm_va_range_destroy+0x14d/0x590 [nvidia_uvm]
[ 15.034527] ? os_release_spinlock+0x1a/0x30 [nvidia]
[ 15.034792] ? uvm_kvfree+0x30/0x70 [nvidia_uvm]
[ 15.034826] destroy_va_ranges.part.0+0x61/0x90 [nvidia_uvm]
[ 15.034857] uvm_user_channel_detach+0x9e/0xe0 [nvidia_uvm]
[ 15.034886] uvm_api_unregister_channel+0xee/0x1a0 [nvidia_uvm]
[ 15.034915] uvm_ioctl+0x1a04/0x1cd0 [nvidia_uvm]
[ 15.034939] ? uvm_api_unregister_channel+0x134/0x1a0 [nvidia_uvm]
[ 15.034968] ? _copy_to_user+0x25/0x70
[ 15.034970] ? uvm_ioctl+0x5cc/0x1cd0 [nvidia_uvm]
[ 15.034994] ? _raw_spin_lock_irqsave+0xe/0x20
[ 15.034996] ? thread_context_non_interrupt_add+0x13a/0x2c0 [nvidia_uvm]
[ 15.035031] uvm_unlocked_ioctl_entry.part.0+0x7b/0xf0 [nvidia_uvm]
[ 15.035055] ? uvm_thread_context_remove+0x39/0x50 [nvidia_uvm]
[ 15.035091] uvm_unlocked_ioctl_entry+0x6b/0x90 [nvidia_uvm]
[ 15.035115] __x64_sys_ioctl+0xa0/0xf0
[ 15.035116] do_syscall_64+0x59/0x90
[ 15.035118] ? __rseq_handle_notify_resume+0x37/0x70
[ 15.035119] ? exit_to_user_mode_loop+0xe0/0x130
[ 15.035122] ? exit_to_user_mode_prepare+0x9b/0xb0
[ 15.035123] ? syscall_exit_to_user_mode+0x37/0x60
[ 15.035125] ? do_syscall_64+0x68/0x90
[ 15.035126] ? syscall_exit_to_user_mode+0x37/0x60
[ 15.035128] ? do_syscall_64+0x68/0x90
[ 15.035129] ? syscall_exit_to_user_mode+0x37/0x60
[ 15.035130] ? do_syscall_64+0x68/0x90
[ 15.035131] ? do_syscall_64+0x68/0x90
[ 15.035133] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 15.035134] RIP: 0033:0x7f9b9e7238ef
[ 15.035144] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00
00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 15.035145] RSP: 002b:00007fff0ed9b9a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 15.035147] RAX: ffffffffffffffda RBX: 000000000239f8b8 RCX: 00007f9b9e7238ef
[ 15.035147] RDX: 00007fff0ed9ba10 RSI: 000000000000001c RDI: 0000000000000004
[ 15.035148] RBP: 00007fff0ed9ba50 R08: 000000000000242a R09: 0000000000000007
[ 15.035149] R10: 000000000242a3c0 R11: 0000000000000246 R12: 00007fff0ed9ba10
[ 15.035150] R13: 0000000000000004 R14: 0000000002506600 R15: 000000000237c138
[ 15.035151] </TASK>
[ 15.035152] ==========================================================
[ 15.037818] ==========================================================
[ 15.040413] UBSAN: array-index-out-of-bounds in /var/lib/dkms/nvidia/535.129.03/build/nvidia-uvm/uvm_pmm_gpu.c:857:39

[ 15.043033] index 0 is out of range for type 'uvm_gpu_chunk_t *[*]'
[ 15.045636] CPU: 9 PID: 2571 Comm: ffdetect Tainted: P OE 6.5.0-10-generic #10-Ubuntu
[ 15.045639] Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS PRO/Z390 AORUS PRO-CF, BIOS F12g GA9 06/08/2020
[ 15.045640] Call Trace:
[ 15.045641] <TASK>
[ 15.045642] dump_stack_lvl+0x48/0x70
[ 15.045647] dump_stack+0x10/0x20
[ 15.045649] __ubsan_handle_out_of_bounds+0xc6/0x110
[ 15.045652] merge_gpu_chunk+0xc6/0x1d0 [nvidia_uvm]
[ 15.045702] free_chunk_with_merges+0x13d/0x180 [nvidia_uvm]
[ 15.045734] free_chunk+0xa4/0xd0 [nvidia_uvm]
[ 15.045765] uvm_pmm_gpu_free+0xbf/0xf0 [nvidia_uvm]
[ 15.045795] phys_mem_deallocate+0x33/0xd0 [nvidia_uvm]
[ 15.045831] uvm_page_tree_put_ptes_async+0x4d5/0x580 [nvidia_uvm]
[ 15.045868] uvm_page_table_range_vec_deinit+0x3e/0xd0 [nvidia_uvm]
[ 15.045904] uvm_va_range_destroy+0x14d/0x590 [nvidia_uvm]
[ 15.045936] ? os_release_spinlock+0x1a/0x30 [nvidia]
[ 15.046201] ? uvm_kvfree+0x30/0x70 [nvidia_uvm]
[ 15.046236] destroy_va_ranges.part.0+0x61/0x90 [nvidia_uvm]
[ 15.046277] uvm_user_channel_detach+0x9e/0xe0 [nvidia_uvm]
[ 15.046315] uvm_api_unregister_channel+0xee/0x1a0 [nvidia_uvm]
[ 15.046354] uvm_ioctl+0x1a04/0x1cd0 [nvidia_uvm]
[ 15.046388] ? uvm_api_unregister_channel+0x134/0x1a0 [nvidia_uvm]
[ 15.046427] ? _copy_to_user+0x25/0x70
[ 15.046429] ? uvm_ioctl+0x5cc/0x1cd0 [nvidia_uvm]
[ 15.046463] ? _raw_spin_lock_irqsave+0xe/0x20
[ 15.046466] ? thread_context_non_interrupt_add+0x13a/0x2c0 [nvidia_uvm]
[ 15.046511] uvm_unlocked_ioctl_entry.part.0+0x7b/0xf0 [nvidia_uvm]
[ 15.046544] ? uvm_thread_context_remove+0x39/0x50 [nvidia_uvm]
[ 15.046589] uvm_unlocked_ioctl_entry+0x6b/0x90 [nvidia_uvm]
[ 15.046622] __x64_sys_ioctl+0xa0/0xf0
[ 15.046625] do_syscall_64+0x59/0x90
[ 15.046627] ? __rseq_handle_notify_resume+0x37/0x70
[ 15.046629] ? exit_to_user_mode_loop+0xe0/0x130
[ 15.046632] ? exit_to_user_mode_prepare+0x9b/0xb0
[ 15.046634] ? syscall_exit_to_user_mode+0x37/0x60
[ 15.046636] ? do_syscall_64+0x68/0x90
[ 15.046638] ? syscall_exit_to_user_mode+0x37/0x60
[ 15.046639] ? do_syscall_64+0x68/0x90
[ 15.046641] ? syscall_exit_to_user_mode+0x37/0x60
[ 15.046642] ? do_syscall_64+0x68/0x90
[ 15.046644] ? do_syscall_64+0x68/0x90
[ 15.046645] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 15.046648] RIP: 0033:0x7f9b9e7238ef
[ 15.046668] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00
00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 15.046669] RSP: 002b:00007fff0ed9b9a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 15.046671] RAX: ffffffffffffffda RBX: 000000000239f8b8 RCX: 00007f9b9e7238ef
[ 15.046672] RDX: 00007fff0ed9ba10 RSI: 000000000000001c RDI: 0000000000000004
[ 15.046673] RBP: 00007fff0ed9ba50 R08: 000000000000242a R09: 0000000000000007
[ 15.046674] R10: 000000000242a3c0 R11: 0000000000000246 R12: 00007fff0ed9ba10
[ 15.046675] R13: 0000000000000004 R14: 0000000002506600 R15: 000000000237c138
[ 15.046677] </TASK>
[ 15.046678] ==========================================================

I have seen this as well on Ubuntu 23.10. I have tried both the 535 and 545 versions of the NVIDIA driver and I get the same results. For me it happens consistently when trying to play a game in Steam in a Wayland session.

I get it all the time on the host with Ubuntu 23.10 and kernel 6.5.

I’m not doing any kind of video streaming or gaming, but the cards are in constant use by BOINC compute jobs.

I’m experiencing very odd behavior. If I turn the audio off, the video stream keeps playing; if I turn the audio on, the video stream freezes. This happens with driver 535.129.03 on Ubuntu 23.10 with kernels 6.5.0-generic and 6.2.0-36-generic. With driver 545, everything works fine on one of my Ubuntu 23.10 installations. I tried to repeat the same steps on a second Ubuntu 23.10 installation, but it didn’t work there, so I don’t understand what the trick is.

I see this too, Ubuntu 23.10, Linux kernel 6.5.0, NVIDIA driver 545 with modeset=1 running Blender under Wayland:

2023-11-14T12:00:44.240234+01:00 Pampelmuse kernel: [13278.121170] UBSAN: array-index-out-of-bounds in /home/gonsolo/Downloads/NVIDIA-Linux-x86_64-545.29.02/kernel/nvidia-uvm/uvm_pmm_gpu.c:2364:28
2023-11-14T12:00:44.240235+01:00 Pampelmuse kernel: [13278.121176] index 0 is out of range for type 'uvm_gpu_chunk_t []'
2023-11-14T12:00:44.240236+01:00 Pampelmuse kernel: [13278.121182] CPU: 20 PID: 32457 Comm: CMakeDetermineC Tainted: P OE 6.5.0-10-generic #10-Ubuntu
2023-11-14T12:00:44.240237+01:00 Pampelmuse kernel: [13278.121187] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Phantom Gaming 6, BIOS P1.31 01/14/2021
2023-11-14T12:00:44.240237+01:00 Pampelmuse kernel: [13278.121191] Call Trace:
2023-11-14T12:00:44.240237+01:00 Pampelmuse kernel: [13278.121195] <TASK>
2023-11-14T12:00:44.240238+01:00 Pampelmuse kernel: [13278.121201] dump_stack_lvl+0x48/0x70
2023-11-14T12:00:44.240239+01:00 Pampelmuse kernel: [13278.121218] dump_stack+0x10/0x20
2023-11-14T12:00:44.240239+01:00 Pampelmuse kernel: [13278.121224] __ubsan_handle_out_of_bounds+0xc6/0x110
2023-11-14T12:00:44.240240+01:00 Pampelmuse kernel: [13278.121235] split_gpu_chunk+0x13f/0x410 [nvidia_uvm]
2023-11-14T12:00:44.240240+01:00 Pampelmuse kernel: [13278.121316] uvm_pmm_gpu_alloc+0x2da/0x6d0 [nvidia_uvm]
2023-11-14T12:00:44.240241+01:00 Pampelmuse kernel: [13278.121396] phys_mem_allocate+0xac/0x230 [nvidia_uvm]
2023-11-14T12:00:44.240241+01:00 Pampelmuse kernel: [13278.121482] allocate_directory+0xb4/0x130 [nvidia_uvm]
2023-11-14T12:00:44.240242+01:00 Pampelmuse kernel: [13278.121563] ? allocate_directory+0xb4/0x130 [nvidia_uvm]
2023-11-14T12:00:44.240242+01:00 Pampelmuse kernel: [13278.121646] uvm_page_tree_init+0x12c/0x2e0 [nvidia_uvm]
2023-11-14T12:00:44.240242+01:00 Pampelmuse kernel: [13278.121733] uvm_gpu_retain_by_uuid+0x1a2b/0x2bb0 [nvidia_uvm]
2023-11-14T12:00:44.240243+01:00 Pampelmuse kernel: [13278.121811] uvm_va_space_register_gpu+0x47/0x740 [nvidia_uvm]
2023-11-14T12:00:44.240244+01:00 Pampelmuse kernel: [13278.121877] ? srso_return_thunk+0x5/0x10
2023-11-14T12:00:44.240244+01:00 Pampelmuse kernel: [13278.121884] ? __getblk_gfp+0x2b/0x80
2023-11-14T12:00:44.240245+01:00 Pampelmuse kernel: [13278.121895] uvm_api_register_gpu+0x5a/0x90 [nvidia_uvm]
2023-11-14T12:00:44.240245+01:00 Pampelmuse kernel: [13278.121961] uvm_ioctl+0x1a26/0x1cd0 [nvidia_uvm]
2023-11-14T12:00:44.240245+01:00 Pampelmuse kernel: [13278.122022] ? srso_return_thunk+0x5/0x10
2023-11-14T12:00:44.240246+01:00 Pampelmuse kernel: [13278.122031] ? __ext4_iget+0x9d1/0x1130
2023-11-14T12:00:44.240246+01:00 Pampelmuse kernel: [13278.122037] ? srso_return_thunk+0x5/0x10
2023-11-14T12:00:44.240247+01:00 Pampelmuse kernel: [13278.122043] ? srso_return_thunk+0x5/0x10
2023-11-14T12:00:44.240247+01:00 Pampelmuse kernel: [13278.122047] ? __d_add+0x118/0x1e0
2023-11-14T12:00:44.240248+01:00 Pampelmuse kernel: [13278.122055] ? srso_return_thunk+0x5/0x10
2023-11-14T12:00:44.240248+01:00 Pampelmuse kernel: [13278.122060] ? __do_sys_newuname+0xd5/0x140
2023-11-14T12:00:44.240248+01:00 Pampelmuse kernel: [13278.122069] ? srso_return_thunk+0x5/0x10
2023-11-14T12:00:44.240249+01:00 Pampelmuse kernel: [13278.122074] ? _raw_spin_lock_irqsave+0xe/0x20
2023-11-14T12:00:44.240249+01:00 Pampelmuse kernel: [13278.122079] ? srso_return_thunk+0x5/0x10
2023-11-14T12:00:44.240250+01:00 Pampelmuse kernel: [13278.122084] ? thread_context_non_interrupt_add+0x13a/0x2c0 [nvidia_uvm]
2023-11-14T12:00:44.240250+01:00 Pampelmuse kernel: [13278.122171] ? srso_return_thunk+0x5/0x10
2023-11-14T12:00:44.240250+01:00 Pampelmuse kernel: [13278.122178] uvm_unlocked_ioctl_entry.part.0+0x7b/0xf0 [nvidia_uvm]
2023-11-14T12:00:44.240251+01:00 Pampelmuse kernel: [13278.122246] uvm_unlocked_ioctl_entry+0x6b/0x90 [nvidia_uvm]
2023-11-14T12:00:44.240251+01:00 Pampelmuse kernel: [13278.122307] __x64_sys_ioctl+0xa3/0xf0
2023-11-14T12:00:44.240251+01:00 Pampelmuse kernel: [13278.122315] do_syscall_64+0x5c/0x90
2023-11-14T12:00:44.240252+01:00 Pampelmuse kernel: [13278.122320] ? do_user_addr_fault+0x17a/0x6b0
2023-11-14T12:00:44.240252+01:00 Pampelmuse kernel: [13278.122326] ? srso_return_thunk+0x5/0x10
2023-11-14T12:00:44.240253+01:00 Pampelmuse kernel: [13278.122331] ? exit_to_user_mode_prepare+0x30/0xb0
2023-11-14T12:00:44.240253+01:00 Pampelmuse kernel: [13278.122338] ? srso_return_thunk+0x5/0x10
2023-11-14T12:00:44.240254+01:00 Pampelmuse kernel: [13278.122343] ? irqentry_exit_to_user_mode+0x17/0x20
2023-11-14T12:00:44.240254+01:00 Pampelmuse kernel: [13278.122349] ? srso_return_thunk+0x5/0x10
2023-11-14T12:00:44.240255+01:00 Pampelmuse kernel: [13278.122353] ? irqentry_exit+0x43/0x50
2023-11-14T12:00:44.240255+01:00 Pampelmuse kernel: [13278.122358] ? srso_return_thunk+0x5/0x10
2023-11-14T12:00:44.240256+01:00 Pampelmuse kernel: [13278.122363] ? exc_page_fault+0x94/0x1b0
2023-11-14T12:00:44.240256+01:00 Pampelmuse kernel: [13278.122369] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
2023-11-14T12:00:44.240256+01:00 Pampelmuse kernel: [13278.122376] RIP: 0033:0x7f81eef238ef
2023-11-14T12:00:44.240257+01:00 Pampelmuse kernel: [13278.122417] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
2023-11-14T12:00:44.240257+01:00 Pampelmuse kernel: [13278.122422] RSP: 002b:00007ffd10f0b940 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
2023-11-14T12:00:44.240277+01:00 Pampelmuse kernel: [13278.122428] RAX: ffffffffffffffda RBX: 00007f81eeb12840 RCX: 00007f81eef238ef
2023-11-14T12:00:44.240279+01:00 Pampelmuse kernel: [13278.122432] RDX: 00007ffd10f0b9e0 RSI: 0000000000000025 RDI: 000000000000000b
2023-11-14T12:00:44.240280+01:00 Pampelmuse kernel: [13278.122435] RBP: 00007ffd10f0ba40 R08: 00007f81eeb128d0 R09: 0000000000000000
2023-11-14T12:00:44.240280+01:00 Pampelmuse kernel: [13278.122438] R10: 0000560ef7117600 R11: 0000000000000246 R12: 0000560ef70ee456
2023-11-14T12:00:44.240280+01:00 Pampelmuse kernel: [13278.122441] R13: 00007f81eeb128d0 R14: 00007ffd10f0b9e0 R15: 000000000000000b
2023-11-14T12:00:44.240281+01:00 Pampelmuse kernel: [13278.122451] </TASK>

More occurrences at:

uvm_mmu.c:536:51
uvm_pmm_gpu.c:2614:71
uvm_pmm_gpu.c:2044:63
uvm_pmm_gpu.c:2038:44
uvm_pmm_gpu.c:829:45
uvm_mmu.c:550:17
uvm_pmm_gpu.c:2364:28
uvm_pmm_gpu.c:746:68
uvm_pmm_gpu.c:857:39

The issue seems to be trailing arrays declared in the old zero-length style rather than as C99 flexible array members, e.g. uvm_pmm_gpu.c:224:

uvm_gpu_chunk_t *subchunks[0];

Maybe someone should take a look at this: LKML: Linus Torvalds: Re: VLA removal (was Re: [RFC 2/2] lustre: use VLA_SAFE)
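To make the distinction concrete, here is a minimal, self-contained sketch of the two declaration styles (the struct names and the stand-in typedef are hypothetical; only the subchunks declaration mirrors the driver code). With the stricter flexible-array handling that newer kernel builds appear to use, a zero-length trailing array is treated as having exactly zero elements, so even subchunks[0] trips UBSAN's array-index-out-of-bounds check, while a flexible array member does not:

#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the driver's chunk type; the real definition lives in
 * nvidia-uvm. Declared here only so the example compiles. */
typedef struct uvm_gpu_chunk_struct uvm_gpu_chunk_t;

/* Old GNU-style zero-length array, the pattern seen in uvm_pmm_gpu.c.
 * Under strict bounds checking the declared size is taken literally,
 * so any index into subchunks is reported as out of range. */
struct split_old {
    size_t count;
    uvm_gpu_chunk_t *subchunks[0];
};

/* C99 flexible array member: same memory layout, but the compiler and
 * sanitizer know the trailing array is intentionally unsized, so
 * indexing it is not flagged. */
struct split_new {
    size_t count;
    uvm_gpu_chunk_t *subchunks[];
};

int main(void)
{
    /* Allocate the header plus room for 4 trailing pointers, then touch
     * element 0, which is the access pattern the UBSAN reports point at. */
    struct split_new *s = malloc(sizeof(*s) + 4 * sizeof(uvm_gpu_chunk_t *));
    if (!s)
        return 1;
    s->count = 4;
    s->subchunks[0] = NULL;  /* fine with []; flagged with [0] under UBSAN bounds checks */
    free(s);
    return 0;
}

Converting the [0] declarations to [] (the kind of change the LKML thread above argues for) is presumably what an updated driver would need; this is only an illustration, not the actual NVIDIA fix.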

Yeah, interesting. I managed to make the problem go away, although I don’t know exactly why it’s gone. Roughly what I did: I installed Ubuntu 23.10 from scratch and then merged the files of the fresh 23.10 installation with the files of the older one, and that worked. I’ve also realized that the error shows up when I perform an upgrade from 23.04 to 23.10.