After upgrading from (K)Ubuntu 23.04 to (K)Ubuntu 23.10 I started getting a lot of “UBSAN: array-index-out-of-bounds” complaints from the kernel. See attached logs for both 535.113 and 535.129.
kern-535.129.log (46.4 KB)
kern-535.113.log (78.0 KB)
I get this too. Also for the 545 driver.
I’m also getting this.
Ubuntu 23.10, Linux 6.5.0-10-generic, Nvidia driver 535.129.03-0ubuntu0.23.10.1, Quadro P400.
ubsan_dmesg.txt (167.2 KB)
lshw.txt (37.6 KB)
Relevant: The Undefined Behavior Sanitizer - UBSAN — The Linux Kernel documentation
Exact same thing on Ubuntu 23.10 with the 6.5.0-13-generic kernel and the Nvidia 545.29.06 driver:
[ 14.267145] ================================================================================
[ 14.267148] UBSAN: array-index-out-of-bounds in /var/lib/dkms/nvidia/545.29.06/build/nvidia-uvm/uvm_pmm_gpu.c:2364:28
[ 14.267149] index 0 is out of range for type ‘uvm_gpu_chunk_t []’
[ 14.267150] CPU: 6 PID: 2641 Comm: gst-plugin-scan Tainted: P OE 6.5.0-13-generic #13-Ubuntu
[ 14.267152] Hardware name: ASUS System Product Name/ROG MAXIMUS Z790 HERO, BIOS 1501 10/06/2023
[ 14.267153] Call Trace:
[ 14.267154]
[ 14.267156] dump_stack_lvl+0x48/0x70
[ 14.267163] dump_stack+0x10/0x20
[ 14.267164] __ubsan_handle_out_of_bounds+0xc6/0x110
[ 14.267167] split_gpu_chunk+0x13f/0x410 [nvidia_uvm]
[ 14.267200] uvm_pmm_gpu_alloc+0x2da/0x6d0 [nvidia_uvm]
[ 14.267224] phys_mem_allocate+0xac/0x230 [nvidia_uvm]
[ 14.267253] allocate_directory+0xb4/0x130 [nvidia_uvm]
[ 14.267279] ? allocate_directory+0xb4/0x130 [nvidia_uvm]
[ 14.267303] uvm_page_tree_init+0x12c/0x2e0 [nvidia_uvm]
[ 14.267329] uvm_gpu_retain_by_uuid+0x1a2b/0x2bb0 [nvidia_uvm]
[ 14.267351] uvm_va_space_register_gpu+0x47/0x740 [nvidia_uvm]
[ 14.267372] uvm_api_register_gpu+0x5a/0x90 [nvidia_uvm]
[ 14.267393] uvm_ioctl+0x1a26/0x1cd0 [nvidia_uvm]
[ 14.267411] ? ext4_inode_block_valid+0x1d/0x30
[ 14.267414] ? __ext4_ext_check+0x1ff/0x500
[ 14.267416] ? unlock_new_inode+0x55/0x70
[ 14.267417] ? __ext4_iget+0x9d1/0x1130
[ 14.267419] ? __d_add+0x118/0x1e0
[ 14.267420] ? _raw_spin_lock_irqsave+0xe/0x20
[ 14.267422] ? thread_context_non_interrupt_add+0x13a/0x2c0 [nvidia_uvm]
[ 14.267448] uvm_unlocked_ioctl_entry.part.0+0x7b/0xf0 [nvidia_uvm]
[ 14.267468] uvm_unlocked_ioctl_entry+0x6b/0x90 [nvidia_uvm]
[ 14.267487] __x64_sys_ioctl+0xa0/0xf0
[ 14.267488] do_syscall_64+0x59/0x90
[ 14.267490] ? exit_to_user_mode_prepare+0x30/0xb0
[ 14.267493] ? syscall_exit_to_user_mode+0x37/0x60
[ 14.267494] ? do_syscall_64+0x68/0x90
[ 14.267495] ? irqentry_exit+0x43/0x50
[ 14.267496] ? exc_page_fault+0x94/0x1b0
[ 14.267498] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 14.267500] RIP: 0033:0x7f83209238ef
[ 14.267525] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 14.267526] RSP: 002b:00007ffcadf061c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 14.267527] RAX: ffffffffffffffda RBX: 00007f831fd12840 RCX: 00007f83209238ef
[ 14.267528] RDX: 00007ffcadf06260 RSI: 0000000000000025 RDI: 000000000000000e
[ 14.267528] RBP: 00007ffcadf062c0 R08: 00007f831fd128d0 R09: 0000000000000000
[ 14.267529] R10: 00007f83208143c0 R11: 0000000000000246 R12: 000055f14f125f76
[ 14.267529] R13: 00007f831fd128d0 R14: 00007ffcadf06260 R15: 000000000000000e
[ 14.267530]
[ 14.267531] ================================================================================
This trace repeats in several variations, each pointing to a different line in the Nvidia source file named at the top of the report.
We have filed bug 4348950 internally to track this.
The issue has already been root-caused, and the fix will be available in a future driver release.
I should note that this warning is harmless. It’s due to the wrong size being declared on some arrays in UVM. You can see it in the code here: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/kernel-open/nvidia-uvm/uvm_pmm_gpu.c#L224
These are C “flexible array members” and should just be declared with no size.
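To illustrate the difference, here is a minimal sketch, not the actual UVM code. The names chunk_t, split_fixed_t, split_flex_t and split_flex_alloc are hypothetical, and the fixed size of 1 is only an assumption chosen to keep the example standard C. With a fixed size on the trailing array, the bounds check compares every index against the declared size even though more memory was allocated behind the struct. With a flexible array member, the compiler knows the length is decided at allocation time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-chunk record; not the real uvm_gpu_chunk_t. */
typedef struct {
    int id;
} chunk_t;

/* Problematic pattern (assumed for illustration): the trailing array is
 * declared with a fixed size, but callers allocate extra space and index
 * past the declared bound. A bounds sanitizer checks against the declared
 * size and reports those accesses. */
typedef struct {
    size_t count;
    chunk_t chunks[1];
} split_fixed_t;

/* C99 flexible array member: no size is declared, so the real length is
 * whatever the allocation provides. */
typedef struct {
    size_t count;
    chunk_t chunks[];
} split_flex_t;

/* Allocate the header plus room for `count` trailing elements. */
split_flex_t *split_flex_alloc(size_t count)
{
    assert(count > 0);
    split_flex_t *s = malloc(sizeof(*s) + count * sizeof(s->chunks[0]));
    if (s == NULL)
        return NULL;
    s->count = count;
    for (size_t i = 0; i < count; i++)
        s->chunks[i].id = (int)i;
    return s;
}
```

A caller would use split_flex_alloc(4), index chunks[0] through chunks[3], and release the whole object with a single free().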
I am on Ubuntu 23.04 server, running Windows in a QEMU/KVM VM with GPU passthrough of an Nvidia A4000. It crashes the entire host kernel at random times while I play demanding games. I tried to capture the log with kdump, and this is what I got: Question #708640 “QEMU/KVM crashes with GPU Passthrough at rando...” : Questions : Ubuntu
Is the error I am seeing the same as the one described in this thread? Is UBSAN crashing my host kernel?
This happens only when I am playing games in the Windows VM. I gave the VM 16 cores and 16 GB of RAM. Resource usage stays under those limits, yet the crash takes down the entire host kernel.
Also seeing this
Linux data 6.5.0-14-generic #14-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 14 14:59:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Driver Version: 545.29.06
Thank you so much @aplattner