Driver crashes after upgrade from 510.68.02 to 515.48.07 on GTX 1080 Ti

This is on Arch Linux with the stock 5.18.3 kernel. It doesn’t matter whether the driver is loaded during boot or after the system has booted.
On 510.68.02 everything worked fine.

The nvidia-bug-report.sh script never finishes, so I have attached the partial output.
nvidia-bug-report.log.gz (766.7 KB)

Relevant dmesg output:

[ 7755.567880] nvidia 0000:14:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 7755.698996] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  515.48.07  Fri May 27 03:18:00 UTC 2022
[ 7755.700096] [drm] [nvidia-drm] [GPU ID 0x00001400] Loading driver
[ 7756.351783] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:14:00.0 on minor 1
[ 7756.362363] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 7756.370912] nvidia-uvm: Loaded the UVM driver, major device number 503.
[ 7768.396285] nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT])
[ 7768.410230] nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer
[ 7784.451875] BUG: kernel NULL pointer dereference, address: 0000000000000070
[ 7784.451880] #PF: supervisor read access in kernel mode
[ 7784.451881] #PF: error_code(0x0000) - not-present page
[ 7784.451882] PGD 0 P4D 0 
[ 7784.451884] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 7784.451886] CPU: 8 PID: 123052 Comm: Xorg Tainted: P        W  OE     5.18.3-arch1-1 #1 2090c6f1d9d20f39bd14c0acb6fa89ddb994d43f
[ 7784.451889] Hardware name: ASUS System Product Name/ROG CROSSHAIR VIII HERO (WI-FI), BIOS 4201 04/26/2022
[ 7784.451890] RIP: 0010:_nv002521kms+0x18/0x70 [nvidia_modeset]
[ 7784.451913] Code: ff c6 44 24 2f 01 eb af 66 2e 0f 1f 84 00 00 00 00 00 41 54 55 49 89 fc 53 89 d5 41 b8 04 00 00 00 ba 02 01 02 00 48 83 ec 10 <8b> 46 70 8b 3d cf cd 0f 00 48 8d 4c 24 0c 89 ee 89 44 24 0c e8 bf
[ 7784.451915] RSP: 0018:ffffb78162cefc30 EFLAGS: 00010286
[ 7784.451916] RAX: 0000000000000000 RBX: 0000000020020000 RCX: 0000000000006c08
[ 7784.451918] RDX: 0000000000020102 RSI: 0000000000000000 RDI: ffff9f69a0fc7008
[ 7784.451919] RBP: 0000000000010009 R08: 0000000000000004 R09: 00000000fffffffe
[ 7784.451919] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9f69a0fc7008
[ 7784.451921] R13: ffff9f69a0fc70a0 R14: 0000000000000fff R15: 0000000000010008
[ 7784.451922] FS:  00007fafe5f0e100(0000) GS:ffff9f6f2ec00000(0000) knlGS:0000000000000000
[ 7784.451923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7784.451924] CR2: 0000000000000070 CR3: 0000000a66862000 CR4: 0000000000350ee0
[ 7784.451926] Call Trace:
[ 7784.451927]  <TASK>
[ 7784.451929]  _nv002520kms+0xb3/0x150 [nvidia_modeset 658855af3f998e4c93b01ca095c666f654e238c7]
[ 7784.451948]  _nv002294kms+0x4da/0x720 [nvidia_modeset 658855af3f998e4c93b01ca095c666f654e238c7]
[ 7784.451967]  ? __check_object_size+0x143/0x160
[ 7784.451971]  ? _nv000448kms+0xa0/0xa0 [nvidia_modeset 658855af3f998e4c93b01ca095c666f654e238c7]
[ 7784.451987]  _nv000633kms+0x34/0x50 [nvidia_modeset 658855af3f998e4c93b01ca095c666f654e238c7]
[ 7784.452003]  nvKmsIoctl+0x96/0x1d0 [nvidia_modeset 658855af3f998e4c93b01ca095c666f654e238c7]
[ 7784.452019]  nvkms_ioctl+0x118/0x180 [nvidia_modeset 658855af3f998e4c93b01ca095c666f654e238c7]
[ 7784.452036]  nvidia_frontend_unlocked_ioctl+0x3c/0x50 [nvidia cdb2ec842bb5797c80be4ec5b9ce9c833bd49f74]
[ 7784.452259]  __x64_sys_ioctl+0x91/0xc0
[ 7784.452262]  do_syscall_64+0x5f/0x90
[ 7784.452265]  ? __x64_sys_ioctl+0x91/0xc0
[ 7784.452267]  ? syscall_exit_to_user_mode+0x26/0x50
[ 7784.452268]  ? do_syscall_64+0x6b/0x90
[ 7784.452270]  ? nvidia_frontend_unlocked_ioctl+0x3c/0x50 [nvidia cdb2ec842bb5797c80be4ec5b9ce9c833bd49f74]
[ 7784.452479]  ? __x64_sys_ioctl+0x91/0xc0
[ 7784.452481]  ? syscall_exit_to_user_mode+0x26/0x50
[ 7784.452482]  ? do_syscall_64+0x6b/0x90
[ 7784.452484]  ? do_syscall_64+0x6b/0x90
[ 7784.452485]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 7784.452487] RIP: 0033:0x7fafe68727af
[ 7784.452489] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 7784.452490] RSP: 002b:00007fffacc80fb0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 7784.452492] RAX: ffffffffffffffda RBX: 00000000c0106d00 RCX: 00007fafe68727af
[ 7784.452493] RDX: 00007fffacc81010 RSI: 00000000c0106d00 RDI: 0000000000000012
[ 7784.452494] RBP: 00007fffacc81010 R08: 00007fffacc805a0 R09: 00007fffacc805bc
[ 7784.452495] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000012
[ 7784.452495] R13: 00007fffacc81060 R14: 0000558a47236530 R15: 00007fafe528e880
[ 7784.452497]  </TASK>
[ 7784.452498] Modules linked in: nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) cdc_acm rfcomm snd_seq_dummy snd_hrtimer snd_seq overlay wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel bridge stp llc nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 cmac algif_hash algif_skcipher nf_tables af_alg nct6775 hwmon_vid nfnetlink bnep lm92 xfs sch_cake iwlmvm snd_hda_codec_realtek intel_rapl_msr vboxnetflt(OE) intel_rapl_common mac80211 snd_hda_codec_generic vboxnetadp(OE) libarc4 ledtrig_audio snd_hda_codec_hdmi snd_hda_intel edac_mce_amd btusb vboxdrv(OE) snd_intel_dspcfg snd_usb_audio btrtl snd_intel_sdw_acpi pkcs8_key_parser btbcm snd_usbmidi_lib iwlwifi snd_hda_codec btintel pktcdvd asus_ec_sensors i2c_dev uvcvideo kvm_amd btmtk snd_rawmidi eeepc_wmi iwlmei snd_hda_core videobuf2_vmalloc videobuf2_memops snd_seq_device bluetooth snd_hwdep videobuf2_v4l2 asus_wmi snd_pcm cfg80211
[ 7784.452531]  videobuf2_common kvm sparse_keymap r8169 snd_timer videodev ecdh_generic realtek platform_profile igb sp5100_tco vfat snd joydev mdio_devres rfkill mousedev fat rapl mc video pcspkr wmi_bmof mxm_wmi k10temp i2c_piix4 soundcore crc16 libphy mei dca mac_hid pinctrl_amd acpi_cpufreq dm_multipath kvmfr(OE) ipmi_devintf ipmi_msghandler sg crypto_user fuse lzo_rle zram ip_tables x_tables btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic dm_crypt cbc encrypted_keys trusted asn1_encoder tee hid_logitech_hidpp hid_logitech_dj dm_mod amdgpu drm_ttm_helper uas usb_storage ttm usbhid crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel tpm_crb crypto_simd gpu_sched nvme sr_mod cryptd tpm_tis ccp drm_dp_helper xhci_pci tpm_tis_core cdrom nvme_core xhci_pci_renesas tpm wmi rng_core vfio_pci vfio_pci_core irqbypass vfio_virqfd vfio_iommu_type1 vfio tcp_bbr
[ 7784.452571] CR2: 0000000000000070
[ 7784.452573] ---[ end trace 0000000000000000 ]---
[ 7784.452573] RIP: 0010:_nv002521kms+0x18/0x70 [nvidia_modeset]
[ 7784.452592] Code: ff c6 44 24 2f 01 eb af 66 2e 0f 1f 84 00 00 00 00 00 41 54 55 49 89 fc 53 89 d5 41 b8 04 00 00 00 ba 02 01 02 00 48 83 ec 10 <8b> 46 70 8b 3d cf cd 0f 00 48 8d 4c 24 0c 89 ee 89 44 24 0c e8 bf
[ 7784.452593] RSP: 0018:ffffb78162cefc30 EFLAGS: 00010286
[ 7784.452594] RAX: 0000000000000000 RBX: 0000000020020000 RCX: 0000000000006c08
[ 7784.452595] RDX: 0000000000020102 RSI: 0000000000000000 RDI: ffff9f69a0fc7008
[ 7784.452596] RBP: 0000000000010009 R08: 0000000000000004 R09: 00000000fffffffe
[ 7784.452596] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9f69a0fc7008
[ 7784.452597] R13: ffff9f69a0fc70a0 R14: 0000000000000fff R15: 0000000000010008
[ 7784.452598] FS:  00007fafe5f0e100(0000) GS:ffff9f6f2ec00000(0000) knlGS:0000000000000000
[ 7784.452599] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7784.452600] CR2: 0000000000000070 CR3: 0000000a66862000 CR4: 0000000000350ee0
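
A note for anyone reading the oops: as far as I can tell, the marked bytes `<8b> 46 70` in the Code line decode to `mov 0x70(%rsi),%eax`, and RSI is 0 in the register dump. That would explain why CR2 is exactly 0x70: the code reads a 32-bit field at offset 0x70 through a NULL pointer. Below is a minimal C sketch of that arithmetic. The struct and its member names are hypothetical and are not taken from nvidia-modeset; only the 0x70 offset comes from the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, NOT a real nvidia-modeset structure.
 * Only the 0x70 offset is taken from the oops (CR2 and "<8b> 46 70"). */
struct toy_object {
    unsigned char unknown[0x70]; /* whatever precedes the field */
    uint32_t field_at_0x70;      /* the 32-bit value the code tried to read */
};

static_assert(offsetof(struct toy_object, field_at_0x70) == 0x70,
              "field must sit at the offset seen in the oops");

/* Address the CPU touches when loading base->field_at_0x70. */
uintptr_t load_address(uintptr_t base)
{
    return base + offsetof(struct toy_object, field_at_0x70);
}
```

With a base of 0 the load address is 0x70, which matches the "NULL pointer dereference, address: 0000000000000070" line. Which object was NULL cannot be read off the log; presumably it is something that was never set up after the push buffer allocation failed, but that is a guess.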

Maybe you encountered this?

Don’t mind the repo name; this happens with both the proprietary and the open-source kernel driver on the 5.18 kernel.

I have AMD CPU, not Intel, so this is not applicable for me.

I am seeing the same crash with the 5.18.3 kernel.
I’m trying to get an A5500 working under Linux, which requires 515 as a minimum.

I had a similar problem after upgrading to my own 5.11.8 kernel. nvidia-installer builds the .ko “fine”, then there is a segfault when nvidia.ko is loaded. In fact, I wrote on the NVIDIA forum that I thought it was a gcc-10.2 issue. Next I found that ALSA was also loading but not responding… I knew, however, that my old modules still loaded in the “new kernel”.

I consider the matter as a whole unresolved: why would the ACL kernel accept non-ACL modules and fail on ACL ones?

That’s right! When I enabled EXT4 with ACL and XATTR both nvidia and alsa stopped working.

I ran “make modules” with #undef CONFIG_FS_POSIX_ACL placed in core/foo.c. Result: offset 0x34 shows up in the ASM dump again (it was byte for byte the same as the working modules).

I’m recompiling as I speak with ACL support NOT in the kernel. rh.

I can do that because I compile my whole OS from scratch (totally built linux distribution, tbld). Not everyone can turn ACL off; Ubuntu, for example, cannot. But Ubuntu isn’t reporting the problem (I assume they have a kernel patch that is not public?).

. diff -dwr /tmp/hex1/core/snd.ko /tmp/hex2/core/snd.ko
. 28c28
. < 00001b0 4451 678b 4434 e389 e381 ffff 000f f741
. ---
. > 00001b0 4451 678b 4444 e389 e381 ffff 000f f741

. objd sound/core/snd.ko
. 0000000000002a23 <snd_ctl_open>:
. 4503c4503
. < 2a46: 8b 7b 34 mov 0x34(%rbx),%edi
. ---
. > 2a46: 8b 7b 44 mov 0x44(%rbx),%edi

sound/core/control.o (objdump is similar to the above)

control.c:
in function:
static int snd_ctl_open(struct inode *inode, struct file *file)

offending line:
card = snd_lookup_minor_data(iminor(inode), SNDRV_DEVICE_TYPE_CONTROL);

/* iminor(inode) reads inode->i_rdev, which sits at offset 0x34 or 0x44 in struct inode, as seen in the dump above */

include/linux/fs.h
static inline unsigned iminor(const struct inode *inode)
{
	return MINOR(inode->i_rdev);
}
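
The 0x34 vs 0x44 difference is 0x10, which is the size of two pointers on x86-64. As far as I remember, struct inode carries two pointers (i_acl and i_default_acl) under #ifdef CONFIG_FS_POSIX_ACL ahead of i_rdev, so a module built with ACL enabled would look for i_rdev 16 bytes further in. Here is a rough sketch of that effect. The two structs are simplified stand-ins written from memory, not copies of the kernel's struct inode, and they assume CONFIG_SECURITY is off.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins, NOT the real struct inode. Field names follow the
 * kernel loosely; the types are simplified (void * for all pointers). */
struct toy_inode_noacl {
    unsigned short i_mode;
    unsigned short i_opflags;
    unsigned int   i_uid;
    unsigned int   i_gid;
    unsigned int   i_flags;
    const void    *i_op;
    void          *i_sb;
    void          *i_mapping;
    unsigned long  i_ino;
    unsigned int   i_nlink;
    unsigned int   i_rdev;
};

struct toy_inode_acl {
    unsigned short i_mode;
    unsigned short i_opflags;
    unsigned int   i_uid;
    unsigned int   i_gid;
    unsigned int   i_flags;
    void          *i_acl;         /* only with CONFIG_FS_POSIX_ACL */
    void          *i_default_acl; /* only with CONFIG_FS_POSIX_ACL */
    const void    *i_op;
    void          *i_sb;
    void          *i_mapping;
    unsigned long  i_ino;
    unsigned int   i_nlink;
    unsigned int   i_rdev;
};

/* How far i_rdev moves when the two ACL pointers are compiled in. */
size_t rdev_shift(void)
{
    return offsetof(struct toy_inode_acl, i_rdev)
         - offsetof(struct toy_inode_noacl, i_rdev);
}
```

On a 64-bit Linux build I would expect the two offsets in this toy to come out as 0x34 and 0x44, with rdev_shift() returning 16. If that matches the real layout, a module and a kernel built with different ACL settings disagree about where fields after the #ifdef live, which is why mixing them is fragile even when the module loads.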


Remember, I said the old 0x34 modules work in both the ACL and the non-ACL booted kernel. While my nvidia is working well now, I am confused.

ADVANCED: this is NOT part of any solution, just investigation.

./linux/kdev_t.h:#define MINOR(dev) ((unsigned int) ((dev) & MINORMASK))

./uapi/linux/kdev_t.h:#define MINOR(dev) ((dev) & 0xff)

#define MINORMASK ((1U << MINORBITS) - 1)
#define MINORBITS 20

(I’m still hunting for 1U)
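
On the 1U question: it is not a macro defined anywhere, it is just the C integer literal 1 with the unsigned suffix, so (1U << MINORBITS) - 1 is a mask with the low 20 bits set. A small userspace sketch of the in-kernel macros follows. The macro bodies are written as I remember them from linux/kdev_t.h; minor_of() and major_of() are hypothetical helper names added only for this example.

```c
#include <assert.h>

#define MINORBITS 20
/* 1U is the unsigned int literal 1, not a kernel macro. */
#define MINORMASK ((1U << MINORBITS) - 1)

#define MAJOR(dev)    ((unsigned int) ((dev) >> MINORBITS))
#define MINOR(dev)    ((unsigned int) ((dev) & MINORMASK))
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))

static_assert(MINORMASK == 0xFFFFFu, "low 20 bits set");

/* Hypothetical wrappers so the macros can be called like functions. */
unsigned int minor_of(unsigned int dev) { return MINOR(dev); }
unsigned int major_of(unsigned int dev) { return MAJOR(dev); }
```

For example, minor_of(MKDEV(116u, 0x34u)) gives back 0x34 and major_of() gives back 116. The 0x34 and 0x44 in the objdump output are a different thing: they are byte offsets of i_rdev inside struct inode, not minor numbers.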

The above, to me, looks “like a hack, but OK” if, and only if, ACL weren’t hacked in above it, possibly around offset 20, bumping rdev down. BUT IT IS, and SECURITY is not right after it (orphans are left in between, too). My idea is that ACL and SECURITY should be at the end of the inode, or even in a separate struct, so that these things aren’t a problem. Perhaps someone knew that. (The fact is that there are existing modules that should still work; they should have known that.) Another fact is that .h headers often have to use “likely hacks”, since kernel headers don’t publish “all and everything”. That’s all opinion.