Kernel softlock with driver 535.54.03 when closing Vulkan apps

Hi,

I recently installed driver version 535.54.03 on Ubuntu 20.04 (kernel 5.4.0-153-generic). Since then, my system occasionally softlocks when closing Vulkan apps (i.e., the whole system becomes unresponsive). This is not consistently reproducible and happens quite rarely, but still regularly enough to be annoying to deal with.

According to the call trace in kern.log below, the softlock happens somewhere in the NVIDIA driver.

Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553730] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [Correrender:61157]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553732] Modules linked in: dm_crypt rfcomm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp vboxnetadp(OE) vboxnetflt(OE) ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter bridge stp llc vboxdrv(OE) aufs cmac algif_hash algif_skcipher af_alg bnep overlay nvidia_uvm(OE) nvidia_drm(POE) nvidia_modeset(POE) snd_hda_codec_hdmi nvidia(POE) nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio iwlmvm mac80211 snd_hda_intel uvcvideo snd_intel_dspcfg snd_hda_codec libarc4 videobuf2_vmalloc snd_hda_core edac_mce_amd videobuf2_memops snd_usb_audio videobuf2_v4l2 videobuf2_common kvm_amd snd_seq_midi videodev snd_usbmidi_lib joydev kvm snd_hwdep snd_seq_midi_event mc btusb snd_rawmidi btrtl input_leds btbcm crct10dif_pclmul btintel ghash_clmulni_intel binfmt_misc snd_pcm snd_seq bluetooth iwlwifi snd_seq_device snd_timer
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553757]  aesni_intel ecdh_generic crypto_simd drm_kms_helper snd ecc cryptd fb_sys_fops cfg80211 wmi_bmof k10temp soundcore ccp glue_helper syscopyarea sysfillrect sysimgblt mac_hid sch_fq_codel msr parport_pc ppdev lp parport ramoops drm reed_solomon efi_pstore ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_logitech_hidpp hid_logitech_dj hid_generic igb usbhid uas usb_storage hid crc32_pclmul i2c_algo_bit i2c_piix4 nvme dca ahci nvme_core libahci wmi gpio_amdpt gpio_generic
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553777] CPU: 6 PID: 61157 Comm: Correrender Tainted: P           OE     5.4.0-153-generic #170-Ubuntu
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553778] Hardware name: Micro-Star International Co., Ltd. MS-7B85/B450 GAMING PRO CARBON AC (MS-7B85), BIOS 1.F6 09/30/2021
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553911] RIP: 0010:_nv039537rm+0x3b/0x80 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553912] Code: d3 89 de 48 8d 55 0f c6 45 0f 00 e8 3f 4c 60 ff 80 7d 0f 00 41 89 c4 75 11 41 39 5d 10 76 20 49 8b 45 00 c1 eb 02 44 8b 24 98 <5b> 44 89 e0 41 5c 41 5d 48 83 c5 10 c3 0f 1f 84 00 00 00 00 00 be
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553913] RSP: 0018:ffffb8a6813b78c0 EFLAGS: 00200216 ORIG_RAX: ffffffffffffff13
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553914] RAX: ffffb8a691000000 RBX: 00000000002e0405 RCX: 0000000000b81014
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553914] RDX: ffff9c042b7f289f RSI: 0000000000b81014 RDI: ffff9c09d9db8008
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553915] RBP: ffff9c042b7f2890 R08: 0000000000000020 R09: 0000000000000000
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553915] R10: 0000000000b81014 R11: ffff9c042b7f29c8 R12: 0000000000000002
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553915] R13: ffff9c09d9db8bc8 R14: 0000000000000000 R15: 0000000000000000
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553916] FS:  00007fffa90c9000(0000) GS:ffff9c09fe980000(0000) knlGS:0000000000000000
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553916] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553917] CR2: 00007ff768000010 CR3: 0000000e8ed0c000 CR4: 0000000000340ee0
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.553917] Call Trace:
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.554089]  ? _nv013076rm+0x10f/0x170 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.554258]  ? _nv030427rm+0xb8/0xe0 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.554426]  ? _nv030452rm+0xa0/0x2d0 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.554593]  ? _nv030453rm+0x5b/0x1d0 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.554758]  ? _nv030454rm+0x2d/0x110 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.554924]  ? _nv030546rm+0x13f/0x340 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.555089]  ? _nv030547rm+0x50/0x60 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.555245]  ? _nv013174rm+0x86/0xc0 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.555400]  ? _nv013170rm+0x3a4/0x400 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.555553]  ? _nv044237rm+0xd1/0x1b0 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.555720]  ? _nv041109rm+0x1e7/0x370 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.555819]  ? _nv048377rm+0x40/0x95 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.555972]  ? _nv035020rm+0x14d/0x2e0 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.556071]  ? _nv048374rm+0xc5/0x460 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.556196]  ? _nv002711rm+0xd/0x20 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.556320]  ? _nv004074rm+0x19/0xb0 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.556446]  ? _nv016053rm+0x51c/0x620 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.556546]  ? _nv043216rm+0xab/0xe0 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.556671]  ? _nv044933rm+0xac/0x130 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.556795]  ? _nv044932rm+0x3e5/0x690 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.556896]  ? _nv043119rm+0xd5/0x160 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.557004]  ? _nv043120rm+0x41/0x70 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.557105]  ? _nv000566rm+0x4d/0x60 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.557222]  ? _nv000714rm+0x1b7/0xe70 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.557337]  ? rm_ioctl+0x58/0xb0 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.557426]  ? nvidia_ioctl+0x6f0/0x850 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.557429]  ? get_max_files+0x20/0x20
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.557518]  ? nvidia_frontend_unlocked_ioctl+0x3b/0x50 [nvidia]
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.557520]  ? do_vfs_ioctl+0x407/0x670
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.557521]  ? ksys_ioctl+0x67/0x90
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.557522]  ? __x64_sys_ioctl+0x1a/0x20
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.557524]  ? do_syscall_64+0x57/0x190
Jul  5 15:10:12 christoph-MS-7B85 kernel: [21184.557526]  ? entry_SYSCALL_64_after_hwframe+0x5c/0xc1

I have also attached nvidia-bug-report.log.gz, which I generated after restarting the system following the last crash, but it does not seem to contain useful information. The name ‘Correrender’, which appears in kern.log, is the name of the Vulkan application.

nvidia-bug-report.log.gz (385.3 KB)

@chrismile
Thanks for reporting this issue to us. We have filed bug 4175666 internally for tracking purposes.
Could you please share the exact repro steps so that I can try to recreate the issue locally?

Unfortunately, I’m also not able to consistently trigger the kernel softlock. There are a few things I found out about this problem:

  • The driver kernel module only ever softlocks at program termination.
  • The softlock is not deterministic (i.e., it does not happen every time the program is executed).
  • When I add debug printing to a file (with flushing to make sure nothing written to the file is lost due to the kernel softlock), it seems like the NVIDIA driver kernel module always softlocks during a call to vkFreeMemory called by the Vulkan Memory Allocator (VMA) library.
  • This problem never occurred on driver version 530.41.03, which I ran over the past months. It also never happened on any version prior to that.
  • I did not manage to reproduce it with simpler Vulkan programs, but only with an application created by myself. I’m not even sure if this is related, but the application uses Vulkan-OpenGL interop (an offscreen OpenGL context is created with EGL as outlined in https://developer.nvidia.com/blog/egl-eye-opengl-visualization-without-x-server/, and buffers, images and semaphores are shared between Vulkan and OpenGL via VK_KHR_external_memory_fd and VK_KHR_external_semaphore_fd).
  • The application runs perfectly fine on an Intel GPU and an AMD GPU.

Unfortunately, it is not easy to share the application for reproduction. It is data visualization software, and I fear it is far too big to serve as a minimal test case. It is also really hard to create a minimal application that triggers the kernel softlock. On the one hand, the softlock is not deterministic, so it is hard to say whether its absence in a simpler program is just a matter of luck or of fewer memory allocations. On the other hand, each time it is triggered, the whole kernel softlocks and the system needs to be restarted, which complicates debugging. I hope the information above, together with the kernel call trace, helps to track down the source of the error in the driver. If not, please feel free to tell me.

I ran the application with Valgrind memcheck. The output in the image below (when the program didn’t softlock the whole system) definitely did not exist when I ran the application with an older driver version, but of course this does not mean it is necessarily related to the problem.

Unfortunately, I have not been able to create a minimal test application for the kernel soft lockup. It seems to disappear when I disable all Vulkan-OpenGL interop features of my application. On the other hand, the crash appears in a call to vkFreeMemory that is not associated with Vulkan-OpenGL interop. And as the crash isn’t deterministic to begin with, I cannot say whether this actually fixed it or just changed the memory layout in such a way that the bug is no longer triggered. The malloc call with a large negative size value in the NVIDIA driver seems to remain, but I wasn’t able to pinpoint what exactly causes it. It is not present in a simple test program like vkcubes.

The problem should not be related to Vulkan-OpenGL interop. I tried using Zink (see the Zink page in the Mesa 3D Graphics Library documentation) as a replacement for the NVIDIA OpenGL driver, and the kernel module softlock still occasionally happens.

Hi, I’m having the same issue. Is this a driver bug?