Series 550 freezes laptop

zullu · April 6, 2024, 7:51am

Same issue on an MSI Katana 15 B13V with Arch Linux, kernel 6.8.2.

Here’s some logs

Here’s the output of modinfo nvidia-uvm:

filename:       /lib/modules/6.8.2-arch2-1/extramodules/nvidia-uvm.ko.xz
version:        550.67
supported:      external
license:        Dual MIT/GPL
srcversion:     E8BAEAF83C32EBD2D30C349
depends:        nvidia
retpoline:      Y
name:           nvidia_uvm
vermagic:       6.8.2-arch2-1 SMP preempt mod_unload 
parm:           uvm_ats_mode:Set to 0 to disable ATS (Address Translation Services). Any other value is ignored. Has no effect unless the platform supports ATS. (int)
parm:           uvm_perf_prefetch_enable:uint
parm:           uvm_perf_prefetch_threshold:uint
parm:           uvm_perf_prefetch_min_faults:uint
parm:           uvm_perf_thrashing_enable:uint
parm:           uvm_perf_thrashing_threshold:uint
parm:           uvm_perf_thrashing_pin_threshold:uint
parm:           uvm_perf_thrashing_lapse_usec:uint
parm:           uvm_perf_thrashing_nap:uint
parm:           uvm_perf_thrashing_epoch:uint
parm:           uvm_perf_thrashing_pin:uint
parm:           uvm_perf_thrashing_max_resets:uint
parm:           uvm_perf_map_remote_on_native_atomics_fault:uint
parm:           uvm_disable_hmm:Force-disable HMM functionality in the UVM driver. Default: false (HMM is enabled if possible). However, even with uvm_disable_hmm=false, HMM will not be enabled if is not supported in this driver build configuration, or if ATS settings conflict with HMM. (bool)
parm:           uvm_perf_migrate_cpu_preunmap_enable:int
parm:           uvm_perf_migrate_cpu_preunmap_block_order:uint
parm:           uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int)
parm:           uvm_perf_pma_batch_nonpinned_order:uint
parm:           uvm_cpu_chunk_allocation_sizes:OR'ed value of all CPU chunk allocation sizes. (uint)
parm:           uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes allocated and freed, 2 = per-allocation origin tracking. (int)
parm:           uvm_force_prefetch_fault_support:uint
parm:           uvm_debug_enable_push_desc:Enable push description tracking (uint)
parm:           uvm_debug_enable_push_acquire_info:Enable push acquire information tracking (uint)
parm:           uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid, sys. (charp)
parm:           uvm_perf_access_counter_mimc_migration_enable:Whether MIMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_momc_migration_enable:Whether MOMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_batch_count:uint
parm:           uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a notification.Valid values: [1, 65535] (uint)
parm:           uvm_perf_reenable_prefetch_faults_lapse_msec:uint
parm:           uvm_perf_fault_batch_count:uint
parm:           uvm_perf_fault_replay_policy:uint
parm:           uvm_perf_fault_replay_update_put_ratio:uint
parm:           uvm_perf_fault_max_batches_per_service:uint
parm:           uvm_perf_fault_max_throttle_per_service:uint
parm:           uvm_perf_fault_coalesce:uint
parm:           uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0. (int)
parm:           uvm_perf_map_remote_on_eviction:int
parm:           uvm_block_cpu_to_cpu_copy_with_ce:Use GPU CEs for CPU-to-CPU migrations. (int)
parm:           uvm_exp_gpu_cache_peermem:Force caching for mappings to peer memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_exp_gpu_cache_sysmem:Force caching for mappings to system memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_downgrade_force_membar_sys:Force all TLB invalidation downgrades to use MEMBAR_SYS (uint)
parm:           uvm_channel_num_gpfifo_entries:uint
parm:           uvm_channel_gpfifo_loc:charp
parm:           uvm_channel_gpput_loc:charp
parm:           uvm_channel_pushbuffer_loc:charp
parm:           uvm_enable_va_space_mm:Set to 0 to disable UVM from using mmu_notifiers to create an association between a UVM VA space and a process. This will also disable pageable memory access via either ATS or HMM. (int)
parm:           uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int)
parm:           uvm_peer_copy:Choose the addressing mode for peer copying, options: phys [default] or virt. Valid for Ampere+ GPUs. (charp)
parm:           uvm_debug_prints:Enable uvm debug prints. (int)
parm:           uvm_enable_builtin_tests:Enable the UVM built-in tests. (This is a security risk) (int)
parm:           uvm_release_asserts:Enable uvm asserts included in release builds. (int)
parm:           uvm_release_asserts_dump_stack:dump_stack() on failed UVM release asserts. (int)
parm:           uvm_release_asserts_set_global_error:Set UVM global fatal error on failed release asserts. (int)
parm:           uvm_conf_computing_channel_iv_rotation_limit:ulong
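
When comparing reports, the only field in that dump that identifies the driver release is the version line. A minimal sketch for pulling it out of modinfo-style output follows; the function name is hypothetical and it simply reads text from stdin:

```shell
# Hypothetical helper: prints the value of the "version:" field from
# modinfo-style output supplied on stdin (e.g. 550.67 for the dump above).
# "srcversion:" and "vermagic:" are ignored because the first field must
# match "version:" exactly.
nvidia_module_version() {
    awk '$1 == "version:" { print $2; exit }'
}

# Example (assumes the nvidia-uvm module is installed):
#   modinfo nvidia-uvm | nvidia_module_version
```

Running modinfo -F version nvidia-uvm should print the same field directly, without any parsing.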

Tarballwalf · April 7, 2024, 8:23am

Here I come once again with another update. I’ve had to continue using 550.67 in order to play games like Forza Horizon 5, as they would straight up crash when using the older 545 driver (I’m just going to ignore the fact that I still get graphical glitches and huge fps dips, just happy that the game actually runs).

Because the new driver causes a kernel panic every time I update, restart or boot, it has actually corrupted the file system, which will soon require a reinstall or btrfs check --repair. That means wasting even more time backing up and setting up the system for the 3rd time in 6 months! (Before I had any NVIDIA stuff, all my systems were rock solid.)

In the meantime, after upgrading to the latest 6.8.4 kernel, my logs have grown into the gigabytes because they are constantly spammed with this error message:

Apr 05 17:15:27 TUF-F15 kernel: [drm:__nv_drm_gem_nvkms_memory_prime_get_sg_table [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Cannot create sg_table for NvKmsKapiMemory 0x000000001bf50278
Apr 05 17:15:27 TUF-F15 kernel: [drm:__nv_drm_gem_nvkms_memory_prime_get_sg_table [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Cannot create sg_table for NvKmsKapiMemory 0x000000001bf50278
Apr 05 17:15:27 TUF-F15 kernel: [drm:__nv_drm_gem_nvkms_memory_prime_get_sg_table [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Cannot create sg_table for NvKmsKapiMemory 0x000000001bf50278
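
To gauge how much of the kernel log this one message accounts for, something like the following could be used. It is only a sketch: the function name is hypothetical, it matches the exact message text quoted above, and the example assumes a systemd journal.

```shell
# Hypothetical helper: counts lines on stdin that contain the nvidia-drm
# sg_table error quoted above. grep -c exits non-zero when nothing
# matches, so "|| true" keeps the function's exit status at 0 while the
# count (possibly 0) is still printed.
count_sg_table_errors() {
    grep -c 'Cannot create sg_table for NvKmsKapiMemory' || true
}

# Example (-k selects kernel messages, -b the current boot):
#   journalctl -k -b | count_sg_table_errors
```

If the journal itself has grown too large, journalctl --disk-usage reports its size and journalctl --vacuum-size=500M trims archived journal files down to roughly that size; the 500M figure is just an example value.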

I think these only happened while actually playing those games. Anyhow, here’s another kernel log and bug report, in hopes that we’ll get a fix before the 555 beta. Waiting until May, and possibly longer since that is only a beta driver and god knows how long a “stable” release will take, with a broken/incomplete system (not being able to upgrade the kernel because of the old drivers) is becoming a big joke.

nvidia-bug-report.log.gz (2.8 MB)
kernel.log (7.1 KB)

eric.esteban28 · April 7, 2024, 6:42pm

I imagine that we can only wait to see if it has been corrected in the next version 555 or whatever.

It has been more than a month since I reported the error (and I had already warned about it in another thread when they released the 550 beta), and there is still no solution.

I think Arch is quite important, but it seems that NVIDIA doesn’t think the same.

bakabo8 · April 8, 2024, 8:13am

Would be nice to get a status update from Nvidia.

mario156090 · April 16, 2024, 12:23am

Nvidia, We’re waiting for you…

0xwojak · April 17, 2024, 1:22pm

In my opinion this should be treated as P1. It affects every laptop with an RTX 20xx, 30xx or 40xx series GPU. It makes laptops unusable.

It often bricks the system, as it crashes completely during “Reloading system manager configuration…”. In many cases the solution is to reinstall the OS (Arch Linux in my case), because dependencies are left messed up by the crash during the system upgrade.

The best solution in my case is Nouveau, which means no CUDA 😡.

The number of reports is a mere fraction of the affected systems. This is not just a crash, it bricks systems and affects every RTX 20xx, 30xx, 40xx laptop. I am begging you, please escalate this issue 🙏🙏🙏

PS possibly related:

Tarballwalf · April 17, 2024, 5:31pm

Checking the UNIX drivers page, it seems that they released a new version (550.76). Upon checking the logs, there’s only one entry regarding driver initializing on RHEL systems. I have my hopes that they silently fixed the issue on other systems. I will come with an update after I jankingly upgrade the driver (after it is available in the arch testing repos).

james289 · April 17, 2024, 8:42pm

It’s bewildering to me that this hasn’t been fixed yet. I understand that in the past nvidia was primarily a gaming company and linux users were just a small niche that didn’t bring in much revenue. But those times have changed.

Nowadays nvidia has a market cap of more than $2 trillion, primarily due to usage of its hardware and software stack within the machine learning space. I would imagine that most of those engineers and researchers are using linux, and certainly most of the nvidia gpus being provisioned in the cloud are running on linux.

This is no longer a case of a small group of niche users asking for some crumbs… nvidia is now alienating their biggest customers.

0xwojak · April 17, 2024, 8:53pm

Could you please tell us whether there has been any progress on this?

In my opinion this should be treated as P1. This is not just a crash, it bricks systems and affects every RTX 20xx, 30xx, 40xx laptop. I am begging you, please escalate this issue 🙏🙏🙏

PS possibly related:

patrick4242 · April 18, 2024, 2:06am

It’s been a month and a half since this issue was reported. Some of us have been apt to point out that NVIDIA should care about its customer base on Linux due to the current prevalence of AI and machine learning on GPUs, but I am starting to suspect that this is exactly why they don’t care about Linux with the RTX series cards. If you are a hobbyist training AI models on a desktop graphics card, NVIDIA doesn’t care about you. They care about their customers who will buy the datacenter-class GPUs like the A100 or L40. These cost close to $10k for a single card.

If NVIDIA provided stellar support for RTX GPUs on Linux, AI startups might be tempted to use these for training their models in a Linux cluster instead of buying the datacenter GPUs or relying on a cloud provider that uses the datacenter GPUs. In addition, it’s likely that NVIDIA’s internal teams with Linux knowledge are largely working on the drivers for datacenter GPUs and not RTX GPUs. It simply brings in more revenue for them to do so.

I’ve been happy with Nouveau since installing it with 0 crashes to date for a 3060 RTX laptop card and plan to continue using it. Who knew that OSS would be more reliable than closed-source? ;)

patrick4242 · April 18, 2024, 2:09am

This is the difficult lesson

Tarballwalf · April 18, 2024, 3:46am

Driver crashed during a kernel update… (still on 550.67)

Tarballwalf · April 18, 2024, 3:51am

oh I cannot wait for Nouveau and NVK to have on par performance with the closed source driver. As soon as those are ready I’m fucking deleting the entire NVIDIA closed source stack. Never have I had a single issue with any Linux systems until I got NVIDIA’s packages on them.

ionen · April 18, 2024, 4:25am

Why not go back to the 535 branch while waiting? The previous production branch is currently still supported and getting security and bugfix releases.

Regressions like this are partly why these branches keep being supported until things settle down. There should be no need to wait for a fix while having your system crash constantly.

Many linux distros typically keep multiple branches available for this purpose (plus the legacy ones for old hardware).

Tarballwalf · April 18, 2024, 5:18am

Forza Horizon 5 only works on 550. Upgrading and downgrading each time I want to play becomes a chore. Plus I’m stuck on the older 6.7.4 (I believe) kernel, which leaves the system incomplete.

0xwojak · April 18, 2024, 1:18pm

If only it were that simple. The truth is, if you want to do any compute on a GPU, especially anything to do with AI, CUDA is where it’s at.

AMD is far, far behind. The industry uses Nvidia. I know geohotz has been trying to write something to bring deep learning on AMD closer to Nvidia. I don’t think it has taken off (yet).

A system-bricking bug like this should never have been released in the first place. I mean, it affects every single laptop released in the last 4 years. Are they doing no QA at all?

SHOULD BE TREATED AS P1 😭🙏

james289 · April 18, 2024, 1:25pm

Has anybody tested 550.76 to see if the issue is fixed?

Saltyming · April 18, 2024, 2:04pm

By the way, the new 535 driver supports linux kernel 6.8. You can at least upgrade the kernel now.

Saltyming · April 18, 2024, 2:35pm

550.76 boots fine here on Arch (after editing the PKGBUILD and building the package manually). Will test whether the issue still occurs…

EDIT: The panic happened, so I’m guessing it’s not fixed.

EDIT2: I also tried the nvidia-open driver and got an interestingly similar kernel panic, but the log seems more readable. I will attach the system journal for people who want to investigate it…

open_kernel.log (867.2 KB)

cubicmobile · April 18, 2024, 4:23pm

I have the same issue on an Acer Nitro AN515-51 laptop with a GTX 1050M. A kernel panic occurred at rb_first or zswap_load a few minutes after I loaded NVIDIA >= 550.54.14. I haven’t tested 550.76 yet.

Currently running Arch Linux with the 6.8.5 kernel and NVIDIA beta 550.40.07 from the AUR (commit 4aa19ade), which works fine. I used 545.29.06 previously, but it didn’t compile with the 6.8 kernel.
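
The version boundary described above (panics from 550.54.14 onward, 550.40.07 fine) can be checked with a small version comparison. This is a rough sketch: the helper names are hypothetical, the threshold is taken only from the report in this post, nothing in the thread establishes an upper bound, and it relies on GNU sort's -V version ordering.

```shell
# Threshold taken from the report above; not an official NVIDIA statement.
FIRST_REPORTED_BAD="550.54.14"

# True when version $1 >= version $2 under GNU sort's version ordering:
# the smaller of the two sorts first, so $1 >= $2 exactly when $2 is first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints whether a driver version falls at or above the reported threshold.
check_driver_version() {
    if version_ge "$1" "$FIRST_REPORTED_BAD"; then
        echo "$1: at or above $FIRST_REPORTED_BAD, reported to panic in this thread"
    else
        echo "$1: below $FIRST_REPORTED_BAD, not reported to panic in this thread"
    fi
}

# Example:
#   check_driver_version "$(modinfo -F version nvidia)"
```

With the versions mentioned in this thread, 550.67 and 550.76 land at or above the threshold, while 550.40.07 and 545.29.06 land below it.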