Communication Issue with NVIDIA Driver on Ubuntu After Kernel Update (RTX 3050 Passthrough)

Hello everyone,

I have an Ubuntu Server 24.04.1 LTS VM running on Proxmox 8.2.7. My MSI GeForce RTX™ 3050 LP 6G OC graphics card is passed through to it as a PCI device (hostpci0) using the Raw Device option with device ID 0000:01:00.0. I’ve enabled the necessary features and checked ROM-Bar and PCI-Express. Secure Boot is disabled on my server’s motherboard.

I installed the driver for my RTX 3050 via the following repository:

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
ubuntu-drivers devices
sudo apt install nvidia-driver-550

After installation, nvidia-smi worked perfectly, and the information was properly displayed.

However, after a kernel update during a system upgrade, I encountered the following error upon reboot:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
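Because the failure appeared right after a kernel update, a reasonable first check is whether an nvidia module exists at all for the kernel that is now running. The following is a minimal sketch, not a definitive diagnosis: the helper name module_built_for is hypothetical, and it assumes modules are installed somewhere under /lib/modules/<kernel release>, which is the usual layout on Ubuntu.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a module file exists for the given kernel.
# Usage: module_built_for <modules_root> <kernel_release> <module_name>
module_built_for() {
    local root="$1" kernel="$2" module="$3"
    # Matches nvidia.ko as well as compressed variants such as nvidia.ko.zst.
    find "$root/$kernel" -type f -name "${module}.ko*" 2>/dev/null | grep -q .
}

# Check the kernel that is currently running.
if module_built_for /lib/modules "$(uname -r)" nvidia; then
    echo "nvidia module found for $(uname -r)"
else
    echo "no nvidia module found for $(uname -r)"
fi
```

If no module is found, running dkms status should show which kernels, if any, the driver was actually built for.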

I tried reinstalling the driver using the package directly from NVIDIA’s website with this command:

sh ./NVIDIA-Linux-x86_64-550.120.run

But I received the following error:

ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.
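The installer message names nouveau as one possible cause. A common way to rule it out is to blacklist nouveau through a modprobe configuration file. The sketch below only generates that file; the helper name write_nouveau_blacklist and the file name blacklist-nouveau.conf are assumptions chosen for illustration, and whether nouveau is really the culprit here is not confirmed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes a modprobe config that keeps nouveau from loading.
# Usage: write_nouveau_blacklist <target_file>
write_nouveau_blacklist() {
    local target="$1"
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$target"
}

# Example (run as root, then rebuild the initramfs and reboot):
#   write_nouveau_blacklist /etc/modprobe.d/blacklist-nouveau.conf
#   update-initramfs -u
```

The initramfs rebuild matters because nouveau can otherwise still be loaded early in boot, before the blacklist file on the root filesystem is read.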

However, the graphics card does appear in the VM via lspci -vv:

01:00.0 VGA compatible controller: NVIDIA Corporation GA107 [GeForce RTX 3050 6GB] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Micro-Star International Co., Ltd. [MSI] GA107 [GeForce RTX 3050 6GB]
        Physical Slot: 0
        Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 10
        Region 0: Memory at f9000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 383800000000 (64-bit, prefetchable) [size=8G]
        Region 3: Memory at 383a00000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at 5000 [size=128]
        Expansion ROM at fa000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s, Width x8 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq+ OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [b4] Vendor Specific Information: Len=14 <?>
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [bb0 v1] Physical Resizable BAR
                BAR 0: current size: 16MB, supported: 16MB
                BAR 1: current size: 8GB, supported: 8GB
                BAR 3: current size: 32MB, supported: 32MB
        Kernel modules: nvidiafb, nouveau
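Two details in the output above seem worth noting: there is no "Kernel driver in use" line, so no driver is bound to the card, and the "Kernel modules" line lists only nvidiafb and nouveau, not nvidia. A minimal sketch for checking this quickly after each attempted fix, assuming lspci -k style output (the helper name bound_driver is hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `lspci -k` style text on stdin and prints the
# bound driver, or "none" when no "Kernel driver in use" line is present.
bound_driver() {
    awk -F': ' '
        /Kernel driver in use:/ { print $2; found = 1; exit }
        END { if (!found) print "none" }
    '
}

# Example, using the device address shown in the output above:
#   lspci -k -s 01:00.0 | bound_driver
```

A result of "nvidia" would mean the driver attached to the card; "none" or "nouveau" would mean it did not.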

I have tried several approaches, including:

sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get remove --purge '^libnvidia-.*'

Reinstalling the Linux headers and drivers:

apt-get install linux-headers-$(uname -r)
apt-get remove --purge xserver-xorg-video-nouveau
apt update
apt install --reinstall nvidia-driver-550

I also tried upgrading the kernel, but nothing seems to solve the issue.

If anyone has encountered a similar issue, I would appreciate your help.

Thanks in advance!
nvidia-bug-report.log.gz (2.8 MB)

And now things have gotten even worse. I’m encountering an error that seems to be related to a problem with the DKMS module during the installation of the NVIDIA 550 driver.

I deleted the crash file, thinking it might be blocking the installation:

sudo rm /var/crash/nvidia-kernel-source-550.0.crash

I also tried to fix broken packages:

sudo apt --fix-broken install
sudo dpkg --configure -a

However, I now get a new error indicating that a patch specified in dkms.conf, “nv-vtophys-explicit-void-cast.patch”, is missing.
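DKMS expects every patch named in a PATCH[n] entry of dkms.conf to exist in the patches subdirectory of the module source tree. A minimal sketch for listing the ones that are absent, assuming the source tree lives under /usr/src (the helper name missing_dkms_patches is hypothetical, and the parsing handles only plain PATCH[n]="file" lines):

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints each patch named in <src>/dkms.conf that is
# not present in <src>/patches. Prints nothing when all patches exist.
# Usage: missing_dkms_patches <module_source_dir>
missing_dkms_patches() {
    local src="$1" name
    # Keep only the value of each PATCH[n]= line and strip surrounding quotes.
    sed -n 's/^PATCH\[[0-9]*]=//p' "$src/dkms.conf" | tr -d "\"'" |
    while IFS= read -r name; do
        [ -f "$src/patches/$name" ] || echo "$name"
    done
}

# Example (the exact directory name depends on the installed driver version):
#   for dir in /usr/src/nvidia-550*; do missing_dkms_patches "$dir"; done
```

If a patch is listed here, the source tree and its dkms.conf probably come from different package versions, which could happen when packages from the PPA and the Ubuntu archive are mixed.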

I then completely removed the driver in order to reinstall it:

sudo apt-get remove --purge nvidia-dkms-550 nvidia-driver-550
sudo apt-get autoremove --purge
sudo apt-get clean

Followed by:

sudo apt-get update
sudo apt-get install nvidia-driver-550

Unfortunately, I’m still facing the same issue.

I’ve removed the PPA repository using the command:

sudo add-apt-repository --remove ppa:graphics-drivers/ppa

After that, I was able to install the driver successfully using:

sudo apt install nvidia-driver-550

However, I’m encountering the same issue with nvidia-smi. The error message is:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Additionally, I see a series of numbers continuously scrolling in the terminal.
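With the package installed but nvidia-smi still failing, the remaining question is whether the nvidia module actually loads and what the kernel log says when it tries. The following is a minimal sketch under those assumptions; both helper names are hypothetical, and the log filter simply matches the strings nvrm, nvidia, and nouveau case-insensitively.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads /proc/modules style text on stdin and succeeds
# if the named module appears in the first column.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Hypothetical helper: keeps only GPU-related lines from a kernel log on stdin.
gpu_log_lines() {
    grep -iE 'nvrm|nvidia|nouveau' || true
}

if [ -r /proc/modules ]; then
    if module_loaded nvidia < /proc/modules; then
        echo "nvidia module is loaded"
    else
        echo "nvidia module is not loaded"
    fi
fi

# Example: try loading the module, then inspect the most recent messages.
#   sudo modprobe nvidia
#   sudo dmesg | gpu_log_lines | tail -n 40
```

The modprobe error and the filtered log lines are usually more specific than the nvidia-smi message and are worth including in a follow-up post.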