My laptop is an MSI Vector GP76 12UGS with a 12th-gen Intel Core CPU.
My configuration is Optimus with an Intel iGPU plus an NVIDIA dGPU; the NVIDIA card is a GeForce RTX 3070 Ti.
Generally, everything works well. But when I stop using the NVIDIA dGPU, meaning all programs using it are stopped, including services like nvidia-powerd and nvidia-persistenced, the device falls off the bus after a few minutes.
Note that Linux itself does not detect the problem; I only notice it when I run something like nvidia-smi and get a “no such device” error. Removing the device and rescanning the bus goes through, but the problem persists and the device is still unavailable. Switching ASPM on or off in the BIOS doesn’t help, and neither does setting /sys/bus/pci/devices/…/power/control to on.
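For anyone wanting to repeat the two sysfs workarounds mentioned above, here is a minimal sketch. The PCI address `0000:01:00.0` and the helper names are assumptions for illustration only (the real address is whatever lspci reports for the dGPU), and on a real system the writes need root.

```python
from pathlib import Path

# Hypothetical PCI address of the dGPU; substitute the one lspci reports.
GPU_BDF = "0000:01:00.0"


def set_runtime_pm(bdf: str, mode: str, sysfs: Path = Path("/sys")) -> None:
    """Write "on" (keep the device powered) or "auto" (allow runtime suspend)."""
    if mode not in ("on", "auto"):
        raise ValueError("mode must be 'on' or 'auto'")
    (sysfs / "bus/pci/devices" / bdf / "power/control").write_text(mode)


def remove_and_rescan(bdf: str, sysfs: Path = Path("/sys")) -> bool:
    """Detach the device, rescan the PCI bus, and report whether it reappeared."""
    dev = sysfs / "bus/pci/devices" / bdf
    (dev / "remove").write_text("1")
    (sysfs / "bus/pci/rescan").write_text("1")
    return dev.exists()
```

Calling `set_runtime_pm(GPU_BDF, "on")` followed by `remove_and_rescan(GPU_BDF)` corresponds to what was tried above. The device node reappearing only means it was enumerated again, not that the card is usable.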
My use case is both using the card on the host and passing it through to a virtual machine. The problem happens when the device is idle, both when it is bound to the nvidia driver but not in use and when the nvidia driver is unloaded. Probably the same would happen if I passed it through and then didn’t install drivers on the VM side fast enough.
The BIOS was updated today and nothing changed. To be clear, the bug was present before the update, but all the tests I mentioned in this post were done today, after updating the BIOS to the latest version.
Yes, you should capture the bug report after hitting the issue. I would set up an SSH server, so if the display becomes unresponsive, you can connect remotely to run the bug report and gracefully reboot.
The bug report log I gave you is definitely useful for things like gathering my system information.
I can reproduce the issue, but trying to capture a bug report after it reproduces causes… I don’t know what exactly, but it is definitely not just a display server crash: basically the whole system, including SSH, becomes unusable. I have sometimes seen the machine get force-reset, but I’m unsure whether that is caused by the watchdog or something else.
The only thing I can probably provide is the kernel log from when nvidia-smi is run and reports “no devices found”.
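Since a full bug report cannot be captured here, one option is to trim the kernel log down to the lines that concern the GPU and its PCIe link. The sketch below is only an illustration: the keyword list is an assumption about what is relevant, not something the driver documents, and `gpu_related` is a made-up helper name.

```python
import re

# Keywords are an assumption: NVRM and Xid appear in messages from the
# NVIDIA kernel module, the rest cover generic PCIe port/error messages.
PATTERN = re.compile(
    r"NVRM|Xid|nvidia|pcieport|\bAER\b|fallen off the bus",
    re.IGNORECASE,
)


def gpu_related(lines):
    """Return only the kernel log lines that mention the GPU or its PCIe link."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]
```

It can be fed the output of `dmesg` or `journalctl -k`, for example by passing `sys.stdin` as `lines` in a small wrapper script.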
Note that it doesn’t matter whether the GUI is running during the capture.
I did once encounter a situation where nothing happened to the card even after an extended period of time. This was when I turned off the GUI before trying to reproduce the issue, and it then didn’t reproduce even after I turned the GUI back on. It could be a coincidence, though.
Any clue? I switched to the open drivers hoping that would fix the issue, but it didn’t. Interestingly, on the first run it seemed not to happen, and I don’t know what was different. The only thing I know is that the card falls off only when nothing has held it open for some time, and that it also happens when the drivers are unloaded. Sometimes it even happens at system boot, but in that case I usually get a full system freeze, probably a kernel panic, and a reboot.
First, the problem happens when the nvidia driver is not loaded, when it is loaded but not bound, and when the card is idle with all processes using it closed. It also happens before first access, if I never load the nvidia driver and then load it after some time, say 30 minutes.
I tried repeatedly running “lspci -vv” and looking for the card, and that works, even when the card is not in use and even after some time. It is only when I try to use the card that it really falls off the bus, and from then on lspci -vv can’t read it anymore.
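The same check can be scripted by reading the vendor ID straight from the device's config space in sysfs; a read of all ones (0xFFFF) is the usual sign that a PCI device is no longer answering. This is a rough sketch, not a diagnostic tool: the function names and the three state labels are made up for illustration, and reading config space may itself wake the device, so it could change the behaviour being observed.

```python
import struct
from pathlib import Path


def read_vendor_id(bdf, sysfs=Path("/sys")):
    """Return the 16-bit vendor ID from config space, or None if it can't be read."""
    cfg = sysfs / "bus/pci/devices" / bdf / "config"
    try:
        with open(cfg, "rb") as f:
            data = f.read(2)  # vendor ID is the first two bytes, little-endian
    except FileNotFoundError:
        return None
    if len(data) < 2:
        return None
    return struct.unpack("<H", data)[0]


def link_state(bdf, sysfs=Path("/sys")):
    """Classify the device as 'missing', 'not responding', or 'responding'."""
    vid = read_vendor_id(bdf, sysfs)
    if vid is None:
        return "missing"          # no readable config file for this address
    if vid == 0xFFFF:
        return "not responding"   # all-ones read: device does not answer
    return "responding"
```

Calling `link_state` with the card's PCI address before and after the first attempt to use the card would show at which point it stops answering. NVIDIA devices report vendor ID 0x10DE while they are reachable.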
The same can be triggered by leaving the card idle without the driver loaded/bound and then running “echo 1 > /sys/bus/pci/…/reset”. The reset seemingly succeeds, but the card falls off the bus.