[Regression 460 series] Black screen on boot: nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer

@amrits Are there any updates on bug number 3358939 you could share?

Since only 390.157 legacy drivers remain working with Kepler cards on affected systems, for which support was dropped last year, those systems are now essentially unusable. It would be nice to see a fix for this 2.5-year-old regression so that the hardware can still be used. Thanks in advance!

Hi there,
I have had this issue for years, maybe since well before 2021, probably since 2020 when I switched to a new DisplayPort screen. I have had the same hardware since 2015, an MSI GTX970, and I have been on Arch Linux since the beginning. As I write this post I am on nvidia 535.113.01-4, and the issue was fixed for me around June 2023.
I just gave up after this marvelous post:

Since this bug is very hardware specific, it’s rarely ever getting fixed.

So after 2 years of covid (aka the start of the end of the world) and the global-warming end of the world that no one cares about, we might have hope, because Nvidia has fixed a real bug after 2 years. I think you should rethink your entire long-term strategy, because if we want to survive in this world, we seriously need to stop producing new hardware every year and think more about supporting old hardware twice as long as before (or even forever).

So please, I’m 100% behind @olifre: fix the issue for older cards, or at least provide sources for the vbios and drivers so we can do it ourselves.

Just read this post from 2014, it’s hilarious: gtx 970 gaming 4g and legacy bios | MSI Global English Forum

But companies like nvidia and amd don’t care about the past; they go for today or for the future.

Hi Hugo, hi all,

thanks for the kind words! I can only fully support all your statements.
I took the chance to test the latest legacy (Kepler-supporting) driver, 470.223.02, tonight; who knows whether a fix from the 535 line was backported…
As almost expected, the issue still remains:

Nov  4 20:48:24 localhost kernel: [   33.420774] nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT])
Nov  4 20:48:24 localhost kernel: [   33.421354] nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer
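
For anyone who wants to check whether their own kernel log shows the same failure, here is a minimal Python sketch. It assumes the log lines look like the two quoted above; the regular expressions and the function name `find_modeset_errors` are my own illustration, not anything shipped with the driver, and other driver versions may word the messages differently.

```python
import re

# Patterns derived only from the two log lines quoted above (an assumption:
# other driver versions may format these messages differently).
MODESET_ERROR = re.compile(
    r"nvidia-modeset: ERROR: GPU:(?P<gpu>\d+): (?P<message>.*)"
)
# Optional trailing status such as "0x65 (Call timed out [NV_ERR_TIMEOUT])".
STATUS = re.compile(r"0x(?P<code>[0-9a-fA-F]+) \((?P<text>.*)\)\s*$")


def find_modeset_errors(log_text):
    """Return one dict per nvidia-modeset ERROR line found in log_text."""
    results = []
    for line in log_text.splitlines():
        match = MODESET_ERROR.search(line)
        if match is None:
            continue  # not an nvidia-modeset error line
        entry = {
            "gpu": int(match.group("gpu")),
            "message": match.group("message"),
            "status": None,  # stays None when no hex status code is present
        }
        status = STATUS.search(entry["message"])
        if status is not None:
            entry["status"] = int(status.group("code"), 16)
        results.append(entry)
    return results
```

One way to use it would be to save the output of `journalctl -k -b` to a file and pass the file's contents to `find_modeset_errors`; an empty list would mean no matching lines were found for that boot.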

In the unlikely case that someone from nvidia still cares about this long-standing regression, which was introduced shortly after 455.45.01, makes a lot of hardware unusable with nvidia drivers, and was once tracked as 3358939 (not sure whether this issue still exists with nvidia), I’ve also grabbed another nvidia-bug-report:
nvidia-bug-report-47022302.log.gz (1.3 MB)

Hope this helps in case somebody from nvidia comes along,
Oliver

It’s broken again. I think this bug was fixed by pure chance while something else was being fixed in the 535.x.x driver series. But Arch Linux switched to 545.x.x in November; I upgraded my system recently, and it’s broken in 545.

nvidia-535.113.01-4-x86_64 → Working
nvidia-545.29.06-6-x86_64 → Not working

Details about what’s going on with the 545.xxx driver:
After suspend I get a black screen 100% of the time; the driver remains loaded, but X hangs with 100% CPU usage. After killing Xorg, it’s nvidia-sleep.sh resume that hangs with 100% CPU 🤪.
I tried to unload the kernel module, but it is reported as “in use”; nvidia-smi can’t reach the driver and hangs indefinitely without printing anything.
So something is definitely wrong, and now it is not flaky but occurs every time, so suspend is unusable with my setup.
Card : MSI GTX970
Kernel : 6.6.x and I have DRM enabled
Boot : UEFI with secureboot enabled

After the black screen, the host remains 100% usable over SSH, but I need to reboot to get my screen back.

journalctl-suspend-nvidia.log (7.9 KB)