Extremely corrupted graphics _only_ with NVidia driver 515 on RTX 3070 Ti

The problem at hand is such an extreme corruption of graphics on the screen that the system is unusable. I have seen just about every other form of corrupted graphics, but nothing like this.

It started out as a “little glitch that came and went” and devolved into an extreme corruption of graphics that makes the system absolutely unusable because I can’t read any text, or see pretty much anything.

At first, it would manifest mostly when the screen was locked, sometimes when it wasn’t, and affect only the top 10-15% of the screen and a fixed-size square area down-and-right of the mouse cursor:

It got worse day after day, and soon the entire screen was affected. Large areas of the image would turn a single color (blue, magenta, green, etc.) following the content of the screen; see here how the corrupted areas align with the clouds, sky, water, etc. in the pictures:

At this point graphics are corrupted as soon as the login manager starts up, and as soon as I log in even the very simple splash screen is corrupted, and then the desktop is so badly corrupted I can barely see where the cursor is and I can’t really use the terminal. Even if I go to the text-only terminal with Ctrl+Alt+F2 the whole thing is so badly corrupted I can’t read anything.

Here is one photo and two videos: https://photos.app.goo.gl/RrjNDP8Tn7jhaH2S9
The photo and first video (Sept 14) show the lock screen, the second video (Sept 17) shows the login screen right after rebooting with the NVidia driver 515.

The card is an ASUS GeForce TUF RTX 3070 Ti O8G-GAMING (8 GB) purchased in June 2022 to replace an ASUS STRIX GTX1070 8GB purchased in August 2017. The new card had been working quite well until a few weeks ago. The problem started small but then kept getting worse, and I can’t seem to find a workaround.

When the problem started I was running Ubuntu Studio 20.04 with KDE Plasma. After a couple of weeks of trying a few tweaks (e.g. disabling the compositor) and seeing that nothing really helped, I decided to install Ubuntu Studio 22.04. At first it looked perfect, but the problem manifested again on the first reboot with the NVidia driver.

At this point the latest NVidia driver on Ubuntu 22.04 is 515. Other versions available are 510 (2nd last) and 470. Tried going back to 470, thinking the problem started manifesting with a 5xx version, but the graphics are just as corrupted. The only workaround I have right now to be able to use this PC is to go back to the nouveau driver.

The same screen (Dell U3421WE 34" 3440x1440) and everything else works perfectly for long work shifts on a Lenovo laptop running Debian. Both computers share the screen via a StarTech 4k@60 DisplayPort KVM switch (and all cables are DP 1.2 rated 4K@60), I am able to switch from an extremely corrupted display on the PC to a perfectly fine display on the laptop, back and forth just fine. Tried connecting the screen directly to the GPU with a brand new DisplayPort cable but, unsurprisingly, it didn’t make any difference.

So far the problem seems to be in either the GPU (hardware) or the NVidia driver (515) but I’m hoping this is just something gone bad in the system environment that somehow triggers a problem that only reproduces with the above combination. Attaching logs from nvidia-bug-report.sh run after logging in with the corrupted graphics (collected via SSH).

Log from nvidia-bug-report.sh attached now:

nvidia-bug-report-after-login.txt (991.3 KB)

A friend suggested I try the Pop!_OS live USB with the NVidia driver, so I did… same problem. I made yet another video but it’s basically more of the same. I tried to look around the settings with nvidia-settings, but the display corruption got worse so fast that I couldn’t read anything within seconds, and a couple of seconds later it was all big blobs of pink and green flashing all over the display.

Please, anyone, is there anything else I can try, before replacing the card?

You might want to check for a general hardware fault using gpu-burn or cuda-gpumemtest.

Thanks for the pointers, I had no idea about these tools.

So far they haven’t found any errors, but I just got started.

First, I disabled SDDM and rebooted. With SDDM disabled there is no corruption of graphics, and I wonder if this might prevent the tools from finding the problem.

Installed the CUDA toolkit (CUDA Toolkit 11.7 Update 1, from the NVIDIA Developer downloads page) and then built and ran both wilicc’s gpu-burn and ComputationalRadiationPhysics’s cuda_memtest (please let me know if these are not the ones I should be using).

Tried gpu_burn 120 and gpu_burn -d 120 and both said the GPU is “OK”, but this seems a very short test and the example in GitHub is to run it for an hour, so I guess I should try that?
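For anyone repeating these runs with longer durations, here is a minimal Python sketch that only builds the command line. It assumes, based on the two invocations quoted above, that gpu_burn takes the run time in seconds as its last argument and that -d goes before it. The helper name gpu_burn_command and the ./gpu_burn path are made up for illustration and are not part of the tool.

```python
import shlex


def gpu_burn_command(minutes, use_doubles=False, binary="./gpu_burn"):
    """Build the argument list for a gpu_burn run lasting `minutes`.

    Assumption: the run time is passed in seconds as the last argument,
    as in the "gpu_burn 120" invocation quoted in the post.
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    seconds = int(round(minutes * 60))
    args = [binary]
    if use_doubles:
        # Same flag as in "gpu_burn -d 120" from the post.
        args.append("-d")
    args.append(str(seconds))
    return args


if __name__ == "__main__":
    # 2 minutes matches the runs above; 60 matches the one-hour example.
    for minutes in (2, 60):
        print(shlex.join(gpu_burn_command(minutes)))
        print(shlex.join(gpu_burn_command(minutes, use_doubles=True)))
```

The returned list can be handed to subprocess.run() as is; the sketch deliberately does not start the test itself.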

Built cuda_memtest with
-DCMAKE_CUDA_ARCHITECTURES=86 and
-DCMAKE_CUDA_COMPILER:PATH=/usr/local/cuda-11.7/bin/nvcc
and it’s been running for an hour with
./cuda_memtest --stress

and so far nothing. It’s been testing pattern after pattern, and they all pass and finish in about 14.9 seconds:

[09/19/2022 18:56:35][rapture][0]:Test10 with pattern=0x1a46e7680ab51955
[09/19/2022 18:56:49][rapture][0]:Test10 finished in 14.9 seconds
[09/19/2022 18:56:49][rapture][0]:Test10 [Memory stress test]
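When the test runs for hours over SSH, it can be easier to summarize the log than to watch it scroll. The following is a rough Python sketch that pulls the "finished in N seconds" entries out of lines shaped like the three above. The regular expressions are inferred from this excerpt only, and reading the second and third bracketed fields as host name and GPU index is my assumption; other cuda_memtest messages (including error lines, which I have not seen) are simply skipped.

```python
import re
from datetime import datetime

# Excerpt copied from the post; a real run would read the saved log file.
LOG = """\
[09/19/2022 18:56:35][rapture][0]:Test10 with pattern=0x1a46e7680ab51955
[09/19/2022 18:56:49][rapture][0]:Test10 finished in 14.9 seconds
[09/19/2022 18:56:49][rapture][0]:Test10 [Memory stress test]
"""

# Assumed layout: [timestamp][host][gpu index]:message
LINE = re.compile(
    r"^\[(?P<stamp>[^\]]+)\]\[(?P<host>[^\]]+)\]\[(?P<gpu>\d+)\]:(?P<msg>.*)$"
)
FINISHED = re.compile(r"^(?P<test>\S+) finished in (?P<secs>[\d.]+) seconds$")


def finished_tests(text):
    """Return (timestamp, gpu, test name, seconds) for each 'finished' line."""
    results = []
    for raw in text.splitlines():
        line = LINE.match(raw.strip())
        if not line:
            continue  # not in the bracketed format seen in the excerpt
        done = FINISHED.match(line.group("msg"))
        if not done:
            continue  # some other message, e.g. the pattern announcement
        stamp = datetime.strptime(line.group("stamp"), "%m/%d/%Y %H:%M:%S")
        results.append(
            (stamp, int(line.group("gpu")), done.group("test"),
             float(done.group("secs")))
        )
    return results


if __name__ == "__main__":
    for stamp, gpu, test, secs in finished_tests(LOG):
        print(f"{stamp:%H:%M:%S} GPU {gpu}: {test} took {secs} s")
```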

I’ve no idea how long this is supposed to take, but I think the help says it runs 1,000 iterations, and each of them takes about 15 s, so this would take a bit over 4 hours to finish.
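The estimate above can be checked with a quick calculation. This small Python sketch multiplies the iteration count by the per-iteration time from the log; both inputs are the figures from this post (1,000 iterations as I read the help text, 14.9 s per pattern), not values confirmed from the cuda_memtest source.

```python
ITERATIONS = 1000              # iteration count as read from the help text
SECONDS_PER_ITERATION = 14.9   # per-pattern time seen in the log


def estimated_runtime(iterations, seconds_each):
    """Return (total seconds, whole hours, remaining minutes)."""
    total = iterations * seconds_each
    hours, remainder = divmod(total, 3600)
    return total, int(hours), remainder / 60


if __name__ == "__main__":
    total, hours, minutes = estimated_runtime(ITERATIONS, SECONDS_PER_ITERATION)
    print(f"{total:.0f} s is about {hours} h {minutes:.0f} min")
```

With these inputs the total comes to roughly 4 h 8 min, which agrees with "a bit over 4 hours".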

If both tests find no errors, what can I try next?

I’m planning to try again with SDDM enabled while graphics are corrupted, other than that I don’t know what else to try.

Thanks!

Usually, 10 minutes of gpu-burn should be enough to find any damage. It might also be the connector or the PLL that’s broken, which can’t really be tested. Did you already try a different output on the GPU?

OK, I’ll try gpu-burn for 10, maybe even 20 min… after cuda_memtest finishes? It’s been going for 3 hours without errors, I hope it stops at some point soonish.

Already tried 2 DisplayPort outputs, with a brand new DP 1.2 cable connected directly to the screen (more details in the initial post). Just realized there is a 3rd DP output, I’ll try that too. There is also 1 HDMI output, I guess I should try that one too?

After 4.5 hours, cuda_memtest was still running and hadn’t found any errors. I stopped it and re-ran gpu_burn for 20 minutes, and it still said the GPU was “OK”.

However, when I went back to the PC (I was testing via SSH), the TTY console that was fine at the beginning was now badly corrupted.

Enabled SDDM and rebooted, this time straight into a very corrupted login screen, much worse than before: https://photos.app.goo.gl/HMM8pu3eoaD5H4cPA

Ran gpu_burn for 10 minutes and it still said the GPU was “OK”, despite the GPU clearly not working “OK”.

Not sure if this has been caused by cuda_memtest and/or gpu_burn, but the problem is a lot worse now: graphics are corrupted as soon as Grub shows up, while previously the console was totally fine until Xorg started.

Even worse, going back to the nouveau driver makes little difference now, the corrupted graphics just look a little “blockier” and greener, but it’s all unbearable and unusable.

I’m afraid this GPU will never work well again :(

Yes, I think it’s broken in the output part, heating it up made things worse. So you should check warranty.

To close the loop: indeed it was broken, returned it and got a replacement. Hopefully a good one!