My NVIDIA Quadro RTX 5000 16 GB Graphics Card has failed to run properly during application usage in Ubuntu. The software called Desmond requires an exclusive graphics card to run the application. It worked well for the past two years, but now the application fails during operation. I had the graphics card checked at an authorized service center in Calicut. According to the customer service team, they sent the graphics card to the Mumbai service center, where it underwent a performance check in Windows and was reported as working fine. However, the problem persists. Since the Desmond software only functions in Ubuntu, I contacted customer care, installed the latest driver, and retested it. Unfortunately, the same issue persists during application execution.
Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
nvidia-bug-report.log.gz (413.4 KB)
Can anyone help me out, please?
Driver looks fine, graphics is also working. What error is the application reporting?
The application normally works fine, but when I run the specific module that relies exclusively on the graphics card, the system restarts and the job fails. I used it for almost a year without any errors during runs. Later, the system began restarting occasionally, and then the failures became regular, so I sent the system to the service center. They found that the SSD, monitor, and Deepcool cooling system had failed and replaced those components. At that time they also checked the graphics card and confirmed it was normal. Despite the service, the system still restarts during application runs. I therefore sent the graphics card to the authorized service center again, and they again confirmed that it is normal, but the problem persists. Initially, most of the errors I got were related to the graphics card. I will share those details below.
Specification of the workstation:
- AMD Ryzen 7 5800X (8 cores, up to 4.7 GHz)
- ASUS B550 TUF GAMING + WiFi motherboard
- Corsair 16 GB 3200 MHz DDR4 RAM x 2
- Cooler Master 1050 W 80+ Gold SMPS
- Acer 1 TB PCIe NVMe M.2 SSD
- Dell 24" 75 Hz monitor S2421HN
- Matrix 50 Hybrid Cooling cabinet
- Gammax 120 cooler
- NVIDIA Quadro RTX 5000 (Turing architecture) 16 GB graphics card
Errors I am getting:
1. glxinfo not found
2. qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
Ok, that clarifies it. Spontaneous reboots are only triggered by the mainboard, most often because of a power issue. I’d suspect your PSU is failing and breaks down when the GPU boosts its clocks.
To check, you can temporarily limit gpu clocks by running
sudo nvidia-smi -lgc 300,1200
and then run the specific plugin/function of said application.
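For convenience, the test cycle could be wrapped in a few shell helpers, roughly as sketched below. This is only a sketch assuming a single GPU and the stock nvidia-smi tool; the function names are made up for illustration. The idea is to lock the clocks, watch power draw in a second terminal while the module runs, and remove the limit afterwards.

```shell
# Sketch only: helper names are hypothetical, not part of any NVIDIA tool.

# Lock the GPU core clock to a min,max range in MHz
# (defaults to the 300,1200 range suggested above).
limit_gpu_clocks() {
    sudo nvidia-smi -lgc "${1:-300},${2:-1200}"
}

# Print power draw, SM clock and temperature once per second
# until interrupted with Ctrl+C. Run this in a second terminal.
watch_gpu() {
    nvidia-smi --query-gpu=timestamp,power.draw,clocks.sm,temperature.gpu \
        --format=csv -l 1
}

# Remove the clock limit again once the test is finished.
reset_gpu_clocks() {
    sudo nvidia-smi -rgc
}
```

Call limit_gpu_clocks, start watch_gpu in another terminal, run the module, and call reset_gpu_clocks when done. If the machine reboots, the last lines printed by watch_gpu show roughly what the card was drawing just before the reset.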
This is the message I got. Now I will run the module and check.
GPU clocks set to "(gpuClkMin 300, gpuClkMax 1200)" for GPU 00000000:07:00.0
Warning: persistence mode is disabled on device 00000000:07:00.0. See the Known Issues section of the nvidia-smi(1) man page for more information. Run with [--help | -h] switch to get more information on how to enable persistence mode.
All done.
That’s the normal confirmation message. Now check if your application works.
No, it restarted very soon after the click. Earlier it ran for at least 5 to 30 minutes. During the restart I got this message
Please create a new nvidia-bug-report.log without rebooting.
nvidia-bug-report.log.gz (410.8 KB)
The MCE error logged would rather point to broken RAM. You could pull one module, check if the system stays alive during application run. If not, check the other module.
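To look at the machine-check entries yourself, you could filter the kernel log along these lines. This is a rough sketch assuming a systemd-based Ubuntu install with a persistent journal (needed to read the log of the boot that crashed); show_mce is a hypothetical helper name, and the pattern is deliberately broad, so it may also match unrelated lines.

```shell
# Hypothetical helper: keep only machine-check (MCE) related lines
# from kernel log text passed on stdin.
show_mce() {
    grep -iE 'mce|machine check|hardware error'
}

# Usage (run manually):
#   sudo dmesg | show_mce                 # current boot
#   sudo journalctl -k -b -1 | show_mce   # previous boot, i.e. the one that ended in the reboot
```

If the same kind of entry shows up with either RAM module installed on its own, that would be a hint that the memory is not the only suspect.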
You mean one of the two 16 GB RAM modules?
Correct.
I removed the first module, but it failed. Afterward, I replaced it and removed the second one. Now it has started running. I need to wait for 10-15 hours to complete the run, and I will update the progress.
It failed again. However, with the second module, the application ran for 4.5 hours before restarting (Failure).
Waiting for your reply.
I attempted a second time, and it was successful. I have started another run to test it again.