RTX 3090 / 470 driver / incompatibility with Quadro

Recently I have had the misfortune to find that the 470.57.02 driver on my RHEL 8.4 system with four RTX 3090s is not able to support a Quadro K620 card. This forces me to use one of the 3090s for both video and compute, which is unacceptable due to the unalterable “timeout” function NVIDIA has placed in the 3090s. The “timeout” function kills my HPC computational chemistry and biology programs that use the 3090 cards.

I have been told that the driver will support an additional GeForce card which I could use for video and thus avoid the timeout issue.

  1. Is that statement about an additional GeForce card true?

  2. If yes, what is the lowest level of GeForce card I can use for display purposes while leaving the 3090 cards in the system for compute only?

Thanks for any help.

How does that manifest? Is there an error during installation? The Quadro K620 is based on the Maxwell architecture (specifically, GM107GL), which is still supported by current drivers as best I can tell. I am aware that NVIDIA has sometimes recycled GPU names for GPUs of a different architecture, but according to all sources I consulted, this does not affect the Quadro K620. A Quadro K420 or a Quadro K600 would be a different story: those are based on the Kepler architecture and have not been supported by modern drivers since late 2019.

The GUI watchdog timer is a function of the operating system. Any GPUs excluded from servicing the GUI (e.g. X) should not be affected by it. The purpose of the watchdog timer is to prevent prolonged freeze-up of the GUI when the GPU is serving long-running compute kernels. Generally speaking the GUI timeout limit is around 2 seconds.

You should be able to configure your system so only the Quadro services the GUI, leaving the RTX 3090s for compute work.
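A minimal sketch of that arrangement in /etc/X11/xorg.conf: the Device section names only the Quadro by its PCI BusID, so X (and therefore the watchdog) never binds to the 3090s. The BusID value below is hypothetical; the real one comes from `lspci | grep -i nvidia` (note that lspci reports the bus address in hex, while xorg.conf expects decimal).

```
Section "Device"
    Identifier "QuadroK620"
    Driver     "nvidia"
    # Hypothetical address -- substitute the K620's actual bus location
    # from lspci, converted to decimal.
    BusID      "PCI:5:0:0"
EndSection

Section "Screen"
    Identifier "Screen0"
    Device     "QuadroK620"
EndSection
```

With no Device/Screen sections referring to the 3090s, they remain visible to CUDA but excluded from the GUI.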

Anything with compute capability 5.0 (Maxwell) and up should be suitable at this time, unless support for Maxwell-architecture GPUs has been dropped without my being aware of it, in which case the lowest level would be compute capability 6.0 (Pascal) and up. I am running a Pascal-based GPU with the latest drivers, so those are supported for sure.

Another option, which may not be practical long term, would be to switch the system to run level 3 (no X server) and access it from another system (ssh -X …).

If you switch to runlevel 3, it's still possible to put the character console on the GPU, if desired. Linux directly manipulates the VGA functionality on the GPU in that case, and there is no watchdog on that.

I don’t know what that means. The supported products tab on the 470.57.02 driver page indicates that both K600 (cc3.0) and K620 (cc5.0) are supported.

Thank you all for the replies.

The error manifests as the video always being directed to the first 3090. No matter which system slot I place the K620 in, it never gets video, and the three 3090s in slots after the first 3090 do not provide video either. I have no idea how to configure my system so that only the Quadro services the GUI; specifying the K620 in the xorg.conf file does not fix this.

Thanks. That would be a last resort since I have never done X over an ssh link and am clueless on how to do it.

Thanks. On my RHEL 8.4 system with CUDA Toolkit 11.4 I can find no way to get the video to come out of the K620 port. If it is present, the first 3090 is ALWAYS selected for video. Also, the primary app I use for computational biology (Amber) requires an X environment for preparation of the simulation and analysis of its results.

I stand corrected. I was quite certain that the last driver I was able to install for the Quadro K420 in one of my systems dates to late 2019. Apparently I was wrong about that.

It has been too long since I last configured an RHEL system, but I think the configuration information needs to go into a file.


Forum participants with more recent and more extensive knowledge about this should be able to give relevant pointers.

If I can do it, I’m sure you won’t have any trouble ;o)

Check out the man page for ssh and ssh_config and the option ForwardX11Trusted. If you’re in a safe environment, you can use the “-Y” option instead. Once logged in you can start running any X app on the box.
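For reference, those client-side settings can also be made persistent in ~/.ssh/config instead of passing -X or -Y on every invocation. The host alias and address below are hypothetical:

```
# ~/.ssh/config -- per-host X11 forwarding (hypothetical host alias/address)
Host hpcbox
    HostName 192.168.1.50
    ForwardX11 yes
    # Equivalent to ssh -Y; enable only in a trusted environment:
    ForwardX11Trusted yes
```

After that, `ssh hpcbox` followed by launching any X app on the box should display it locally.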



James Kress Ph.D., President

The KressWorks® Institute

An IRS Approved 501 (c)(3) Charitable, Nonprofit Corporation

“Engineering The Cure” ©

(248) 573-5499

Learn More and Donate At:

Website: http://www.kressworks.org


I’m convinced that this is a configurable setting in whatever X variant RHEL 8.4 happens to be using; the NVIDIA GPU (display) driver obeys X in my experience. If you’re having trouble with this, possibly the easiest way to resolve it would be to use nvidia-settings on Linux. I don’t have the capability to sort out an issue like this for you, but if X configuration is the problem, there is a separate forum with many questions on how to configure X properly.

The only other possibility is that your system BIOS is doing something very bizarre. This would require some spelunking with lspci, but I doubt it is the issue here. It’s not where I would start.

Perhaps. X apps that don’t require OpenGL-accelerated graphics will probably “just work”. Any sort of OpenGL-accelerated graphics display (such as what you might use with VMD, for example) would require additional “plumbing” to work correctly remotely. It’s doable, of course.

It also should be possible to disable the watchdog timer on linux via X configuration. This note gives an example. However as newer Linux variants have adopted newer X managers they may have moved away from the “old school” configuration method via xorg.conf, so that particular method may or may not work on RHEL 8.4. If I wanted to use that, I would first start by using nvidia-settings to write out an X configuration that I understood, then modify that.

Thanks for the reply. I looked at nvidia-settings. That will take some time to master but it deserves a better, in-depth analysis.

Also, you are correct about the accelerated-graphics concerns. I use VMD as one of my apps.

I tried adding

Option "Interactive" "off"

to the Section "Screen" block in /etc/X11/xorg.conf:

Section "Screen"
    Identifier "Screen0"
    Device "Device2"
    Monitor "Monitor0"
    DefaultDepth 24
    Option "Interactive" "off"
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection
How do I confirm it is in effect?

One of the ways would be just to verify that you no longer run into the objection you raised at the beginning of this thread:

That’s admittedly inferential and may not fit your definition of “verify”.

I guess the other thing I would check (which I’m not really sure about) is the output from the deviceQuery sample application on that setup.

When people ask about the presence or absence of a timeout on Linux, this is usually my suggestion. Ignoring the discussion in this thread, I normally direct them to this line of output (for each GPU):

Device 3: "Tesla K20Xm"
  CUDA Driver Version / Runtime Version          11.4 / 11.4
  Run time limit on kernels:                     No

The “Run time limit on kernels” is what I normally use to verify whether the timeout is active, or not. This does depend on X configuration - you can confirm this yourself with a bit of experimentation. A GPU that is made “visible” to X in the xorg.conf (or whatever mechanism may be in effect) will normally have this attribute set to “Yes” whereas a GPU that is not part of the xorg.conf (or similar mechanism) will normally have this attribute set to “No”.
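A quick way to reduce the deviceQuery output to just those lines is a grep over the device name and the run-time-limit attribute. The sketch below runs against canned sample text so it is self-contained; the device names and values are illustrative, and in practice you would pipe `./deviceQuery` itself into the grep:

```shell
# Filter deviceQuery-style output down to the per-GPU watchdog status.
# "$sample" stands in for real `./deviceQuery` output (illustrative values:
# the K620 drives the display, so only it shows the run time limit as Yes).
sample='Device 0: "NVIDIA GeForce RTX 3090"
  Run time limit on kernels:                     No
Device 1: "Quadro K620"
  Run time limit on kernels:                     Yes'
printf '%s\n' "$sample" | grep -E '^Device [0-9]|Run time limit'
```

The same filter applied to the live output gives one name/status pair per GPU, which makes it easy to spot a 3090 that has accidentally been made visible to X.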

What I’ve never verified is how a GPU that is visible to X (and therefore is or can support a display), but has its “Interactive” option in the xorg.conf set to “off” will appear in the deviceQuery output. My sense is that it should have the runtime limit displayed as “No” but I’ve never confirmed this.

Thanks. I brought up the server, went to the CUDA Samples utilities folder, found deviceQuery, and ran it. It reported “Run time limit on kernels: No” for all the 3090 cards.

It appears we are set to go. In addition, I purchased an RTX 3060 card; it will be installed and used only for video. That should avoid the capricious nature of the NVIDIA driver changes.

Thanks for your help.


No evidence for “capricious NVIDIA driver changes” has been provided in this thread. However, there seem to be indications of changes to X configuration mechanisms under Linux.