RTX 3090 / 470 driver / incompatibility with Quadro

Recently I have had the misfortune to find that the 470.57.02 driver on my RHEL 8.4 system with four RTX 3090s is not able to support a Quadro K620 card. This forces me to use one of the 3090s for both video and compute, which is unacceptable due to the unalterable “timeout” function NVIDIA has placed in the 3090s. The “timeout” function kills my HPC computational chemistry and biology programs that use the 3090 cards.

I have been told that the driver will support an additional GeForce card which I could use for video and thus avoid the timeout issue.

  1. Is that statement about an additional GeForce card true?

  2. If yes, what is the lowest level of GeForce card I can use for display purposes while leaving the 3090 cards in the system for compute only?

Thanks for any help.

How does that manifest? Is there an error during installation? The Quadro K620 is based on the Maxwell architecture (specifically, GM107GL), which is still supported by current drivers as best I can tell. I am aware that NVIDIA has sometimes recycled GPU names for GPUs of a different architecture, but according to all sources I consulted, this does not affect the Quadro K620. A Quadro K420 or a Quadro K600 would be a different story: those are based on the Kepler architecture and have not been supported by modern drivers since late 2019.

The GUI watchdog timer is a function of the operating system. Any GPUs excluded from servicing the GUI (e.g. X) should not be affected by it. The purpose of the watchdog timer is to prevent prolonged freeze-up of the GUI when the GPU is serving long-running compute kernels. Generally speaking the GUI timeout limit is around 2 seconds.

You should be able to configure your system so only the Quadro services the GUI, leaving the RTX 3090s for compute work.
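A minimal sketch of that arrangement in /etc/X11/xorg.conf: the Device section names only the Quadro by its PCI BusID, so X (and therefore the watchdog) never binds to the 3090s. The BusID value below is hypothetical; the real one comes from `lspci | grep -i nvidia` (note that lspci reports the bus address in hex, while xorg.conf expects decimal).

```
Section "Device"
    Identifier "QuadroK620"
    Driver     "nvidia"
    # Hypothetical address -- substitute the K620's actual bus location
    # from lspci, converted to decimal.
    BusID      "PCI:5:0:0"
EndSection

Section "Screen"
    Identifier "Screen0"
    Device     "QuadroK620"
EndSection
```

With no Device/Screen sections referring to the 3090s, they remain visible to CUDA but excluded from the GUI.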

Anything with compute capability 5.0 (Maxwell) and up should be suitable at this time, unless support for Maxwell-architecture GPUs has been dropped without my being aware of it, in which case the lowest level would be compute capability 6.0 (Pascal) and up. I am running a Pascal-based GPU with the latest drivers, so those are supported for sure.

Another option, which may not be practical long term, would be to switch the system to run level 3 (no X server) and access it from another system (ssh -X …).

If you switch to runlevel 3, it's still possible to put the character console on the GPU, if desired. Linux directly manipulates the VGA functionality on the GPU in that case, and there is no watchdog on that.

I don’t know what that means. The supported products tab on the 470.57.02 driver page indicates that both K600 (cc3.0) and K620 (cc5.0) are supported.

Thank you all for the replies.

The error manifests as the video always being directed to the first 3090. No matter which system slot I place the K620 in, it never gets video, and the three 3090s in slots after the first 3090 do not provide video either. I have no idea how to configure my system so that only the Quadro services the GUI; specifying the K620 in the xorg.conf file does not fix this.

Thanks. That would be a last resort since I have never done X over an ssh link and am clueless on how to do it.

Thanks. On my RHEL 8.4 system with CUDA Toolkit 11.4 I can find no way to get the video to come out of the K620 port. If it is present, the first 3090 is ALWAYS selected for video. Also, the primary app I use for computational biology (Amber) requires an X environment for preparation of the simulation and analysis of its results.

I stand corrected. I was quite certain that the last driver I was able to install for the Quadro K420 in one of my systems dates to late 2019. Apparently I was wrong about that.

It has been too long since I last configured an RHEL system, but I think the configuration information needs to go into a file.


Forum participants with more recent and more extensive knowledge about this should be able to give relevant pointers.

If I can do it, I’m sure you won’t have any trouble ;o)

Check out the man page for ssh and ssh_config and the option ForwardX11Trusted. If you’re in a safe environment, you can use the “-Y” option instead. Once logged in you can start running any X app on the box.
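For reference, those client-side settings can also be made persistent in ~/.ssh/config instead of passing -X or -Y on every invocation. The host alias and address below are hypothetical:

```
# ~/.ssh/config -- per-host X11 forwarding (hypothetical host alias/address)
Host hpcbox
    HostName 192.168.1.50
    ForwardX11 yes
    # Equivalent to ssh -Y; enable only in a trusted environment:
    ForwardX11Trusted yes
```

After that, `ssh hpcbox` followed by launching any X app on the box should display it locally.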



James Kress Ph.D., President

The KressWorks® Institute

An IRS Approved 501 (c)(3) Charitable, Nonprofit Corporation

“Engineering The Cure” ©

(248) 573-5499

Learn More and Donate At:

Website: http://www.kressworks.org


I’m convinced that this is a configurable setting in whatever X variant RHEL 8.4 happens to be using; the NVIDIA GPU (display) driver obeys X in my experience. If you’re having trouble with this, possibly the easiest way to resolve it would be to use nvidia-settings on Linux. I don’t have the capability to sort out an issue like this for you, but if X configuration is the problem, there is a separate forum with many questions on how to configure X properly.

The only other possibility is that your system BIOS is doing something very bizarre. This would require some spelunking with lspci, but I doubt it is the issue here. It’s not where I would start.

Perhaps. X apps that don’t require OpenGL-accelerated graphics will probably “just work”. Any sort of OpenGL-accelerated graphics display (such as what you might use with VMD, for example) would require additional “plumbing” to work correctly remotely. It’s doable, of course.

It also should be possible to disable the watchdog timer on linux via X configuration. This note gives an example. However as newer Linux variants have adopted newer X managers they may have moved away from the “old school” configuration method via xorg.conf, so that particular method may or may not work on RHEL 8.4. If I wanted to use that, I would first start by using nvidia-settings to write out an X configuration that I understood, then modify that.

Thanks for the reply. I looked at nvidia-settings. That will take some time to master but it deserves a better, in-depth analysis.

Also, you are correct about the accelerated-graphics concerns. I use VMD as one of my apps.

I tried adding

Option "Interactive" "off"

to the Section "Screen" block in /etc/X11/xorg.conf:

Section "Screen"
    Identifier "Screen0"
    Device "Device2"
    Monitor "Monitor0"
    DefaultDepth 24
    Option "Interactive" "off"
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection
How do I confirm it is in effect?

One of the ways would be just to verify that you no longer run into the objection you raised at the beginning of this thread:

That’s admittedly inferential and may not fit your definition of “verify”.

I guess the other thing I would check (which I’m not really sure about) is the output from the deviceQuery sample application on that setup.

When people ask about the presence or absence of a timeout on Linux, this is usually my suggestion. Ignoring the discussion in this thread, I normally direct them to this line of output (for each GPU):

Device 3: "Tesla K20Xm"
  CUDA Driver Version / Runtime Version          11.4 / 11.4
  Run time limit on kernels:                     No

The “Run time limit on kernels” is what I normally use to verify whether the timeout is active, or not. This does depend on X configuration - you can confirm this yourself with a bit of experimentation. A GPU that is made “visible” to X in the xorg.conf (or whatever mechanism may be in effect) will normally have this attribute set to “Yes” whereas a GPU that is not part of the xorg.conf (or similar mechanism) will normally have this attribute set to “No”.
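A quick way to reduce the deviceQuery output to just those lines is a grep over the device name and the run-time-limit attribute. The sketch below runs against canned sample text so it is self-contained; the device names and values are illustrative, and in practice you would pipe `./deviceQuery` itself into the grep:

```shell
# Filter deviceQuery-style output down to the per-GPU watchdog status.
# "$sample" stands in for real `./deviceQuery` output (illustrative values:
# the K620 drives the display, so only it shows the run time limit as Yes).
sample='Device 0: "NVIDIA GeForce RTX 3090"
  Run time limit on kernels:                     No
Device 1: "Quadro K620"
  Run time limit on kernels:                     Yes'
printf '%s\n' "$sample" | grep -E '^Device [0-9]|Run time limit'
```

The same filter applied to the live output gives one name/status pair per GPU, which makes it easy to spot a 3090 that has accidentally been made visible to X.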

What I’ve never verified is how a GPU that is visible to X (and therefore is or can support a display), but has its “Interactive” option in the xorg.conf set to “off” will appear in the deviceQuery output. My sense is that it should have the runtime limit displayed as “No” but I’ve never confirmed this.

Thanks. I brought up the server, went to the CUDA Samples utilities folder, found deviceQuery, and ran it. It reported “Run time limit on kernels: No” for all the 3090 cards.

It appears we are set to go. In addition, I purchased an RTX 3060 card; it will be installed and used only for video. That should avoid the capricious nature of the NVIDIA driver changes.

Thanks for your help.


No evidence for “capricious NVIDIA driver changes” has been provided in this thread. However, there seem to be indications of changes to X configuration mechanisms under Linux.