Linux SLI and NVlink

Hi All,
I am trying to setup SLI for two TITAN Xp cards on a Debian 10 buster Gnome X11 (with kernel+driver backports applied) system according to
https://download.nvidia.com/XFree86/Linux-x86_64/435.17/README/sli.html

The system boots and I can log in normally. However, after logging in, the screen goes black and will eventually go to sleep (I can still ssh into the system).

The system works normally if I force Option “SLI” “Off” in the “Screen” section of /etc/X11/xorg.conf.
Any inputs on configuring the system correctly for SLI would be greatly appreciated.

Please don’t do this. SLI is broken and useless on Linux. For further info, please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post. You will have to rename the file ending to something else since the forum software doesn’t accept .gz files (nifty!).


Thank you for the information, I’ll stop experimenting with SLI.

The reason I tried this (and bought an Nvidia HB SLI bridge) was to evaluate card stacking for machine learning on Linux. I am well aware that SLI is not used for computation but NVLink is; I just did not have RTX cards at hand.

Is there any reason to believe that NVLink, e.g. between two Titan RTX cards, will behave differently?

Many thanks.

And has been since the feature was introduced, unfortunately. It never worked right. You might want to remove the Nvidia HB SLI bridge if it’s connected. You can use multiple cards without it. Using DeepStream, for example, you tell an nvinfer element to use a specific device. Likewise with training frameworks.
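As a rough sketch of what per-device selection looks like in DeepStream (the pipeline details and config file name are placeholders, not a tested setup; `gpu-id` is a property of the nvinfer element):

```shell
# Hypothetical minimal DeepStream pipeline; nvinfer's gpu-id property pins
# inference to the second GPU. config_infer_primary.txt is a placeholder
# for your actual inference config file.
gst-launch-1.0 videotestsrc ! nvvideoconvert ! 'video/x-raw(memory:NVMM)' ! \
  mux.sink_0 nvstreammux name=mux batch-size=1 width=640 height=480 ! \
  nvinfer config-file-path=config_infer_primary.txt gpu-id=1 ! fakesink
```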

NVlink will greatly improve inter-gpu transfer-rates, which is great for cuda; it’s just SLI graphics that is pointless.
Put simply: great for machine learning, shipwrecked if you try to use it for the desktop.


@mdegans that is how I have been using the cards so far. However, I am increasingly running into memory limitations, hence the idea to experiment with GPU stacking.

Just to be sure: does GeForce NVLink (like the one found on the RTX 2080 Ti or Titan RTX) rely on a sufficiently different driver subsystem (not sure what to call it) that it can be used for compute on Linux?

Out of technical interest, are there benefits/drawbacks to having an SLI bridge connected when SLI is deactivated in the driver?

are there benefits/drawbacks to having an SLI bridge connected when SLI is deactivated in the driver

TBH, I don’t remember the details other than that I had to remove it. It worked in Windows but not in Linux is all I remember. The issue might be fixed by now, but if you’re going to use the box for Linux alone, it’s pointless anyway.

does GeForce NVLink (like the one found on the RTX 2080 Ti or Titan RTX) rely on a sufficiently different driver subsystem (not sure what to call it) that it can be used for compute on Linux?

I believe so, but you will want to confirm that with an nvidia rep before shelling out for such cards.

You’ll have to distinguish between usage with Xorg and with cuda. For Xorg, simply disable SLI; it then makes no difference whether the bridge is plugged in or not. For cuda, have it plugged, it’ll be used without any additional config.
Simply put: leave the bridge on and disable SLI in the Xorg config.
-but-
In general, having Xorg run on the same gpu you’re using extensively for cuda is not recommended. In the default config, this can lead to your cuda kernels being killed mid-run, e.g. during larger training jobs. See:
https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x
So the recommendation would be to use a cheap add-on card or the integrated graphics for the desktop and dedicate the heavy gpus to cuda. See the CUDA_VISIBLE_DEVICES environment variable.
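The CUDA_VISIBLE_DEVICES mechanism can be tried from any shell; a minimal sketch (plain `python3` here only echoes what the process sees, a real CUDA program would be filtered the same way):

```shell
# Expose only physical GPUs 1 and 2 to a CUDA process; the CUDA runtime
# renumbers them as devices 0 and 1 inside that process, so GPU 0 stays
# free for the desktop. Any CUDA program launched this way is affected.
CUDA_VISIBLE_DEVICES=1,2 python3 -c 'import os; print(os.environ["CUDA_VISIBLE_DEVICES"])'
# prints "1,2"
```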


Addendum: since you have a specific problem with out-of-memory conditions, I don’t think that nvlink will help in that situation; the gpu memory won’t simply add up. Of course I’m interested in practical experiences, so please report back.

have it plugged, it’ll be used

Are you certain? I use CUDA on multiple GPUs without it and it seems fine. I thought SLI was only for frame sync. Anyway, the machine in question doesn’t even have X11 running anymore.
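One way to check whether the driver actually sees the bridge as a link is via nvidia-smi (these are standard subcommands, assuming a reasonably recent driver):

```shell
# Print the GPU interconnect matrix: bridge/NVLink connections show up
# as NV1, NV2, ..., while plain PCIe paths show as PIX, PHB, or SYS.
nvidia-smi topo -m

# Per-link NVLink status (only meaningful on NVLink-capable cards).
nvidia-smi nvlink --status
```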

You’re mixing up (not too) different things. SLI is basically a marketing term from ancient times for gpu coupling over pci(e) for graphics purposes.
nvlink derived from it and builds on it, but gained independence over the years where cuda is concerned.
Like I said,

NVlink will greatly improve inter-gpu transfer-rates

whether your cuda application will benefit from it depends on the workload.

See this for some interconnect speeds:
https://forums.developer.nvidia.com/t/simplep2p-example-and-multi-gpu-network-training-causes-system-freeze-and-err-in-nvidia-smi/70225/5?u=generix

Yeah, but I don’t think the “Nvidia HB SLI bridge” is NVLink. IIRC that’s for the 10-series and only works for games.

“Nvidia HB SLI bridge” is the name of a physical product establishing an “NVLink”.
“Only works for games”: no.

IIRC, “HB” means high bandwidth, so it’s “NVlink2”

For completeness, why SLI graphics sucks:
https://developer.nvidia.com/explicit-multi-gpu-programming-directx-12

I think you are mistaken, or I have read some bad documentation.

My understanding is that NVLink is a PCI Express alternative, mostly for some exotic architectures. On some very fancy cards they do use it like SLI, on top of and in addition to PCI Express, and this is confusing.

Take a look at those speeds. The two technologies are nothing alike.

Are you running Linux headless or with Wayland?
If it is the latter, is it tricky to set up?