Using an A6000 2-Slot NVLink bracket on GeForce 3090

I have the following setup running Ubuntu 18.04 and driver version 460.32.03:

  • 1x PNY NVLink Bracket for the RTX A6000 (2-Slot, RTXA6000NVLINK-KIT)
  • 2x Asus GeForce 3090 (Blower Design, TURBO-RTX3090-24G)
  • 1x NVIDIA NVLink Bracket for the older RTX 6000
  • 2x Asus GeForce 2080 Ti (Blower Design)

The two 2080 Ti cards use their NVLink bracket just fine, but the 3090s refuse to acknowledge theirs. I know the bracket is not made for these cards, but I read that it should apparently be compatible, just like with the previous generation.

I get the following output for “nvidia-smi nvlink -c”:

GPU 0: GeForce RTX 3090 (UUID: GPU-26120822-1660-651e-9ec2-9a1462ec5787)
GPU 1: GeForce RTX 3090 (UUID: GPU-37e746db-f366-d879-849d-8af38fe822a0)
GPU 2: GeForce RTX 2080 Ti (UUID: GPU-d05d4e2b-7453-8846-dec6-3970d656bc63)
Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: true
Link 0, Link is supported: false
Link 1, P2P is supported: true
Link 1, Access to system memory supported: true
Link 1, P2P atomics supported: true
Link 1, System memory atomics supported: true
Link 1, SLI is supported: true
Link 1, Link is supported: false
GPU 3: GeForce RTX 2080 Ti (UUID: GPU-df4c7aa9-f529-3ddd-75a8-fdb3eb68da8e)
Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: true
Link 0, Link is supported: false
Link 1, P2P is supported: true
Link 1, Access to system memory supported: true
Link 1, P2P atomics supported: true
Link 1, System memory atomics supported: true
Link 1, SLI is supported: true
Link 1, Link is supported: false
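
In this listing the two 3090s (GPU 0 and GPU 1) have no Link lines at all, while both 2080 Ti cards list Link 0 and Link 1. For anyone who wants to check this on their own machine, here is a minimal Python sketch that parses text in the format shown above. The function names are made up for illustration, the sample is an excerpt of the output above, and the format may differ between driver versions.

```python
import re

# Excerpt of the `nvidia-smi nvlink -c` output from this post.
SAMPLE = """\
GPU 0: GeForce RTX 3090 (UUID: GPU-26120822-1660-651e-9ec2-9a1462ec5787)
GPU 1: GeForce RTX 3090 (UUID: GPU-37e746db-f366-d879-849d-8af38fe822a0)
GPU 2: GeForce RTX 2080 Ti (UUID: GPU-d05d4e2b-7453-8846-dec6-3970d656bc63)
Link 0, P2P is supported: true
Link 1, P2P is supported: true
"""

GPU_RE = re.compile(r"^\s*GPU (\d+): (.+?) \(UUID:")
LINK_RE = re.compile(r"^\s*Link (\d+),")


def links_per_gpu(text):
    """Map 'GPU n: name' to the sorted link indices listed under that GPU."""
    result = {}
    current = None
    for line in text.splitlines():
        gpu = GPU_RE.match(line)
        if gpu:
            current = f"GPU {gpu.group(1)}: {gpu.group(2)}"
            result[current] = set()
            continue
        link = LINK_RE.match(line)
        if link and current is not None:
            result[current].add(int(link.group(1)))
    return {name: sorted(links) for name, links in result.items()}


def gpus_without_links(text):
    """Return the GPUs for which no Link lines were printed."""
    return [name for name, links in links_per_gpu(text).items() if not links]


if __name__ == "__main__":
    for name, links in links_per_gpu(SAMPLE).items():
        print(name, "->", links if links else "no links reported")
```

To use it on a live system, the text could come from running the same command via subprocess instead of the embedded sample.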

The output for “nvidia-smi” is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    On   | 00000000:03:00.0  On |                  N/A |
| 30%   30C    P8    22W / 350W |    193MiB / 24267MiB |     10%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3090    On   | 00000000:04:00.0 Off |                  N/A |
| 30%   29C    P8     8W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  On   | 00000000:81:00.0 Off |                  N/A |
| 30%   35C    P8    15W / 250W |      5MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  On   | 00000000:82:00.0 Off |                  N/A |
| 30%   29C    P8    20W / 250W |      5MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2367      G   /usr/lib/xorg/Xorg                140MiB |
|    0   N/A  N/A      2811      G   /usr/bin/gnome-shell               27MiB |
|    0   N/A  N/A      3047      G   ...mviewer/tv_bin/TeamViewer       17MiB |
|    0   N/A  N/A     18287      G   /usr/lib/firefox/firefox            4MiB |
|    1   N/A  N/A      2367      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2367      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2367      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

Does anyone have similar experiences? Maybe it’s just the bridge being faulty?

EDIT: Here is my bug-report file, as requested:
nvidia-bug-report.log.gz (1.6 MB)

So far there is only one report about using a bridge on 3090s, and it didn’t work:
https://forums.developer.nvidia.com/t/two-dual-geforce-rtx-3090s-and-nvlink-ubuntu-support-at-least-blender-has-support-for-nvlink/160561/2
Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
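
If you prefer to script that step, a rough Python sketch follows. It assumes nvidia-bug-report.sh is on the PATH and that it writes nvidia-bug-report.log.gz into the current directory (its usual default, as far as I know); the helper names are hypothetical.

```python
import os
import shutil
import subprocess

SCRIPT = "nvidia-bug-report.sh"   # installed together with the NVIDIA Linux driver
LOG = "nvidia-bug-report.log.gz"  # assumed default output name in the current directory


def bug_report_command():
    """Build the command line, prefixing sudo when not already running as root."""
    cmd = [SCRIPT]
    if hasattr(os, "geteuid") and os.geteuid() != 0:
        cmd.insert(0, "sudo")
    return cmd


def run_bug_report():
    """Run the script if it is installed; return the log path, or None if unavailable."""
    if shutil.which(SCRIPT) is None:
        return None
    subprocess.run(bug_report_command(), check=True)
    return os.path.abspath(LOG) if os.path.exists(LOG) else None
```

Calling run_bug_report() returns None when the script is not installed, so it is safe to try on a machine without the driver.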

Thanks for the quick response, sorry I didn’t catch the existing post. I attached the log, as you requested.

General advice: please upgrade your BIOS. It’s quite old, and some NVIDIA GPU related fixes have been introduced since. That shouldn’t matter in the NVLink case, though.
Unfortunately, no docs are currently available regarding NVLink on Ampere, so it’s a guessing game.
Does that bridge have a power LED as an indicator that it’s properly seated?
A driver update to the latest 460.39 is also advised.
Beyond that, using EFI boot and enabling “above 4G decoding” in the BIOS come to mind; maybe Ampere uses a different kind of memory pooling with NVLink.

Thanks a lot for all the pointers and advice. I guess I’ll just set up a completely fresh Ubuntu install (and upgrade the BIOS!), see what happens, and report back with another bug report if necessary. “Above 4G decoding” is already enabled and, unfortunately, the PNY bridge doesn’t seem to have any sort of power indicator.

Many BIOSes ignore the “above 4G decoding” setting if the CSM is enabled. According to the logs, no 64-bit resources are announced by your board.
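
A quick way to see how the running system was booted is to look for /sys/firmware/efi, which Linux only exposes after a UEFI boot. Below is a minimal Python sketch (the function name is made up). Note that it only tells you how the OS was started; it does not prove that the CSM is disabled in the firmware setup, and it says nothing about 64-bit resources.

```python
import os


def booted_via_uefi(sysfs_efi_dir="/sys/firmware/efi"):
    """Return True if the running Linux kernel exposes the EFI sysfs directory."""
    return os.path.isdir(sysfs_efi_dir)


if __name__ == "__main__":
    if booted_via_uefi():
        print("Booted via UEFI")
    else:
        print("Legacy (BIOS/CSM) boot, or not a Linux system")
```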

I have now run the bug report script on Ubuntu 20.04 (fresh install) with driver version 460.39. CSM was disabled before installing; the BIOS update hasn’t been done yet. nvidia-smi still gives the same output. Can you spot anything interesting in the log now?

nvidia-bug-report.log.gz (1.5 MB)

Apart from 64-bit resources now being properly enabled by the BIOS, no change.

Bummer. Well, thanks a lot for taking the time to check out the log and also for the advice regarding the setup.

If only there were any successful reports… I am not sure whether I should try to return the bracket in case it is just a faulty one. I guess for now I’ll just do the BIOS upgrade and then wait to see whether more people report their experiences with this.