Multiple NVIDIA RTX GPUs for CUDA (Arch Linux) with eGPU

I’m running Arch Linux on a laptop (ThinkPad P14s Gen 4) with two internal GPUs, plus a new RTX 3090 connected over Thunderbolt 4 in a Cooler Master EG200 GPU enclosure:

❯ lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Intel Corporation Raptor Lake-P [Iris Xe Graphics] (rev 04)
        Subsystem: Lenovo Raptor Lake-P [Iris Xe Graphics]
        Kernel driver in use: i915
--
03:00.0 3D controller: NVIDIA Corporation GA107GLM [RTX A500 Laptop GPU] (rev a1)
        Subsystem: Lenovo GA107GLM [RTX A500 Laptop GPU]
        Kernel driver in use: nvidia
--
22:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
        Subsystem: Gigabyte Technology Co., Ltd GA102 [GeForce RTX 3090]
        Kernel driver in use: nvidia

The Thunderbolt connection to the RTX 3090 is authorized, as you can see here:

❯ sudo boltctl info c4010000-0070-740e-0362-00168691c921
[sudo] password for aemonge: 
 ● Cooler Master Technology,Inc MasterCase EG200
   ├─ type:          peripheral
   ├─ name:          MasterCase EG200
   ├─ vendor:        Cooler Master Technology,Inc
   ├─ uuid:          c4010000-0070-740e-0362-00168691c921
   ├─ dbus path:     /org/freedesktop/bolt/devices/c4010000_0070_740e_0362_00168691c921
   ├─ generation:    Thunderbolt 3
   ├─ status:        authorized
   │  ├─ domain:     69078780-60ab-fe2a-ffff-ffffffffffff
   │  ├─ parent:     69078780-60ab-fe2a-ffff-ffffffffffff
   │  ├─ syspath:    /sys/devices/pci0000:00/0000:00:0d.2/domain0/0-0/0-1
   │  ├─ rx speed:   40 Gb/s = 2 lanes * 20 Gb/s
   │  ├─ tx speed:   40 Gb/s = 2 lanes * 20 Gb/s
   │  └─ authflags:  boot
   ├─ authorized:    Wed 24 Jan 2024 06:49:10 AM UTC
   ├─ connected:     Wed 24 Jan 2024 06:49:10 AM UTC
   └─ stored:        Tue 23 Jan 2024 03:50:50 PM UTC
      ├─ policy:     iommu
      └─ key:        no

I don’t care about graphics, and I don’t need the RTX 3090 to be loaded by Xorg or the graphical interface. I only want to use it for compute-only workloads, and I have thoroughly followed this Arch wiki page: External GPU - ArchWiki

But given that context, nvidia-smi can’t seem to find the GPU:

❯ nvidia-smi -L
GPU 0: NVIDIA RTX A500 Laptop GPU (UUID: GPU-762410c2-1c0d-ef4a-89ac-91afd926381b)

Nor can a simple Python script, cuda-devics.py:

❯ cat cuda-devics.py
import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print("CUDA is available.")
    # Get the number of CUDA devices
    num_devices = torch.cuda.device_count()
    print(f"Number of CUDA devices: {num_devices}")
    # Get the name of each CUDA device
    for i in range(num_devices):
        print(f"Device {i} name: {torch.cuda.get_device_name(i)}")
else:
    print("CUDA is not available.")
❯ python cuda-devics.py
CUDA is available.
Number of CUDA devices: 1
Device 0 name: NVIDIA RTX A500 Laptop GPU

❯ CUDA_VISIBLE_DEVICES="0,1,2" python cuda-devics.py

CUDA is available.
Number of CUDA devices: 1
Device 0 name: NVIDIA RTX A500 Laptop GPU

I have also tried these three repositories to disable the internal GPUs (the A500 and the Iris Xe), but each attempt ends in a black screen: GitHub - ewagner12/all-ways-egpu: Configure eGPU as primary under Linux Wayland desktops, GitHub - karli-sjoberg/gswitch and GitHub - hertg/egpu-switcher: 🖥🐧 Setup script for eGPUs in Linux (X.Org).

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Hi @generix , this is the result. Thanks in advance

nvidia-bug-report.log.gz (463.3 KB)

RmInitAdapter failed! (0x26:0x56:1482)
Same as this:
https://forums.developer.nvidia.com/t/dont-enter-graphical-interface-after-installing-driver-on-ubuntu20-04/280097/2
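For anyone who wants to check their own logs for the same failure, here is a minimal sketch that scans log text for the message quoted above. The function name `find_rm_init_failures` is made up for illustration, and the sample string is just the error from this thread, not a full kernel log line.

```python
import re

# Matches the failure message quoted in this thread and captures
# whatever appears between the parentheses.
RM_INIT_FAILED = re.compile(r"RmInitAdapter failed!\s*\(([^)]*)\)")


def find_rm_init_failures(log_text):
    """Return the codes of every 'RmInitAdapter failed!' message in log_text."""
    return RM_INIT_FAILED.findall(log_text)


if __name__ == "__main__":
    # Sample text taken from the error quoted in this thread.
    sample = "RmInitAdapter failed! (0x26:0x56:1482)"
    print(find_rm_init_failures(sample))
```

The text to scan could come from the kernel log (for example the output of `dmesg`) or from the unpacked nvidia-bug-report log; which one holds the message on a given system is an assumption to verify.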

The BIOS and firmware are up to date:

❯ sudo fwupdmgr get-updates
[sudo] password for aemonge: 
Devices with no available firmware updates: 
 • UEFI Device Firmware
 • Fingerprint Sensor
 • Integrated Camera
 • ThinkPad Universal ThunderBolt 4 Dock
 • USB3.0 Hub
 • USB4 Retimer
 • VMM6212
Devices with the latest available firmware version:
 • KXG8AZNV1T02 LA KIOXIA
 • ThinkPad Thunderbolt 4 Dock
No updates available

What do you mean by the -open driver version? Do you mean extra/nvidia-open 545.29.06-12?

exactly

Thanks!

That’s one step forward. Now nvidia-smi -L does recognize the GPU!

❯ nvidia-smi -L
GPU 0: NVIDIA RTX A500 Laptop GPU (UUID: GPU-762410c2-1c0d-ef4a-89ac-91afd926381b)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-9560a6c8-9dd9-59e3-70d7-05b9cb6bc495)
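To compare what the driver reports with what a CUDA program sees, the `nvidia-smi -L` listing can be parsed into (index, name, UUID) tuples. This is only a sketch: the helper name `parse_nvidia_smi_list` is hypothetical, and the regular expression assumes the line format shown in the output above.

```python
import re
import shutil
import subprocess

# Assumes lines of the form: GPU <index>: <name> (UUID: GPU-<hex and dashes>)
GPU_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: (GPU-[0-9a-fA-F-]+)\)$")


def parse_nvidia_smi_list(text):
    """Parse `nvidia-smi -L` output into a list of (index, name, uuid) tuples."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
        )
        for index, name, uuid in parse_nvidia_smi_list(result.stdout):
            print(index, name, uuid)
```

If this list is longer than what `torch.cuda.device_count()` reports, the driver sees the card and the restriction is more likely on the CUDA side, for instance an environment variable.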

My Python CUDA script doesn’t, though:

❯ python cuda-devics.py
CUDA is available.
Number of CUDA devices: 1
Device 0 name: NVIDIA GeForce RTX 3090

It only sees the stronger one.

But as far as NVIDIA support goes, I’m really happy!

Thanks a lot @generix !

Is CUDA_VISIBLE_DEVICES set to exclude one gpu?

Oops!

I had in fact set export CUDA_VISIBLE_DEVICES=0 in my ~/.profile. I’ve removed it, and all is golden :)
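A leftover export like that is easy to miss, so here is a small sketch that reports how CUDA_VISIBLE_DEVICES is currently set before any CUDA code runs. The helper name `describe_cuda_visible_devices` is hypothetical; it only reads the environment and does not talk to the driver.

```python
import os


def describe_cuda_visible_devices(env=None):
    """Summarise how CUDA_VISIBLE_DEVICES restricts GPU enumeration."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return "unset: every GPU the driver exposes is visible"
    entries = [entry.strip() for entry in raw.split(",") if entry.strip()]
    if not entries:
        return "set but empty: no GPU is visible to CUDA"
    noun = "entry" if len(entries) == 1 else "entries"
    return f"restricted to {len(entries)} {noun}: {', '.join(entries)}"


if __name__ == "__main__":
    print(describe_cuda_visible_devices())
```

With the old ~/.profile line in place this would print `restricted to 1 entry: 0`, which matches the single device the script was reporting.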

Fixed:

❯ echo $CUDA_VISIBLE_DEVICES
0,1,2

❯ python cuda-devics.py
CUDA is available.
Number of CUDA devices: 2
Device 0 name: NVIDIA GeForce RTX 3090
Device 1 name: NVIDIA RTX A500 Laptop GPU
  id  load    free memory    used memory    total memory    temperature
----  ------  -------------  -------------  --------------  -------------
   0  0.0%    371.0MB        3325.0MB       4096.0MB        56.0C
   1  0.0%    4583.0MB       19464.0MB      24576.0MB       40.0C
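
Note that the script lists the RTX 3090 as device 0, while the table (like nvidia-smi -L earlier) lists the 4 GB card as id 0. This is consistent with the CUDA runtime's default "fastest first" enumeration versus PCI bus order. If matching indices matter, a sketch like the following sets CUDA_DEVICE_ORDER=PCI_BUS_ID before CUDA is initialised. The helper name `use_pci_bus_order` is made up for illustration, and I have not verified the result on this exact setup.

```python
import os


def use_pci_bus_order(env=None):
    """Ask CUDA to enumerate devices in PCI bus order instead of fastest first."""
    env = os.environ if env is None else env
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    return env


if __name__ == "__main__":
    # Must happen before the first CUDA call in the process; setting it
    # before importing torch is the safe option.
    use_pci_bus_order()
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; nothing to enumerate.")
    else:
        for i in range(torch.cuda.device_count()):
            print(f"Device {i} name: {torch.cuda.get_device_name(i)}")
```

The same effect can be had by exporting CUDA_DEVICE_ORDER=PCI_BUS_ID in the shell, in which case CUDA_VISIBLE_DEVICES indices should also follow the PCI ordering.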