eGPU graphics driver for CUDA with Isaac Gym

Hi,

I’m trying to get a graphics driver working for my eGPU that supports a CUDA version between 11.3 and 11.5, which I require for my research in Isaac Gym.

My specs:

  • Thinkpad P1 Gen 4 with an RTX A2000 Laptop GPU inside

  • Razer Core X Chroma eGPU enclosure with a GeForce RTX 3080 Ti inside, connected over Thunderbolt 4

  • Ubuntu 20.04.3 LTS (fresh install)

I tried the following:

  • 460 driver: The eGPU is recognised by nvidia-smi, and even attaching an external monitor to it works as intended. However, the driver only comes with CUDA 11.2, which is not compatible with the Isaac Gym version I need to use: it gives an error that the driver is too old.
    I tried the forward compatibility packages for CUDA 11.4 and 11.5, but running programs with them throws an error about trying forward compatibility with incompatible hardware.

  • 470 driver: The eGPU is not recognised by nvidia-smi, and it cannot be used for programs or external monitors (nevertheless it is connected to the system and authorized).

  • 495 driver: Same as for the 470 driver.

  • 510 driver: The eGPU is recognised by nvidia-smi, and even attaching an external monitor to it works as intended, but CUDA 11.6 comes with it. The Isaac Gym version I’m using does not throw an error, but it never properly starts and all I get is a black screen. I suppose it only supports CUDA 11.3 to 11.5, since my colleagues got it to work with these versions on different hardware.

I would be very grateful for your support, as I currently cannot use my setup to continue my research. I am happy to provide any required information or to reinstall anything.

Thank you in advance!

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Thank you for your reply.

I did the following:

Thank you for your support. I am happy to provide more information if required.

No idea why the 470 and 495 drivers are failing to load. I guess Isaac Gym is using CUDA/GL interop, so please check whether it runs properly when you exclude the A2000 by setting
CUDA_VISIBLE_DEVICES=1
or
CUDA_VISIBLE_DEVICES=0
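
For anyone repeating this test, here is a minimal Python sketch of launching a program with only one GPU visible. The helper names are made up for illustration. The optional CUDA_DEVICE_ORDER=PCI_BUS_ID setting is my addition, not part of the advice above; it makes CUDA enumerate devices in PCI bus order instead of its default ordering, which may make it easier to tell which index is which.

```python
import os
import subprocess


def env_with_visible_gpu(index, base_env=None, pci_order=False):
    """Return a copy of the environment that exposes a single CUDA device.

    `index` is the device number to keep (0 or 1 in this thread).
    """
    env = dict(os.environ if base_env is None else base_env)
    # The variable must be set before the CUDA runtime is initialised,
    # so it is placed in the environment of the child process.
    env["CUDA_VISIBLE_DEVICES"] = str(index)
    if pci_order:
        # Optional assumption: enumerate devices by PCI bus ID.
        env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    return env


def run_with_gpu(index, command):
    """Run `command` (a list of arguments) with only GPU `index` visible."""
    return subprocess.run(command, env=env_with_visible_gpu(index), check=False)
```

Something like run_with_gpu(1, ["python", "your_script.py"]) would then start the (hypothetical) script with only device 1 exposed.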

I changed CUDA_VISIBLE_DEVICES while running the 510 driver. I get a black screen in Isaac Gym with either GPU (I verified that the setting took effect by printing the current CUDA GPU’s name via PyTorch), with no further errors except that Isaac Gym is not responding.
I don’t know, though, whether Isaac Gym is compatible with CUDA 11.6, which comes with the 510 driver. I’m only sure about compatibility with CUDA 11.3-11.5.

Furthermore, I tried a borrowed laptop (Thinkpad T480) with the same software and eGPU setup, using the 470 driver with CUDA 11.4, and there it works.

Then maybe check for a BIOS update, since this might as well be a Thunderbolt issue.

The BIOS is on the newest version, and PCIe Tunneling for Thunderbolt 4 is on; when I run on the 460 or 510 drivers the external monitor connected directly to the eGPU works properly.

Is it possible to disable the A2000 in the BIOS?

The only option is to switch between hybrid and discrete graphics. I put it in discrete graphics mode (thereby disabling the internal Intel graphics), and on the 510 driver it is now possible to launch Isaac Gym successfully! (On the earlier drivers it made no difference.)

One issue remains though. The external monitor attached to the eGPU works now, but in discrete graphics mode the laptop monitor stays black with an “x” as a mouse pointer. Strangely, the laptop monitor appears in the NVIDIA settings, which say that the A2000 is now driving the laptop monitor and the 3080 Ti is driving the external monitor. Nevertheless, the laptop monitor does not appear in the Ubuntu settings; only the external monitor does. I attached another bug report, maybe it helps:
nvidia-bug-report.log.gz (604.9 KB)

Thank you for your help already!

You have set up a config with two separate X screens in your xorg.conf. The internal display now has no WM. Try this instead:
delete xorg.conf
set kernel parameter nvidia-drm.modeset=1
reboot
If the second screen doesn’t come alive, run
xrandr --setprovideroutputsink NVIDIA-G0 NVIDIA-0 && xrandr --auto
If that works, the config might be tweaked to have the 3080 as primary.
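
To double-check after the reboot that the kernel parameter was actually applied, a small Python sketch like the following could parse the kernel command line. It only looks at /proc/cmdline; if the option was set through a modprobe configuration file instead, it will not show up there. The function names are hypothetical.

```python
def modeset_enabled(cmdline):
    """Return True if `cmdline` enables nvidia-drm.modeset.

    If the parameter appears more than once, the last occurrence is used.
    """
    enabled = False
    for token in cmdline.split():
        name, _, value = token.partition("=")
        # The kernel treats '-' and '_' in parameter names as equivalent.
        if name.replace("-", "_") == "nvidia_drm.modeset":
            enabled = value in ("1", "y", "Y")
    return enabled


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

Calling modeset_enabled(read_cmdline()) on the booted system then tells you whether the parameter was passed at boot.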


Thank you for your reply!

After deleting the xorg.conf, setting the kernel parameter nvidia-drm.modeset=1, and rebooting, the internal display is now working, but the external display is black with only a white underscore in the upper left corner. Furthermore, the external display is no longer recognised by the Ubuntu settings.

If I then try xrandr --setprovideroutputsink NVIDIA-G0 NVIDIA-0 && xrandr --auto, I run into several issues. --setprovideroutputsink does not seem to be a valid flag; --setprovideroutputsource and --setprovideroffloadsink are the only options. Furthermore, NVIDIA-G0 does not appear, so I can only use NVIDIA-0 as a single argument.
The results of these commands look as follows:

$ xrandr --setprovideroutputsource NVIDIA-0 && xrandr --auto
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  140 (RANDR)
  Minor opcode of failed request:  35 (RRSetProviderOutputSource)
  Value in failed request:  0x218
  Serial number of failed request:  15
  Current serial number in output stream:  16
$ xrandr --setprovideroffloadsink NVIDIA-0 && xrandr --auto
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  140 (RANDR)
  Minor opcode of failed request:  34 (RRSetProviderOffloadSink)
  Value in failed request:  0x218
  Serial number of failed request:  15
  Current serial number in output stream:  16

leaving the rest unchanged as they seem to fail.

‘setprovideroutputsource’ was the correct token, sorry.
It doesn’t matter though: the second provider isn’t there because I forgot to enable the eGPU. Please create /etc/X11/xorg.conf.d/11-nvidia-egpu.conf

Section "OutputClass"
    Identifier "nvidia-egpu"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
    Option "AllowExternalGpus" "True"
EndSection

and try again.
Edit: changed snippet name for better visibility in logs.


First of all thank you for the awesome support and the stellar response times. It is a pleasure.

By adding 11-nvidia-egpu.conf it was indeed possible to have both screens working properly!
Unfortunately, the configuration seems a bit strange:


I generated an xorg.conf from the nvidia settings:
xorg.conf (1.8 KB)
It seems that by default an X Screen is generated which only runs on the internal GPU and manages both displays (plus the detected resolution of the HDMI screen is wrong, it should be 1080p and not PRIME?). Furthermore, if I try to run Isaac Gym in this configuration using the 3080 Ti, everything crashes and I’m thrown back to the login screen, which seems to make sense as it is not responsible for the graphics.

I then tried to set the 3080 Ti as primary as follows (not sure if this is the way to go): I created a new X Screen out of the disabled one, which seems to be a non-PRIME version of the external screen, and rebooted.
In that case the settings looked “right”, but the external monitor stayed black with an “x” as a mouse pointer:


I again generated an xorg.conf from the nvidia settings:
xorg2.conf (2.7 KB)

The displayed monitors (one PRIME and one disabled) might be confusing, but that is correct.
To have the 3080 as primary, please delete your current xorg.conf and replace it with one only containing

Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "NVIDIA GeForce RTX 3080 Ti"
    Option "AllowExternalGpus" "True"
    Option "PrimaryGpu" "yes"
    BusID          "PCI:34:0:0"
EndSection
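
A note for anyone adapting this: the BusID value is specific to this machine. lspci prints PCI addresses in hexadecimal, while the BusID option in xorg.conf expects decimal numbers, so the value has to be converted. Here is a minimal Python sketch of the conversion; the function name is made up, and the example address 22:00.0 is simply 34 written in hex, not something taken from the logs.

```python
def xorg_bus_id(lspci_address):
    """Convert an lspci address such as '22:00.0' to Xorg's 'PCI:34:0:0'."""
    parts = lspci_address.split(":")
    # An optional leading PCI domain ('0000:22:00.0') is ignored in this sketch.
    bus_hex, dev_func = parts[-2], parts[-1]
    dev_hex, func_hex = dev_func.split(".")
    # lspci uses hexadecimal; xorg.conf wants decimal.
    return "PCI:{}:{}:{}".format(
        int(bus_hex, 16), int(dev_hex, 16), int(func_hex, 16)
    )
```

For example, xorg_bus_id("22:00.0") gives the "PCI:34:0:0" used above.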

It worked! Everything is working, both screens are properly configured and Isaac Gym is running on the 3080 Ti.

Thank you so much for your help, it’s highly appreciated. Thanks to you so much trouble is gone now.
Best Regards

I am having similar problems to Teplotaxl. I have tried the steps in this thread and got as far as having my external monitor working (the internal display is not working). My eGPU contains an RTX 2080.

However, when I try to launch the Isaac Gym examples, I get a black screen, as I think it is trying to run on my laptop’s internal 1050 Ti Max-Q. If I set --sim_device cuda:1 --graphics_device_id 1, I get a very blurry picture and Isaac Gym crashes shortly after.
I have followed the instructions of creating “11-nvidia-egpu” and “xorg.conf” in /etc/X11/

Do you maybe have an idea of what could be wrong?



xorg.conf (269 Bytes)

11-nvidia-egpu.conf (149 Bytes)

This seems to be a bug either in the NVIDIA driver or in Isaac Gym. Please try to work around it by adding
Option "ProbeAllGpus" "false"
inside the device section of your xorg.conf.
Another approach would be using a udev rule to remove the 1050 from the bus so the system doesn’t see it.

Unfortunately, this did not work. Do you know of a good guide on how to disable the 1050 Ti?
If I disable it, will I still be able to use the Intel iGPU when the eGPU is not plugged in?

A follow-up: if I unplug the laptop from the eGPU to use it as a standalone laptop, I get a black screen with the files/blocks error and it won’t boot. Do I have to change the configuration that @generix suggested in order to use the laptop on its own?

Thanks again for your help.

The config file needs to be removed beforehand. Unfortunately, there’s no way to make this config conditional. You’d have to write a script which checks for the existence of the eGPU on boot and then creates or deletes the config, and have systemd start it so this happens automatically.
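
A rough Python sketch of such a script could look like this. It is only an illustration under several assumptions: the sysfs address corresponds to the BusID "PCI:34:0:0" from earlier in the thread (bus 34 decimal is 0x22) and will differ on other machines, the config lives in /etc/X11/xorg.conf, and the script runs as root before the display manager starts, e.g. from a oneshot systemd unit.

```python
from pathlib import Path

# Assumed locations; adjust both to the actual system.
EGPU_SYSFS = Path("/sys/bus/pci/devices/0000:22:00.0")
CONF_PATH = Path("/etc/X11/xorg.conf")

# Device section taken from earlier in this thread.
CONF_TEXT = """Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    Option "AllowExternalGpus" "True"
    Option "PrimaryGpu" "yes"
    BusID          "PCI:34:0:0"
EndSection
"""


def sync_egpu_config(egpu_sysfs=EGPU_SYSFS, conf_path=CONF_PATH,
                     conf_text=CONF_TEXT):
    """Write the config if the eGPU is present, delete it otherwise.

    Returns True if the eGPU's PCI device was found in sysfs.
    """
    egpu_sysfs, conf_path = Path(egpu_sysfs), Path(conf_path)
    present = egpu_sysfs.exists()
    if present:
        conf_path.write_text(conf_text, encoding="utf-8")
    elif conf_path.exists():
        conf_path.unlink()
    return present
```

Calling sync_egpu_config() once per boot would be enough for the simple case. Note that it overwrites or deletes whatever is at the config path, so any other settings kept in that file would be lost.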
