Accelerating Machine Learning on a Linux Laptop with an External GPU

Originally published at: https://developer.nvidia.com/blog/accelerating-machine-learning-on-a-linux-laptop-with-an-external-gpu/

With the introduction of Intel Thunderbolt 3 in laptops, you can now use an external GPU (eGPU) enclosure to attach a dedicated GPU for gaming, production, and data science. A Thunderbolt 3 eGPU setup consists of:

- A discrete GPU
- An enclosure to house it in
- A power supply
- A Thunderbolt 3 connection to the laptop

Most enclosures provide…

Hi, this is Dhruv. Hope you enjoyed reading my blog. Setting up my TB3 eGPU came on the heels of using eGPUs throughout my time at university and trying to create a compact system when I joined NVIDIA during WFH. At university, I used an ExpressCard-based eGPU setup with my T430 for my AI and Parallel Programming courses. When I got a TB3 machine, the theoretical performance increase from the improved bandwidth led me to get an eGPU setup instead of a workstation. Since then, I've been very pleased with the performance and portability of the solution. It's hard to beat being able to work on the couch and then plug into the eGPU on the desk for some compute.
I hope you have a great experience if you decide to use an eGPU for your work. If you have any questions or comments, let me know, and we can try to resolve them :)

Hi Dhruv:

Our company (https://kfocus.org) is working with organizations like JPL and other big-data users that are interested in eGPUs. We really appreciate your post and are very interested in providing solutions for them. However, it is not clear how to use the iGPU, dGPU, and eGPU concurrently - for example, use the dGPU for display and the eGPU for Blender rendering. Might you have recommendations on resources for that?

Any help would be greatly appreciated!

Sincerely, Mike

Hi Dhruv: Is this something you can help with? Cheers, Mike

Hi @deppman
While there isn't a turnkey solution to the iGPU + dGPU + eGPU problem, there are a couple of ways of going about creating one.
You could set the iGPU to be the X screen renderer on battery and the dGPU to be the X screen renderer on AC. There are some tools within Ubuntu for this, like gpu-manager/prime-select, or you can use Offloading Graphics Display with RandR 1.4.
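For example, on Ubuntu the switch looks like this (a sketch of the workflow; a log-out or reboot is needed for the change to take effect):

$ prime-select query          # which GPU currently renders the X screen
$ sudo prime-select intel     # render on the iGPU, e.g. on battery
$ sudo prime-select nvidia    # render on the NVIDIA dGPU, e.g. on AC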

When it comes to using the eGPU, you can use it either as a PRIME Render Offload device or as a compute device, depending on the task. PRIME Render Offload is meant for applications that need to be rendered on a different GPU than the one driving the X screen, for example Blender (an older alternative was Bumblebee). Compute is meant for CUDA/accelerated data science tasks.

For my machine, I'm using the eGPU as my primary X renderer because I don't have a dGPU. For a laptop with an iGPU + dGPU + eGPU, I'd imagine that wouldn't be the case, and if it is, you can use "AllowExternalGpus" to make the eGPU the primary X renderer. Otherwise, you could PRIME Render Offload Blender to the eGPU/dGPU (depending on what is connected and what the power source is), and otherwise use the eGPU as a compute device when it is connected.
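For reference, "AllowExternalGpus" lives in an xorg config file (a minimal sketch; the Identifier name is arbitrary, and the blog's 10-nvidia.conf is one place to put it):

Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
    Option     "AllowExternalGpus" "True"
EndSection

And PRIME Render Offload for a single application is driven by environment variables, for example to run Blender on the NVIDIA GPU:

$ __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia blender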

In case you have both the dGPU and eGPU connected and can't use PRIME Render Offload, you'd have to rely on some other way of hiding the other GPU. One way I've been using is Docker with the --gpus flag or the NVIDIA_VISIBLE_DEVICES environment variable. Some other applications, like OBS, let you select the GPU if you happen to have multiple GPUs in your system.
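As a sketch of the Docker approach (the CUDA image tag is just an example, the device index is whatever nvidia-smi reports for the GPU you want to expose, and the second form assumes the NVIDIA container runtime is installed):

$ docker run --rm --gpus '"device=1"' nvidia/cuda:11.0-base nvidia-smi
$ docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 nvidia/cuda:11.0-base nvidia-smi

Either form hides every other GPU from the container, so the workload inside only ever sees the one you picked.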

Here are some links to relevant information:
http://us.download.nvidia.com/XFree86/Linux-x86_64/455.28/README/optimus.html
http://us.download.nvidia.com/XFree86/Linux-x86_64/455.28/README/randr14.html
http://us.download.nvidia.com/XFree86/Linux-x86_64/455.28/README/primerenderoffload.html
https://wiki.archlinux.org/index.php/bumblebee


Thanks for the great tutorial. Is it possible to use two eGPUs as well?

Hi Dsinga:

Sorry for the delay, but I must have missed the notification. Thank you for all your help. We will follow up on your articles. For our customers' purposes, using the eGPU as an add-on compute unit is the most common use case, and here the Core X is working great; we have been able to run the dGPU + eGPU as two separate concurrent GPGPU workloads. I will report back when we resume testing. Thanks again!

Sincerely, Mike

Great info here. I have a Razer laptop with a Quadro RTX 5000. I'm wondering, if I get a Core X with a desktop Quadro RTX 5000, will I be able to use both at the same time (mobile Quadro RTX 5000 + desktop Quadro RTX 5000 in the eGPU) for my DL training, in principle?

Hi Dhruv,

Really appreciate the detail your post goes into. I am however unclear about the following:

Make sure that the NVIDIA GPU is detected by the system and a suitable driver is loaded:

$ lspci | grep -i "nvidia"

$ lsmod | grep -i "nvidia"

The existing driver is most likely Nouveau, an open-source driver for NVIDIA GPUs. Because Nouveau doesn’t support eGPU setups, install the NVIDIA CUDA and NVIDIA drivers instead. You must also stop the kernel from loading Nouveau.
Get the latest version of the NVIDIA CUDA Toolkit for your distribution. For Ubuntu 20.04, this toolkit is available from the standard repository:

$ sudo apt-get install nvidia-cuda-toolkit

Does this mean that if lsmod | grep -i "nvidia" returns nothing, I need to install the NVIDIA drivers using sudo apt-get install nvidia-cuda-toolkit? It seems to me from your post that the drivers and CUDA toolkit should all be installed by that command (unless installing the drivers is outside the scope of your post).

However, when I check the dependencies of nvidia-cuda-toolkit on Ubuntu 20.04 using apt, there are no nvidia-driver dependencies. Are the drivers contained in some of the libnvidia or nvidia-cuda packages?

Thanks for your time.

EDIT: I installed nvidia-cuda-toolkit and confirmed that the drivers are not installed. Thus the installation of CUDA and the drivers is likely outside the scope of this post.

Sure! As long as your CPU + motherboard provide enough PCIe lanes to your Thunderbolt ports (check with your laptop/NUC manufacturer or see if they have an engineering diagram for it), you can run two eGPUs. Use nvidia-settings to configure your displays and NVIDIA GPUs.
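As a quick sanity check that both enclosures are authorized and both GPUs are visible to the driver:

$ boltctl list    # connected/authorized Thunderbolt devices
$ nvidia-smi -L   # every NVIDIA GPU the driver can see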


Yes. If you're trying to use your laptop dGPU or iGPU to drive the display, don't add Option "AllowExternalGpus" "True" to the 10-nvidia.conf xorg config file, since that makes the eGPU drive Xorg and thus the displays. Most machines also support connecting an eGPU after boot (although disconnecting while booted can cause Xorg to crash or a kernel panic).
The desktop Quadro RTX 5000 would show up as another GPU in your system along with your mobile Quadro RTX 5000. You'd then have to structure your code to leverage multiple GPUs through layer or model parallelism. That said, you might find a difference in bandwidth between your eGPU and dGPU, so to get the best training performance, profile to see whether you're bandwidth-bound on the eGPU; if you are, use different batch sizes based on each GPU's ability to iterate through them.
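As a starting point for that profiling, nvidia-smi can report each GPU's negotiated PCIe link and its live PCIe throughput (the eGPU will typically show a narrower link than the internal dGPU):

$ nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv
$ nvidia-smi dmon -s t    # per-GPU PCIe Rx/Tx throughput while training runs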


lsmod returns a list of modules loaded by the Linux kernel. If the $ lspci | grep -i "nvidia" command shows that there is an NVIDIA GPU connected to the PCI bus and $ lsmod | grep -i "nvidia" returns nothing, you either don't have the NVIDIA driver installed, or there's something wrong with the driver installation that doesn't allow the kernel to load it, which is very unlikely. Regarding nvidia-cuda-toolkit providing drivers, it should provide you with the NVIDIA driver. Once you install nvidia-cuda-toolkit, what's the output of $ nvidia-smi and $ lsmod | grep -i "nvidia"? It might be that nvidia-cuda-toolkit installed the driver but didn't add the location of the nvidia-smi binary to PATH.
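If the driver does turn out to be missing, one option on Ubuntu (assuming the ubuntu-drivers-common package is installed) is:

$ ubuntu-drivers devices           # list detected GPUs and the recommended driver
$ sudo ubuntu-drivers autoinstall  # install that recommended driver
$ nvidia-smi                       # confirm the driver now sees the GPU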


So I have to be honest here and say that this isn't the right method for my system. I found that as long as I enabled Thunderbolt, I could see the GPU in my laptop, and the eGPU was also recognized for CUDA. The minute I added the Option "AllowExternalGpus" "True" line, I lost the ability to use the HDMI output from my eGPU. The eGPU's monitor would show up in my display settings as a second display, but my mouse wouldn't move onto it. In the end, I went back to runlevel 3, and my external monitor driven by the eGPU showed up without this option; the nvidia-smi command was showing the eGPU beforehand. So I am not sure this should be kept around as an instruction…

What does the option actually do? It doesn't really make sense to me, given that with Thunderbolt enabled, nvidia-smi already showed both GPUs connected. I think the blog should give a warning at that point in the section before telling readers to try it. After doing this, I now can't get my second monitor picked up even though it worked just before these changes. So I'm a little annoyed… to say the least…

It would probably be best to state a bit more clearly that in your case (which you mention, but it should be clearer) the setup is a laptop without a dGPU?

edit:
As a warning, this appears to basically wipe all the settings that NVIDIA applies automatically, such as disabling Nouveau. Having turned it off, GRUB now displays on the eGPU's HDMI output, but that output shuts down as soon as the machine boots to a GUI.

2nd time: after 30 minutes of trying, I just swapped the HDMI cable over to my native laptop GPU, and it seems to be working fine. At least the eGPU is free for other stuff, and CUDA 11.0 can see it with nvcc…

Spent hours trying to get an eGPU to work with my laptop. No dice. Just a complete and total nightmare, and frankly not worth spending days to weeks battling poorly written software and random chipsets that may or may not work. Linux has a long way to go in the eGPU area. Until someone makes a completely straightforward, "drivers included with the enclosure" system (like winblows) for Linux (Ubuntu), an eGPU is a really mixed bag. Distro, enclosure, drivers, chipsets, cards, laptop models: all can have a major effect on the ability to make this work.

Not sure if this is a mistake or just something that varies between systems, but this didn't work for me until I changed Option "AllowExternalGpus" "True" to Option "AllowExternalGpus" "true". Took me a while to locate the issue, so I thought I'd try to save someone else the hassle!

Thanks! Looking at the xorg.conf documentation, it looks like any of "1", "on", "true", or "yes" would be accepted as the boolean True when enclosed in quotation marks.
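So, going by the xorg.conf(5) man page, these should all be equivalent ways to enable the option:

Option "AllowExternalGpus" "True"
Option "AllowExternalGpus" "true"
Option "AllowExternalGpus" "on"
Option "AllowExternalGpus" "1"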

Fun fact, the massively oversized PCI BAR on the Tesla K80 doesn’t play well with Thunderbolt 3 eGPU solutions. I haven’t found a documented limitation in Thunderbolt 3, but it “doesn’t work anywhere I try it” lol.


I fixed this; it turns out it wasn't me, but rather that the PCIe board the Sonnet Breakaway Box 750 now uses is not Linux-compatible. This was disappointing, as Sonnet was one of the eGPU enclosure manufacturers that were producing good-quality compatible hardware. Now they just produce Windows-only (read: you need a hacky driver) hardware. Tried again with a Razer Core X eGPU enclosure and an MSI RTX 3060 Ti 3x OC Ventus card. Works like a charm! The Razer Core series is fully compliant. I would really, really like to see hot-plug work at some point, but I gather the Wayland display protocol is almost there (on Ubuntu).

Hi @dsingalNV,

Thanks for the informative post.! I have some difficulty with the setup though.

I use a Sonnet Breakaway Box 550 with a desktop RTX 3070 card attached to it. The laptop that I use has an internal 3050 Ti card. My application requires compute to be done on the eGPU, while the X server and all rendering need to run on the internal 3050 Ti card.

I use Ubuntu 20.04 and installed the correct NVIDIA drivers. nvidia-smi only shows the internal graphics card, not the card connected through Thunderbolt, but lspci shows both cards and boltctl shows the correct Thunderbolt device connected. As I understand it, since I don't want X to run on the eGPU, I don't really have to change any xorg config files. I'm not sure what else to do.

I don’t know why nvidia-smi doesn’t identify the eGPU. Can you help me out?

Cheers,
Varun
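A few generic diagnostics may help narrow down why nvidia-smi doesn't see a Thunderbolt-attached GPU (a sketch only; <uuid> comes from boltctl list, and the authorize step applies only if the enclosure shows as unauthorized):

$ boltctl list                          # is the enclosure connected and authorized?
$ sudo boltctl authorize <uuid>         # authorize it if not
$ lspci -k | grep -iA3 nvidia           # which kernel driver is bound to each GPU?
$ sudo dmesg | grep -iE "nvidia|nvrm"   # driver messages/errors while binding the eGPU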

Hi, Dhruv.

I recently tried a configuration of a host with Thunderbolt 4 connected to a Thunderbolt 4 hub that is connected to three Thunderbolt-to-PCIe enclosures, each containing a GTX 1060.

Each GTX 1060 enumerated in Windows 11 22H2 and Ubuntu 22.04 (5.15-60) with current NVIDIA drivers (as of 2023-04-06).

What a pleasant surprise!

I used a parallel test case in MATLAB to confirm all GPUs were utilized.

I do not know if this configuration works in general with other computer models, Thunderbolt devices and Nvidia GPU models.

I work at a company that makes Thunderbolt devices.

Could we get in touch to share notes? Or could you introduce me to a colleague involved with product compatibility testing?