How can I run both a chatbot with GPUs in Docker and a Virtual Machine Manager (VMM) on the same server without networking issues in Debian 12?

cmcarvalho · November 29, 2024, 3:03pm

I am running two services on a Debian 12 server:

Chatbot in Docker using NVIDIA driver 550.54.14, CUDA 12.4, and 3 GPUs.
Virtual Machine Manager (VMM) using 2 GPUs for VM creation.

The issue appeared after I set up the chatbot: while the chatbot is running, the VMM stops working. I bound the 2 VM GPUs to vfio-pci instead of the nvidia driver, which allows VM creation, but I can only access the VM from the host server, not from other devices on the same network.

Here’s the setup:

VM Access Inside the Server:

I can SSH into the VM from the host server:

(base) root@debian:~# ssh vm@192.168.122.74
vm@192.168.122.74's password:
Welcome to Ubuntu 22.04.5 LTS...

VM Access from Another Device:

When attempting to SSH into the VM from another device:

PS C:\Users\cmcarvalho> ssh vm@192.168.220.102 -p 31270
ssh: connect to host 192.168.220.102 port 31270: Connection timed out

NAT Configuration:

I checked the network address translation (NAT) rules, and they appear correct:

(base) root@debian:~# sudo nft list table ip nat
# Output omitted for brevity...

The PREROUTING chain forwards port 31270 to the internal IP 192.168.122.74, but the VM is still only accessible from the host server.
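
A DNAT rule in PREROUTING only rewrites the destination; the packet still has to be accepted by every forward-hook chain before it reaches virbr0. The following is a minimal read-only sketch for checking that path, assuming the rules live in nftables tables as shown above. `has_dnat_rule` and `check_forward_path` are hypothetical helper names, and the `nft` commands need root.

```shell
#!/bin/sh
# Succeeds if the ruleset text on stdin contains a DNAT rule
# for external port $1 pointing at address $2.
has_dnat_rule() {
    grep -Eq "dport $1 .*dnat to $2([^0-9.]|\$)"
}

# Hypothetical helper: print the facts that decide whether a
# forwarded connection can reach the VM.
check_forward_path() {
    port="$1"      # external port, e.g. 31270
    vm_ip="$2"     # VM address, e.g. 192.168.122.74

    # 1 means the kernel forwards packets between interfaces.
    echo "ip_forward=$(cat /proc/sys/net/ipv4/ip_forward)"

    if nft list table ip nat | has_dnat_rule "$port" "$vm_ip"; then
        echo "DNAT rule for port $port found"
    else
        echo "no DNAT rule for port $port to $vm_ip"
    fi

    # Forward-hook chains (with their policy), plus any rule that
    # mentions the libvirt bridge or rejects traffic.
    nft list ruleset | grep -E 'hook forward|virbr0|reject'
}

# Example (as root):
#   check_forward_path 31270 192.168.122.74
```

If a forward chain shows `policy drop`, or a reject rule for traffic going out through virbr0 appears before any accept rule for the forwarded port, that would be consistent with the VM answering only from the host. Both Docker and libvirt install their own forward rules, so either could be involved.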

Networking Service Failure:

The networking.service fails when the chatbot and GPUs are configured together:

(base) root@debian:~# sudo systemctl status networking
× networking.service - Raise network interfaces
     Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Thu 2024-11-28 11:21:24 WET; 22h ago
       Docs: man:interfaces(5)
   Main PID: 2358 (code=exited, status=1/FAILURE)

When I check the log history of the networking service, I see the following failure:

Nov 28 10:38:39 debian ifup[1873]: RTNETLINK answers: File exists
Nov 28 10:38:39 debian ifup[1663]: ifup: failed to bring up br0
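
"RTNETLINK answers: File exists" is the kernel reporting that something ifup tried to create (an address, a route, or the bridge itself) is already present. One possibility worth ruling out is the same IPv4 address being configured on more than one interface; the `ip a` output in this post lists 192.168.220.102/24 under both eno2 and br0. The sketch below looks for that case. `find_duplicate_addrs` is a hypothetical helper name, not part of iproute2.

```shell
#!/bin/sh
# List IPv4 addresses that appear on more than one interface.
# Input: the output of `ip -o -4 addr show` on stdin.
find_duplicate_addrs() {
    # In `ip -o` output, field 2 is the interface name and
    # field 4 is the address with its prefix length.
    awk '$3 == "inet" { print $4 }' | sort | uniq -d
}

# Example (run on the host):
#   ip -o -4 addr show | find_duplicate_addrs
```

An empty result means no address is duplicated, in which case the existing routes (`ip route`) and bridge membership (`bridge link`) are the next things to compare against /etc/network/interfaces.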

After restarting the service, the server goes down, and after a reboot the CUDA toolkit and NVIDIA drivers are no longer present (nvidia-smi and nvcc do not work). The VM becomes accessible again from devices on the same network, but stops being accessible once the NVIDIA driver and CUDA are reinstalled.

Solution Attempts:

To resolve this, I attempted the following:

Allocated the GPUs to vfio-pci using driverctl:

sudo driverctl set-override 0000:1A:00.1 vfio-pci
sudo driverctl set-override 0000:1A:00.0 vfio-pci
# The next line is a modprobe configuration entry, not a shell command:
options vfio-pci ids=10de:2230  # Ensure GPU binds to vfio-pci
sudo update-initramfs -u
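
Note that `options vfio-pci ids=10de:2230` matches by vendor:device ID rather than by PCI address, so it may claim every card with that ID, including the ones intended for Docker if they are the same model. The sketch below reports which kernel driver each matching device is actually bound to. `driver_in_use` is a hypothetical helper name; `lspci` and `driverctl list-overrides` are the real tools it relies on.

```shell
#!/bin/sh
# Print the kernel driver bound to a PCI device ("vfio-pci", "nvidia", ...).
# Input: the output of `lspci -k -s <address>` on stdin.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}

# Example: show the bound driver for every device with ID 10de:2230.
#   for dev in $(lspci -Dn -d 10de:2230 | cut -d' ' -f1); do
#       printf '%s %s\n' "$dev" "$(lspci -k -s "$dev" | driver_in_use)"
#   done
#
# Overrides set with driverctl can be listed with:
#   driverctl list-overrides
```

For the setup described here, the expected result would be vfio-pci on the 2 VM GPUs and nvidia on the 3 chatbot GPUs. If all five report vfio-pci, that would be consistent with nvidia-smi no longer finding any device.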

But this results in the same issue: the networking.service fails and the VM is only accessible from the host.

Network Interfaces:

(base) root@debian:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    inet6 fe80::216f:dc53:bfcd:e528/64 scope link
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP group default qlen 1000
    inet 192.168.220.102/24 scope global br0
4: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 192.168.220.102/24 scope global br0
5: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 192.168.122.1/24 scope global virbr0
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    inet 172.17.0.1/16 scope global docker0

Question:

What could be causing the issue where the networking service fails and the VM is only accessible from the host after I configure the chatbot with the NVIDIA driver and GPUs in Docker? How can I ensure that both the chatbot and VMM work simultaneously on the same server without disrupting the networking?
