How can I run both a chatbot with GPUs in Docker and a Virtual Machine Manager (VMM) on the same server without networking issues on Debian 12?

I am running two services on a Debian 12 server:

Chatbot in Docker using NVIDIA driver 550.54.14, CUDA 12.4, and 3 GPUs (pinning the GPUs to the container is sketched just below this list).
Virtual Machine Manager (VMM) using 2 GPUs for VM creation.
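
For completeness, this is roughly how a container is restricted to specific GPUs with the NVIDIA Container Toolkit; the image name and device indices below are placeholders, not the actual deployment:

# Sketch: expose only GPUs 0-2 to the chatbot container (indices and image are placeholders)
docker run --rm --gpus '"device=0,1,2"' nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi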

The issue appeared after deploying the chatbot: while the chatbot is running, the VMM stops working. I bound the 2 passthrough GPUs to vfio-pci instead of the nvidia driver, which allows VM creation, but I can only access the VM from the host server, not from other devices on the same network.
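
(For reference, the driver each GPU is bound to can be checked with lspci; the PCI addresses below are the two passthrough GPUs referenced later in this post.)

# Check which kernel driver each passthrough GPU is currently bound to
lspci -nnk -s 1a:00.0
lspci -nnk -s 1a:00.1
# "Kernel driver in use: vfio-pci" is expected for the VM GPUs,
# "Kernel driver in use: nvidia" for the three chatbot GPUs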

Here’s the setup:

  1. VM Access Inside the Server:

I can SSH into the VM from the host server:

(base) root@debian:~# ssh vm@192.168.122.74
vm@192.168.122.74's password:
Welcome to Ubuntu 22.04.5 LTS...
  2. VM Access from Another Device:

When attempting to SSH into the VM from another device:

PS C:\Users\cmcarvalho> ssh vm@192.168.220.102 -p 31270
ssh: connect to host 192.168.220.102 port 31270: Connection timed out
  3. NAT Configuration:

I checked the network address translation (NAT) rules, and they appear to be correct:

(base) root@debian:~# sudo nft list table ip nat
# Output omitted for brevity...

The PREROUTING chain forwards port 31270 to the internal IP 192.168.122.74, but the VM is still only accessible from the host server.
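
Since the rule listing is omitted above, a rule of roughly this shape is what the PREROUTING forward typically looks like (illustrative only; the exact chain names depend on how the nat table was created):

# Illustrative only: DNAT TCP 31270 arriving at the host to the VM's SSH port
nft add rule ip nat PREROUTING tcp dport 31270 dnat to 192.168.122.74:22
# (libvirt normally adds the matching masquerade rule for 192.168.122.0/24 itself)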

  4. Networking Service Failure:

The networking.service fails when the chatbot and GPUs are configured together:

(base) root@debian:~# sudo systemctl status networking
× networking.service - Raise network interfaces
     Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Thu 2024-11-28 11:21:24 WET; 22h ago
       Docs: man:interfaces(5)
   Main PID: 2358 (code=exited, status=1/FAILURE)

When I check the logs for networking.service, I see the following failure:

Nov 28 10:38:39 debian ifup[1873]: RTNETLINK answers: File exists
Nov 28 10:38:39 debian ifup[1663]: ifup: failed to bring up br0
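
For context, "RTNETLINK answers: File exists" from ifup usually means the object it is trying to create (the bridge, an address, or a route) already exists, e.g. because something else configured it first. A diagnostic sketch to see what is already in place when br0 fails to come up:

# Diagnostic sketch: compare the live state of br0 with what ifup is about to configure
ip -d link show br0                     # does br0 already exist?
ip addr show dev br0                    # is 192.168.220.102/24 already assigned to it?
ip addr show dev eno2                   # the same address on the bridge port would also conflict
bridge link show                        # which interfaces are enslaved to br0
grep -B1 -A6 'iface br0' /etc/network/interfaces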

After restarting the service, the server goes down; after a reboot, CUDA and the NVIDIA driver are no longer present (nvidia-smi and nvcc stop working). At that point the VM becomes reachable again from other devices on the network, but as soon as NVIDIA and CUDA are reinstalled it stops being reachable.

  5. Solution Attempts:

To resolve this, I attempted the following:

Allocated the GPUs to vfio-pci using driverctl:

sudo driverctl set-override 0000:1A:00.1 vfio-pci
sudo driverctl set-override 0000:1A:00.0 vfio-pci
# In a modprobe config file (e.g. /etc/modprobe.d/vfio.conf), so the GPU binds to vfio-pci at boot:
options vfio-pci ids=10de:2230
sudo update-initramfs -u
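
After updating the initramfs and rebooting, the override can be verified with driverctl itself (a quick check, assuming the same PCI addresses as above):

# Verify the overrides persisted across the reboot
driverctl list-overrides                  # should list 0000:1a:00.0 and 0000:1a:00.1 -> vfio-pci
driverctl list-devices | grep -i 1a:00    # shows the driver currently bound to each device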

But this results in the same issue: networking.service fails and the VM is only accessible from the host.

  6. Network Interfaces:

(base) root@debian:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    inet6 fe80::216f:dc53:bfcd:e528/64 scope link
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP group default qlen 1000
    inet 192.168.220.102/24 scope global br0
4: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 192.168.220.102/24 scope global br0
5: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 192.168.122.1/24 scope global virbr0
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    inet 172.17.0.1/16 scope global docker0
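
Note that in this output both eno2 and br0 carry 192.168.220.102/24 even though eno2 is enslaved to br0; normally only the bridge holds the address. For comparison, a typical ifupdown bridge stanza for this layout looks roughly like this (a sketch; the gateway value is a placeholder and bridge-utils is assumed to be installed):

# /etc/network/interfaces (sketch): address only on the bridge, eno2 as a plain bridge port
auto eno2
iface eno2 inet manual

auto br0
iface br0 inet static
    address 192.168.220.102/24
    gateway 192.168.220.1      # placeholder: use the real gateway for this subnet
    bridge_ports eno2
    bridge_stp off
    bridge_fd 0
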
  7. Question:

What could be causing the issue where the networking service fails and the VM is only accessible from the host after I configure the chatbot with the NVIDIA driver and GPUs in Docker? How can I ensure that both the chatbot and VMM work simultaneously on the same server without disrupting the networking?

Thanks for the post. I think this would be better moved out of the Omniverse-specific forums to a more general NVIDIA Developer Forums category, as it is not really Omniverse-related. I will find out where to move it. Thanks.

@TomNVIDIA

Not sure where this should go, so I moved it to the AI Foundation Models forum. Hopefully someone here will know how to troubleshoot the issue.