I am running two services on a Debian 12 server:
- Chatbot in Docker, using NVIDIA driver 550.54.14, CUDA 12.4, and 3 GPUs.
- Virtual Machine Manager (VMM), using 2 GPUs for VM creation.
The issue appeared after deploying the chatbot: while the chatbot is running, the VMM stops working. I bound the 2 GPUs to vfio-pci instead of the nvidia driver, which allows VM creation, but I can only access the VM from the host server, not from other devices on the same network.
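This is roughly how I verify which driver each GPU function is bound to (a quick sketch; the vendor:device ID is the one from my setup and the single PCI address is only illustrative):
lspci -nnk -d 10de:2230        # show the kernel driver in use for the NVIDIA devices
lspci -nnks 1a:00.0            # or check a single device function
driverctl list-overrides       # list any active driverctl overrides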
Here’s the setup:
- VM Access Inside the Server:
I can SSH into the VM from the host server:
(base) root@debian:~# ssh vm@192.168.122.74
vm@192.168.122.74's password:
Welcome to Ubuntu 22.04.5 LTS...
- VM Access from Another Device:
When attempting to SSH into the VM from another device:
PS C:\Users\cmcarvalho> ssh vm@192.168.220.102 -p 31270
ssh: connect to host 192.168.220.102 port 31270: Connection timed out
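As a rough check (not conclusive), I can watch the traffic on the host to see whether the forwarded packets ever reach the bridge and, after DNAT, the VM network:
sudo tcpdump -ni br0 tcp port 31270                          # does the external SSH attempt reach the bridge?
sudo tcpdump -ni virbr0 tcp port 22 and host 192.168.122.74  # is it forwarded towards the VM? (guest SSH port 22 assumed)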
- NAT Configuration:
I checked the network address translation (NAT) rules, and they appear to be correct:
(base) root@debian:~# sudo nft list table ip nat
# Output omitted for brevity...
The PREROUTING chain forwards port 31270 to the internal IP 192.168.122.74, but the VM is still only accessible from the host server.
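For context, the forwarding rule is essentially of this shape (illustrative nft syntax, not the literal output omitted above; I assume the guest's SSH port is 22):
table ip nat {
    chain PREROUTING {
        type nat hook prerouting priority dstnat; policy accept;
        tcp dport 31270 dnat to 192.168.122.74:22
    }
}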
- Networking Service Failure:
The networking.service fails when the chatbot and GPUs are configured together:
(base) root@debian:~# sudo systemctl status networking
× networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Thu 2024-11-28 11:21:24 WET; 22h ago
Docs: man:interfaces(5)
Main PID: 2358 (code=exited, status=1/FAILURE)
When I check the service's journal, I see the following failure:
Nov 28 10:38:39 debian ifup[1873]: RTNETLINK answers: File exists
Nov 28 10:38:39 debian ifup[1663]: ifup: failed to bring up br0
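The "File exists" error from RTNETLINK suggests ifup is trying to configure something on br0 that is already configured. These are the checks I run right after a failure (a sketch):
journalctl -u networking.service -b   # full log around the failed ifup attempt
ifquery br0                           # what ifupdown thinks br0 should look like
ip addr show br0                      # what is actually configured at that moment
ip route show                         # any pre-existing address/route that could trigger "File exists"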
After restarting the service, the server goes down. When it is rebooted, the CUDA and NVIDIA drivers are no longer present (nvidia-smi and nvcc no longer work). The VM then becomes accessible again from other devices on the same network, but only until NVIDIA and CUDA are reinstalled.
- Solution Attempts:
To resolve this, I attempted the following:
Allocated the GPUs to vfio-pci using driverctl:
sudo driverctl set-override 0000:1A:00.1 vfio-pci
sudo driverctl set-override 0000:1A:00.0 vfio-pci
# In /etc/modprobe.d/vfio.conf, so the GPU binds to vfio-pci at boot:
options vfio-pci ids=10de:2230
sudo update-initramfs -u
But this results in the same issue: networking.service fails and the VM is only accessible from the host.
- Network Interfaces:
(base) root@debian:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
inet 127.0.0.1/8 scope host lo
2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
inet6 fe80::216f:dc53:bfcd:e528/64 scope link
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br0 state UP group default qlen 1000
4: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
inet 192.168.220.102/24 scope global br0
5: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
inet 192.168.122.1/24 scope global virbr0
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
inet 172.17.0.1/16 scope global docker0
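For reference, the bridge is defined with ifupdown along these lines (an illustrative stanza, not a verbatim copy of my /etc/network/interfaces; the gateway address is assumed):
# /etc/network/interfaces (illustrative)
auto eno2
iface eno2 inet manual          # bridge member, no address of its own

auto br0
iface br0 inet static
    address 192.168.220.102/24
    gateway 192.168.220.1       # assumed gateway for this subnet
    bridge_ports eno2
    bridge_stp off
    bridge_fd 0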
- Question:
What could be causing networking.service to fail and the VM to be reachable only from the host after I configure the chatbot with the NVIDIA driver and GPUs in Docker? How can I make sure that both the chatbot and the VMM work simultaneously on the same server without disrupting the host's networking?