Unstable Auto-Start of Windows VM on NVIDIA DGX Spark After Host Reboot (KVM + QEMU, GNU/Linux 6.11.0-1014-nvidia aarch64)

Description

A Windows virtual machine created on NVIDIA DGX Spark with KVM/QEMU does not start reliably after the DGX Spark host is rebooted.

After a host reboot, the Windows VM service shows non-deterministic startup behavior:

  • Sometimes the Windows VM starts successfully.
  • Sometimes it fails during the UEFI boot stage with one of the following messages.

Error Case 1


BdsDxe: loading Boot0001 "UEFI Misc Device" from PciRoot(0x0)/Pci(0x4,0x0)
BdsDxe: starting Boot0001 "UEFI Misc Device" from PciRoot(0x0)/Pci(0x4,0x0)

Error Case 2


BdsDxe: loading Boot0008 "Windows Boot Manager" from HD(1,GPT,75101D2F-2B98-494A-86EE-AB33E5F98446,0x800,0x64000)/\EFI\Microsoft\Boot\bootmgfw.efi
BdsDxe: starting Boot0008 "Windows Boot Manager" from HD(1,GPT,75101D2F-2B98-494A-86EE-AB33E5F98446,0x800,0x64000)/\EFI\Microsoft\Boot\bootmgfw.efi


VM Startup Script

#!/bin/bash
echo "Starting Windows VM at $(date)"

exec /usr/bin/qemu-system-aarch64 \
  -name windows-arm-vm \
  -M virt,gic-version=3 \
  -accel kvm \
  -cpu host \
  -smp 4 \
  -m 8G \
  \
  -drive file=/opt/vm_storage/QEMU_EFI.fd,if=pflash,format=raw,unit=0,readonly=on \
  -drive file=/opt/vm_storage/QEMU_VARS.fd,if=pflash,format=raw,unit=1 \
  \
  -device ramfb \
  -device virtio-keyboard \
  -device virtio-mouse \
  \
  -drive file=/opt/vm_storage/windows11_arm.img,if=virtio,format=qcow2 \
  \
  -nic user,model=virtio-net-pci \
  \
  -vnc :0
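
To compare a successful boot with a failed one, it may help to keep the firmware output of every start. The following is only a sketch: it builds the same argument list as the script above in a bash array and appends QEMU's `-serial file:` option, on the assumption that the UEFI firmware writes its BdsDxe messages to the first serial port (typical for the `virt` machine, but not verified on DGX Spark). `LOG_DIR` and `build_qemu_args` are hypothetical names, not part of the original setup.

```shell
#!/bin/bash
# Sketch only: LOG_DIR and build_qemu_args are hypothetical names.
LOG_DIR="${LOG_DIR:-/opt/vm_storage/logs}"   # assumed log location

# Fill the global array QEMU_ARGS with the options from the start script,
# plus a per-boot serial log file given as the first argument.
build_qemu_args() {
    local serial_log="$1"
    QEMU_ARGS=(
        -name windows-arm-vm
        -M virt,gic-version=3
        -accel kvm
        -cpu host
        -smp 4
        -m 8G
        -drive "file=/opt/vm_storage/QEMU_EFI.fd,if=pflash,format=raw,unit=0,readonly=on"
        -drive "file=/opt/vm_storage/QEMU_VARS.fd,if=pflash,format=raw,unit=1"
        -device ramfb
        -device virtio-keyboard
        -device virtio-mouse
        -drive "file=/opt/vm_storage/windows11_arm.img,if=virtio,format=qcow2"
        -nic user,model=virtio-net-pci
        -vnc :0
        # Redirect the first serial port to a file (assumed firmware console).
        -serial "file:${serial_log}"
    )
}

# Usage (left commented out so the sketch has no side effects when sourced):
# mkdir -p "$LOG_DIR"
# build_qemu_args "$LOG_DIR/serial-$(date +%Y%m%d-%H%M%S).log"
# exec /usr/bin/qemu-system-aarch64 "${QEMU_ARGS[@]}"
```

Keeping the options in an array avoids the trailing-backslash continuation lines and makes it easy to add or drop a single option while testing.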

Preflight Script

#!/bin/bash
set -e

echo "Windows VM preflight check starting..."

# Method 1: Hard wait for PCIe/NVMe training on Grace CPU (worst case)
echo "Hard wait 18 seconds for Grace CPU PCIe/NVMe training..."
sleep 18

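# Method 2: Ensure VM disk image is accessible; abort if it never becomes so
IMAGE="/opt/vm_storage/windows11_arm.img"
image_ok=0
for i in {1..30}; do
    if [ -r "$IMAGE" ] && [ -w "$IMAGE" ]; then
        echo "VM image file is accessible."
        image_ok=1
        break
    else
        echo "Image file not accessible yet, wait 2s... ($i/30)"
        sleep 2
    fi
done
if [ "$image_ok" -ne 1 ]; then
    echo "VM image file still not accessible after 30 attempts, aborting." >&2
    exit 1
fi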

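# Method 3: Attempt to refresh EFI boot entries to avoid stale "Misc Device"
# Note: efibootmgr operates on the host's EFI variables, not on the guest's
# QEMU_VARS.fd, and any failure of this command is silenced by "|| true".
efibootmgr --refresh >/dev/null 2>&1 || true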

echo "Preflight check passed, starting Windows VM in 3 seconds..."
sleep 3
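
The retry loop in the preflight script can also be factored into a small helper that reports failure through its exit status, so the caller decides whether to abort. This is a minimal sketch; `wait_for_file` and its parameters are hypothetical names, and the defaults simply mirror the 30 attempts and 2-second delay used above.

```shell
#!/bin/bash
# Sketch only: wait_for_file is a hypothetical helper.
# Usage: wait_for_file PATH [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as PATH is readable and writable, 1 if it never is.
wait_for_file() {
    local path="$1" attempts="${2:-30}" delay="${3:-2}" i
    for ((i = 1; i <= attempts; i++)); do
        if [ -r "$path" ] && [ -w "$path" ]; then
            return 0
        fi
        echo "Not accessible yet: $path ($i/$attempts)" >&2
        sleep "$delay"
    done
    return 1
}

# Example call (commented out so the sketch has no side effects when sourced):
# wait_for_file /opt/vm_storage/windows11_arm.img || exit 1
```

Because systemd treats a non-zero exit from `ExecStartPre=` as a failed start, returning 1 here would stop the VM from launching against a missing image instead of continuing silently.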

systemd Service File

[Unit]
Description=Windows 11 ARM64 VM
After=network-online.target local-fs.target
Wants=network-online.target
# Start rate limiting is a [Unit] setting; 0 disables it
StartLimitIntervalSec=0

[Service]
Type=simple
ExecStartPre=/opt/vm_storage/preflight.sh
ExecStart=/opt/vm_storage/start_vm.sh

# Critical parameters for DGX Spark
Restart=always
RestartSec=8
TimeoutStartSec=600

KillMode=process
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
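
Since the unit uses `Restart=always`, it can be useful to know whether a failed boot coincides with QEMU being relaunched or whether QEMU keeps running while the guest fails in UEFI. The start script prints "Starting Windows VM at ..." on every launch, so counting that line in the journal for the current boot gives the number of launches. This is a rough sketch: `count_vm_starts` is a hypothetical helper, and `windows-vm.service` is an assumed unit name, since the thread does not give the real one.

```shell
#!/bin/bash
# Sketch only: count_vm_starts is a hypothetical helper.
# Reads log text on stdin and prints how many times the start banner appears.
count_vm_starts() {
    # grep -c exits non-zero when the count is 0, so neutralize the status.
    grep -c 'Starting Windows VM at' || true
}

# Example call (unit name is an assumption; commented out on purpose):
# journalctl -b -u windows-vm.service -o cat | count_vm_starts
```

A count of 1 on a failed boot would suggest QEMU itself never exited, in which case the restart settings in the unit do not come into play.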

Are there any effective solutions to address this issue?

Experimentation is encouraged; just to clarify, virtualization is not officially supported on Spark DGX OS.

You won’t be able to share your GPU with the virtual machine, and I don’t think there is vGPU support for Spark (or that there ever will be).
