Immich picture library (alternative to Google Photos / iCloud Photos)

Hey guys,

I came across a good alternative to Google Photos / iCloud Photos.

The nice thing is that the Immich app can be “split” into different parts:

The main server: the main Immich app runs on my Intel NUC; that's where the app and the database live (hosted in Docker containers inside a Proxmox LXC container on the NUC).

The machine learning server: this installation did not work out of the box because of the Linux/ARM64/CUDA 13 platform of the DGX Spark. It is hosted in a Docker container on the DGX Spark - see this discussion on GitHub.

The image and video files (hosted on my Synology NAS, connected as a mounted volume to the NUC): I keep my pictures on the Synology NAS and don't want to move them to the Spark, because the NAS has more storage and I need the Spark's SSD for LLMs.

The iOS / Android app

Everything from here on is AI-generated. It took a long time to figure out, and I wanted to share it with you guys, and also with my future self when I come back looking for this tutorial.

Immich Server Stack — Intel NUC (Proxmox LXC / Portainer)

Immich server stack running in a Debian LXC container on Proxmox, managed via Portainer.
Machine learning is offloaded to a separate NVIDIA DGX Spark
via MACHINE_LEARNING_URL.


Architecture

┌─────────────────────────────────────┐     ┌──────────────────────────────┐
│  Intel NUC  (Proxmox / Debian LXC)  │     │  NVIDIA DGX Spark (ARM64)    │
│                                     │     │                              │
│  immich-server  :2283               │────▶│  immich-machine-learning     │
│  immich-db      (postgres)          │     │  :2284  (CUDA, GB10 GPU)     │
│  immich-redis                       │     └──────────────────────────────┘
│                                     │
│  /mnt/Diskstation-*  (Synology NAS) │
└─────────────────────────────────────┘

Photos are served from a Synology NAS via bind mounts into the LXC container.


Prerequisites

Step 1 — Enable NFS on the Synology NAS

  1. Control Panel → File Services → NFS → enable NFS service (NFSv4.1 recommended)
  2. For each shared folder to expose:
    • Control Panel → Shared Folder → select folder → Edit → NFS Permissions → Create
    • Set Allowed IP / CIDR to your Proxmox host IP (e.g. 192.168.178.10) or subnet (192.168.178.0/24)
    • Privilege: Read/Write
    • Squash: No mapping (or Map root to admin if you have permission issues)
    • Note the Mount path shown — it will be something like /volume1/immich
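For reference, the permission rule created in the DSM GUI corresponds to a standard exports(5) entry on the NAS. The values below are illustrative placeholders, not copied from a real Synology box (Synology adds several of its own options):

```
/volume1/immich  192.168.178.0/24(rw,sync,no_root_squash)
```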

Step 2 — Mount NAS shares on the Proxmox host

Install NFS client on the Proxmox host (not inside the LXC):

apt-get install -y nfs-common

Create local mount points:

mkdir -p /mnt/Diskstation-immich
mkdir -p /mnt/Diskstation-Photos-USER_01
# ... repeat for each share

Add entries to /etc/fstab on the Proxmox host — use _netdev so the mounts wait for the
network to be up before mounting on boot:

# Synology NAS mounts — replace 192.168.178.x with your NAS IP
192.168.178.x:/volume1/immich        /mnt/Diskstation-immich        nfs  defaults,_netdev,nfsvers=4.1,soft,timeo=30  0  0
192.168.178.x:/volume1/Photos-USER_01  /mnt/Diskstation-Photos-USER_01  nfs  defaults,_netdev,nfsvers=4.1,soft,timeo=30  0  0
# ... repeat for each share

Test the mounts without rebooting:

mount -a
df -h | grep Diskstation

Tip — soft vs hard mounts: soft causes operations to fail gracefully if the NAS
is unreachable (avoids hanging processes). hard retries forever. For a photo library
soft is safer; for the upload volume you may prefer hard to avoid data loss on
a brief network hiccup.
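If you do choose hard for the upload volume, only the mount options change. A sketch, using the same placeholder IP as above (timeo=600 is the common default for hard NFS mounts):

```
# hard mount for the upload volume: I/O blocks and retries instead of
# erroring out during a brief NAS outage
192.168.178.x:/volume1/immich  /mnt/Diskstation-immich  nfs  defaults,_netdev,nfsvers=4.1,hard,timeo=600  0  0
```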


Step 3 — Pass mounts into the LXC container

Add bind mount entries to the LXC container config on Proxmox
(/etc/pve/lxc/<container-id>.conf):

mp0: /mnt/Diskstation-immich,mp=/mnt/Diskstation-immich
mp1: /mnt/Diskstation-Photos-USER_01,mp=/mnt/Diskstation-Photos-USER_01
mp2: /mnt/Diskstation-Photos-USER_02,mp=/mnt/Diskstation-Photos-USER_02
mp3: /mnt/Diskstation-Photos-USER_03,mp=/mnt/Diskstation-Photos-USER_03
mp4: /mnt/Diskstation-Photos-USER_04,mp=/mnt/Diskstation-Photos-USER_04

Restart the LXC container to apply the new mount points:

pct stop <container-id> && pct start <container-id>

Verify inside the container:

ls /mnt/Diskstation-immich

Boot order gotcha: The NFS mounts are on the Proxmox host, not inside the LXC.
If Proxmox boots and starts the LXC before the NFS mounts are ready, the bind mounts
will be empty directories. Use restart: unless-stopped (not on-failure) in
docker-compose so containers restart automatically once the mounts appear.
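One way to detect that situation is to check whether the bind-mount target actually has content before relying on it. A hypothetical pre-flight sketch (the function name and path are just examples):

```shell
#!/bin/sh
# check_mount: report whether a directory exists and is non-empty.
# An empty /mnt/Diskstation-* inside the LXC usually means the container
# started before the Proxmox host finished mounting the NFS shares.
check_mount() {
  dir="$1"
  if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "$dir: ok"
  else
    echo "$dir: empty or missing"
  fi
}

check_mount /mnt/Diskstation-immich
```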


Step 4 — Create required Immich upload subdirectories

Run once on the Proxmox host (or inside the LXC) after the NAS is mounted:

mkdir -p /mnt/Diskstation-immich/upload/{encoded-video,thumbs,backups,library,profile}
touch /mnt/Diskstation-immich/upload/encoded-video/.immich
touch /mnt/Diskstation-immich/upload/thumbs/.immich
touch /mnt/Diskstation-immich/upload/backups/.immich
touch /mnt/Diskstation-immich/upload/library/.immich
touch /mnt/Diskstation-immich/upload/profile/.immich

Step 5 — Machine learning URL

After deploying, set the ML URL in Immich admin UI:
Administration → System Settings → Machine Learning → URL
http://<dgx-spark-ip>:2284

(The env var MACHINE_LEARNING_URL sets the default, but the admin UI value takes precedence
once the server has started for the first time.)


docker-compose.yml

services:
  immich-redis:
    image: redis:7-alpine
    container_name: immich-redis
    hostname: immich-redis
    security_opt:
      - no-new-privileges:true
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    volumes:
      - immich-redis:/data
    restart: unless-stopped

  immich-db:
    image: ghcr.io/immich-app/postgres:16-vectorchord0.4.3-pgvectors0.2.0
    container_name: immich-db
    hostname: immich-db
    security_opt:
      - no-new-privileges:true
    healthcheck:
      test: ["CMD", "pg_isready", "-q", "-d", "immich", "-U", "immich-db-user"]
      interval: 10s
      timeout: 5s
      retries: 5
    shm_size: 128mb
    volumes:
      - immich-db:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: immich
      POSTGRES_USER: immich-db-user
      POSTGRES_PASSWORD: changeme          # ← set a strong password
      DB_STORAGE_TYPE: SSD
    restart: unless-stopped

  immich-server:
    image: ghcr.io/immich-app/immich-server:release
    container_name: immich-server
    hostname: immich-server
    security_opt:
      - no-new-privileges:true
    ports:
      - "2283:2283"
    volumes:
      - /mnt/Diskstation-immich:/usr/src/app/upload:rw
      # External photo libraries (Synology NAS bind mounts):
      - /mnt/Diskstation-Photos-USER_01:/usr/src/app/Diskstation-Photos-USER_01:rw
      - /mnt/Diskstation-Photos-USER_02:/usr/src/app/Diskstation-Photos-USER_02:rw
      - /mnt/Diskstation-Photos-USER_03:/usr/src/app/Diskstation-Photos-USER_03:rw
      - /mnt/Diskstation-Photos-USER_04:/usr/src/app/Diskstation-Photos-USER_04:rw
    environment:
      IMMICH_LOG_LEVEL: log
      DB_HOSTNAME: immich-db
      DB_PORT: '5432'
      DB_DATABASE_NAME: immich
      DB_USERNAME: immich-db-user
      DB_PASSWORD: changeme               # ← must match POSTGRES_PASSWORD above
      REDIS_HOSTNAME: immich-redis
      MACHINE_LEARNING_URL: http://192.168.178.8:2284   # ← DGX Spark IP
    restart: unless-stopped
    depends_on:
      immich-db:
        condition: service_healthy
      immich-redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:2283/api/server/ping"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 60s

volumes:
  immich-db:
  immich-redis:

External Libraries

For each NAS photo folder, add an External Library in Immich:

Administration → External Libraries → Create Library

Recommended exclusion patterns (Synology NAS):

Pattern              Reason
**/@eaDir/**         Synology extended attributes temp dir
**/._.*              macOS metadata files
**/#recycle/**       Synology recycle bin
**/#snapshot/**      Synology snapshots
**/.stversions/**    Syncthing versions
**/.stfolder/**      Syncthing folder marker
*.bat                Windows batch files
Thumbs.db            Windows thumbnail cache

Notes

  • restart: unless-stopped is required on all services. on-failure will not restart
    after a clean exit (e.g. on host reboot when NAS mounts aren’t ready yet), causing
    ENOTFOUND immich-db errors on next boot.
  • Immich v2.7.5+ uses internal port 2283 (previously 3001). The healthcheck URL
    must match: http://localhost:2283/api/server/ping.
  • The MACHINE_LEARNING_URL env var sets the default on first startup. If you change it
    later, update it in the admin UI — the database-stored value takes precedence over the
    env var after first boot.

Tested on: Intel NUC, Proxmox 8, Debian LXC, Immich v2.7.5

GPU Acceleration for NVIDIA DGX Spark GB10 on Immich (ARM64, Blackwell, CUDA 13.0)

Working fix for CUDAExecutionProvider not registering on the NVIDIA DGX Spark GB10 with ORT 1.24.4.
Based on the original work by @volschin.


Hardware & Environment

Device         NVIDIA DGX Spark
GPU            GB10 (Blackwell, SM_121, compute capability 12.1)
Architecture   ARM64 (aarch64)
CUDA           13.0
Driver         580.126.09
OS             Ubuntu 24.04
ORT            v1.24.4 (built from source)

Symptom

Running immich-machine-learning with DEVICE=cuda — all models silently fall back to CPU:

INFO  Setting execution providers to ['CPUExecutionProvider'], in descending order of preference

ort.get_available_providers() returns only ['AzureExecutionProvider', 'CPUExecutionProvider']
even though ort.get_all_providers() lists CUDAExecutionProvider as compiled in.

The ORT warning logged at startup:

[W:onnxruntime:Default, device_discovery.cc:211 DiscoverDevicesForPlatform]
GPU device discovery failed: device_discovery.cc:91 ReadFileContents
Failed to open file: "/sys/class/drm/card0/device/vendor"

Root Cause

Issue 1 — ORT 1.24.4 DRM sysfs device discovery

ORT 1.24.4 introduced hardware device discovery via DRM sysfs in
onnxruntime/core/platform/linux/device_discovery.cc.
GetGpuDevices() scans /sys/class/drm/cardN/device/vendor to build a hardware device list.
The CUDA EP only registers itself as available if it finds a device with vendor_id == 0x10de in that list.

The DGX Spark GB10 is a SoC/platform GPU — not a PCIe card.
DRM entries exist in /sys/class/drm/ (card0, card1, renderD128), but the device/vendor file
is not created because there is no PCI vendor ID for an SoC-integrated GPU.

The original code used ORT_RETURN_IF_ERROR(GetGpuDeviceFromSysfs(...)) inside the DRM loop.
When reading the vendor file failed, it immediately aborted GetGpuDevices() — the existing
PCI bus fallback (/sys/bus/pci/devices/) was never reached.

The GB10 also has no PCI bus entry (confirmed: find /sys/bus/pci/devices/ -name "vendor" -exec grep -il "10de" {} \; returns nothing), so the PCI fallback would also find nothing even if reached.

Issue 2 — Dual ORT installation shadowing

When uv sync --extra cpu is followed by uv pip install --reinstall onnxruntime_gpu-*.whl:

  • uv sync --extra cpu installs onnxruntime (CPU package), including
    libonnxruntime.so.1.24.1 and onnxruntime_pybind11_state.cpython-311-aarch64-linux-gnu.so
  • onnxruntime and onnxruntime-gpu are different package names in pip/uv —
    reinstalling one does not remove the other’s files
  • Python’s import system prefers the ABI-tagged .cpython-311-aarch64-linux-gnu.so extension
    over a plain .so, so the old unpatched CPU binary is loaded at runtime regardless of the
    GPU wheel being present
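The shadowing in the last bullet can be seen directly in the interpreter: CPython tries extension-module suffixes in order, and on Linux the ABI-tagged suffix comes before the bare `.so`. A small illustration, independent of Immich:

```python
import importlib.machinery

# The import system tries these suffixes in order when resolving an
# extension module; the ABI-tagged suffix is checked before the plain
# ".so", so a stale tagged CPU binary wins over a plain GPU ".so".
for suffix in importlib.machinery.EXTENSION_SUFFIXES:
    print(suffix)
```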

Fix

Three changes to Dockerfile.dgx-spark:

Patch 1 & 2 — device_discovery.cc

Applied via a Python RUN step before ./build.sh:

  • Fix 1 — Skip DRM cards with missing vendor files (continue) instead of aborting
    (ORT_RETURN_IF_ERROR). This allows gpu_devices to remain empty and reach the PCI fallback.
  • Fix 2 — After PCI fallback also finds nothing, check for /dev/nvidia0
    (present when NVIDIA_VISIBLE_DEVICES=all is set via the NVIDIA Container Runtime)
    and inject a synthetic OrtHardwareDevice{vendor_id=0x10de, type=GPU} so CUDA EP can register.

Patch 3 — Stage 2 wheel installation

Explicitly uninstall the CPU onnxruntime package before installing the GPU wheel,
so no old .so files shadow the new ones.


Result

[W] device_discovery.cc:283 GetGpuDevices] Skipping DRM card (no sysfs vendor info): ...
INFO  Loading detection model 'buffalo_l' to memory
INFO  Setting execution providers to ['CUDAExecutionProvider', 'CPUExecutionProvider'],
      in descending order of preference

nvidia-smi shows the ML workers using ~9 GB of GPU memory for the loaded models.


Files

Dockerfile.dgx-spark

# Dockerfile for DGX Spark (ARM64 + NVIDIA Blackwell GB10, CUDA 13.0)
# Builds onnxruntime-gpu from source since no arm64 PyPI wheel exists.
#
# Based on: https://github.com/volschin/immich/commit/fab7df3371d522f12a4b780b3c2b837f341b88bb
# Additional fixes for ORT 1.24.4 device_discovery.cc (SoC GPU, no PCI vendor in sysfs)
# and dual-package shadowing (onnxruntime CPU wheel overriding GPU wheel at runtime).
# See: https://github.com/immich-app/immich/discussions/10647

# ---------------------------------------------------------------------------
# Stage 1: Build onnxruntime-gpu from source for ARM64 + Blackwell (SM_121)
# ---------------------------------------------------------------------------
FROM nvidia/cuda:13.0.2-cudnn-devel-ubuntu24.04 AS builder-ort

# renovate: datasource=github-tags depName=microsoft/onnxruntime
ARG ORT_VERSION="v1.24.4"

# Ubuntu 24.04 ships Python 3.12; install 3.11 from deadsnakes PPA
RUN apt-get update && apt-get install -y --no-install-recommends \
    software-properties-common && \
    add-apt-repository -y ppa:deadsnakes/ppa && \
    apt-get update && apt-get install -y --no-install-recommends \
    python3.11 python3.11-dev python3.11-venv python3-pip \
    cmake git g++ && \
    rm -rf /var/lib/apt/lists/* && \
    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1

# ORT build needs numpy for Python::NumPy CMake target
RUN python3 -m pip install --break-system-packages --ignore-installed numpy packaging wheel setuptools

RUN git clone --depth 1 --branch ${ORT_VERSION} --recurse-submodules --shallow-submodules \
    https://github.com/microsoft/onnxruntime.git /onnxruntime

WORKDIR /onnxruntime

# Patch: set ARCHITECTURES_WITH_ACCEL to "100" "101" "122" (drops 90 and 120,
# avoiding the sm_120a build)
RUN sed -i 's/set(ARCHITECTURES_WITH_ACCEL "90" "100" "101" "120")/set(ARCHITECTURES_WITH_ACCEL "100" "101" "122")/' \
    cmake/CMakeLists.txt && \
    grep -r 'ARCHITECTURES_WITH_ACCEL' cmake/ | head -5

# Patch device_discovery.cc for DGX Spark GB10 (SoC GPU, no PCI vendor in sysfs).
#
# Problem: ORT 1.24.4 introduced hardware device discovery via DRM sysfs
# (/sys/class/drm/cardN/device/vendor). GetGpuDevices() calls
# GetGpuDeviceFromSysfs() per DRM card, which reads device/vendor.
# The GB10 is a SoC/platform GPU — DRM entries exist but no PCI vendor file
# is created. ORT_RETURN_IF_ERROR aborts GetGpuDevices() immediately, so the
# existing PCI fallback and any further logic is never reached and CUDA EP
# stays unregistered (get_available_providers() returns only CPU).
#
# Fix 1: Skip DRM cards with missing sysfs vendor files (continue instead of
#         abort). This lets gpu_devices stay empty so the PCI fallback runs.
# Fix 2: After PCI fallback also finds nothing (GB10 has no PCI bus entry),
#         synthesize an NVIDIA GPU device when /dev/nvidia0 exists.
#         vendor_id=0x10de + type=GPU is sufficient for CUDA EP to register.
RUN python3 - << 'PYEOF'
import sys

path = "onnxruntime/core/platform/linux/device_discovery.cc"
with open(path) as f:
    src = f.read()

# Fix 1: skip DRM cards that have no sysfs vendor file
old1 = (
    "  for (const auto& gpu_sysfs_path_info : gpu_sysfs_path_infos) {\n"
    "    OrtHardwareDevice gpu_device{};\n"
    "    ORT_RETURN_IF_ERROR(GetGpuDeviceFromSysfs(gpu_sysfs_path_info, gpu_device));\n"
    "    gpu_devices.emplace_back(std::move(gpu_device));\n"
    "  }"
)
new1 = (
    "  for (const auto& gpu_sysfs_path_info : gpu_sysfs_path_infos) {\n"
    "    OrtHardwareDevice gpu_device{};\n"
    "    auto sysfs_status = GetGpuDeviceFromSysfs(gpu_sysfs_path_info, gpu_device);\n"
    "    if (!sysfs_status.IsOK()) {\n"
    "      LOGS_DEFAULT(WARNING) << \"Skipping DRM card (no sysfs vendor info): \"\n"
    "                            << sysfs_status.ErrorMessage();\n"
    "      continue;\n"
    "    }\n"
    "    gpu_devices.emplace_back(std::move(gpu_device));\n"
    "  }"
)
if old1 not in src:
    print("ERROR: Fix-1 pattern not found — check ORT version", file=sys.stderr)
    sys.exit(1)
src = src.replace(old1, new1, 1)
print("Fix 1 applied: DRM skip-on-error")

# Fix 2: /dev/nvidia0 fallback after PCI scan also returns empty
nvidia_fallback = (
    "\n"
    "  // Fallback for SoC/platform GPUs (e.g. DGX Spark GB10) that have neither\n"
    "  // DRM sysfs vendor entries nor a PCI bus representation.\n"
    "  // If /dev/nvidia0 exists the NVIDIA kernel driver is present; synthesize a\n"
    "  // minimal OrtHardwareDevice so the CUDA EP can register itself.\n"
    "  if (gpu_devices.empty()) {\n"
    "    std::error_code _ec{};\n"
    "    if (fs::exists(\"/dev/nvidia0\", _ec)) {\n"
    "      LOGS_DEFAULT(WARNING) << \"/dev/nvidia0 found but no GPU in sysfs/PCI — \"\n"
    "                            << \"adding synthetic NVIDIA GPU device for SoC GPU.\";\n"
    "      OrtHardwareDevice nvidia_gpu{};\n"
    "      nvidia_gpu.vendor_id = 0x10de;\n"
    "      nvidia_gpu.type = OrtHardwareDeviceType_GPU;\n"
    "      gpu_devices.emplace_back(std::move(nvidia_gpu));\n"
    "    }\n"
    "  }\n"
)
anchor = "  gpu_devices_out = std::move(gpu_devices);\n  return Status::OK();\n}"
if anchor not in src:
    print("ERROR: Fix-2 anchor not found — check ORT version", file=sys.stderr)
    sys.exit(1)
src = src.replace(anchor, nvidia_fallback + anchor, 1)
print("Fix 2 applied: /dev/nvidia0 synthetic device fallback")

with open(path, "w") as f:
    f.write(src)
print("device_discovery.cc patched successfully")
PYEOF

RUN ./build.sh \
    --config Release \
    --build_wheel \
    --allow_running_as_root \
    --use_cuda \
    --cuda_home /usr/local/cuda \
    --cudnn_home /usr \
    --cuda_version 13.0 \
    --parallel \
    --cmake_extra_defines \
      CMAKE_CUDA_ARCHITECTURES=121 \
      onnxruntime_USE_FLASH_ATTENTION=OFF \
    --skip_tests

RUN mkdir /ort-wheel && cp build/Linux/Release/dist/onnxruntime_gpu-*.whl /ort-wheel/

# ---------------------------------------------------------------------------
# Stage 2: Install Immich ML Python deps + custom ORT wheel
# ---------------------------------------------------------------------------
FROM python:3.11-bookworm AS builder

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    VIRTUAL_ENV=/opt/venv

RUN apt-get update && apt-get install -y --no-install-recommends g++ && \
    rm -rf /var/lib/apt/lists/*

COPY --from=ghcr.io/astral-sh/uv:0.8.15@sha256:a5727064a0de127bdb7c9d3c1383f3a9ac307d9f2d8a391edc7896c54289ced0 /uv /uvx /bin/

RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=uv.lock,target=uv.lock \
    --mount=type=bind,source=pyproject.toml,target=pyproject.toml \
    uv sync --frozen --extra cpu --no-dev --no-editable --no-install-project --compile-bytecode --no-progress --active --link-mode copy

COPY --from=builder-ort /ort-wheel /ort-wheel
# Uninstall the CPU onnxruntime installed by uv sync before putting in our
# custom GPU wheel. onnxruntime and onnxruntime-gpu are different package
# names — --reinstall alone leaves the CPU .so files in place, and Python
# loads the old unpatched cpython extension instead of the GPU one.
RUN uv pip uninstall onnxruntime onnxruntime-gpu 2>/dev/null || true && \
    uv pip install --no-deps /ort-wheel/onnxruntime_gpu-*.whl && \
    rm -rf /ort-wheel

# ---------------------------------------------------------------------------
# Stage 3: Minimal production image
# ---------------------------------------------------------------------------
FROM nvidia/cuda:13.0.2-cudnn-runtime-ubuntu24.04 AS prod

COPY --from=builder /usr/local/bin/python3 /usr/local/bin/python3
COPY --from=builder /usr/local/bin/python3.11 /usr/local/bin/python3.11
COPY --from=builder /usr/local/lib/python3.11 /usr/local/lib/python3.11
COPY --from=builder /usr/local/lib/libpython3.11.so* /usr/local/lib/
RUN ldconfig

ENV LD_PRELOAD=/usr/lib/libmimalloc.so.2 \
    MACHINE_LEARNING_MODEL_ARENA=false

RUN apt-get update && \
    apt-get install -y --no-install-recommends tini ccache libgl1 libglib2.0-0 libgomp1 libmimalloc2.0 && \
    apt-get autoremove -yqq && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN ln -s "/usr/lib/$(arch)-linux-gnu/libmimalloc.so.2" /usr/lib/libmimalloc.so.2

WORKDIR /usr/src
ENV TRANSFORMERS_CACHE=/cache \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PATH="/opt/venv/bin:$PATH" \
    PYTHONPATH=/usr/src \
    DEVICE=cuda \
    VIRTUAL_ENV=/opt/venv \
    MACHINE_LEARNING_CACHE_FOLDER=/cache

RUN echo "hard core 0" >> /etc/security/limits.conf && \
    echo "fs.suid_dumpable 0" >> /etc/sysctl.conf && \
    echo 'ulimit -S -c 0 > /dev/null 2>&1' >> /etc/profile

COPY --from=builder /opt/venv /opt/venv
COPY scripts/healthcheck.py .
COPY immich_ml immich_ml

ENTRYPOINT ["tini", "--"]
CMD ["python", "-m", "immich_ml"]

HEALTHCHECK CMD python3 healthcheck.py

docker-compose.yml (ML service on the DGX Spark)

# Immich machine-learning — DGX Spark (ARM64, GB10, CUDA 13.0)
#
# Prerequisites:
#   1. Build the image (run from immich/machine-learning/):
#        docker build -f Dockerfile.dgx-spark -t immich-ml-dgx-spark:latest .
#      First build ~60-90 min (compiles ORT from source).
#      Subsequent rebuilds use cached Stage 1 and take ~2 min.
#
#   2. NVIDIA Container Toolkit installed and configured as default runtime:
#        sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
#        sudo systemctl restart docker
#
#   3. On the immich-server, set:
#        MACHINE_LEARNING_URL: http://<dgx-spark-ip>:2284

services:
  immich-machine-learning:
    image: immich-ml-dgx-spark:latest
    runtime: nvidia
    container_name: immich-machine-learning
    hostname: immich-machine-learning
    ports:
      - "2284:3003"
    security_opt:
      - no-new-privileges:true
    volumes:
      - immich-ml-cache:/cache
      - immich-ml-config:/.config
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: compute,utility
      MPLCONFIGDIR: /tmp/matplotlib
      DEVICE: cuda
      MACHINE_LEARNING_WORKERS: 1
      MACHINE_LEARNING_WORKER_THREADS: 4
      MACHINE_LEARNING_LOG_LEVEL: info
      MACHINE_LEARNING_DEVICE_IDS: "0"
    restart: unless-stopped

volumes:
  immich-ml-cache:
  immich-ml-config:

Upstream Fix

The two device_discovery.cc patches apply cleanly to ORT v1.24.4 — the pattern strings
are exact matches. They would need verification against other ORT versions.

The underlying ORT issue (no fallback for SoC/platform GPUs that lack DRM/PCI vendor entries)
should ideally be fixed upstream in
onnxruntime/core/platform/linux/device_discovery.cc.
The fix is straightforward: skip DRM cards that have no vendor file instead of aborting, and
add a CUDA runtime fallback for platform GPUs.


Tested on: NVIDIA DGX Spark GB10, Ubuntu 24.04 ARM64, CUDA 13.0, ORT v1.24.4, Immich v2.7.5
