Hi everyone,
I’m currently using an NVIDIA DGX Spark to run multiple workloads in parallel, but in practice everything behaves almost sequentially, and I’m trying to understand how to improve real parallelism and overall throughput.
My current setup
At the same time, I usually run:
- 2–3 Python scripts using Ollama with text models
- 2 Python scripts generating images
- 2 Python scripts generating videos
For video generation, I use Flux2 + ComfyUI, launched in low VRAM mode:
python main.py --listen 0.0.0.0 \
--reserve-vram 4.0 \
--disable-cuda-malloc \
--lowvram \
--use-pytorch-cross-attention
For image generation, I explicitly serialize GPU access using an inter-process GPU lock to avoid OOMs:
# =========================================================
# GPU LOCK (inter-process, shared across scripts)
# =========================================================
import fcntl
from contextlib import contextmanager

GPU_LOCK_FILE = "/tmp/comfyui_gpu.lock"

@contextmanager
def gpu_lock(tag: str):
    # Each process opens the same lock file; flock() serializes them.
    with open(GPU_LOCK_FILE, "w") as f:
        print(f"[GPU LOCK] waiting ({tag}) …")
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            print(f"[GPU LOCK] acquired ({tag})")
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
            print(f"[GPU LOCK] released ({tag})")
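One variant I've been experimenting with is a non-blocking version of the same lock: instead of sleeping inside `flock()`, the process polls with `LOCK_NB` so it can interleave CPU-side work (prompt preprocessing, saving outputs) while another process holds the GPU. This is just a rough sketch; `gpu_lock_nb` and the poll interval are my own naming, not anything from ComfyUI:

```python
import fcntl
import time
from contextlib import contextmanager

GPU_LOCK_FILE = "/tmp/comfyui_gpu.lock"

@contextmanager
def gpu_lock_nb(tag: str, poll_s: float = 1.0):
    """Poll for the lock instead of blocking, so the caller can
    decide to do CPU-side work between attempts."""
    f = open(GPU_LOCK_FILE, "w")
    try:
        while True:
            try:
                # LOCK_NB raises BlockingIOError if another process holds the lock
                fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                time.sleep(poll_s)  # a real script could do CPU work here instead
        yield
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)
        f.close()
```

It doesn't change the fundamental serialization, but it keeps the waiting processes productive instead of parked.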
The problem
Even though I launch multiple scripts in parallel, GPU workloads appear to execute mostly sequentially:
- GPU utilization oscillates instead of staying high
- Video and image jobs tend to block each other
- Text inference (Ollama) also seems impacted when vision workloads are running
I understand why this happens (single GPU, VRAM pressure, CUDA context contention), but I’m looking for practical ways to improve concurrency, not just theory.
What I’m looking for
I’d appreciate feedback or real-world experience on:
- Better GPU scheduling strategies for mixed text / image / video workloads
- Whether CUDA MPS, multiple CUDA streams, or process-level isolation actually help in this kind of setup
- ComfyUI / Flux2-specific optimizations for concurrent runs
- Smarter alternatives to coarse GPU locks (priority queues, job batching, async pipelines, etc.)
- Any Spark-specific tuning that helps with parallel inference
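On the priority-queue point, here is the kind of thing I have in mind as a replacement for the coarse flock: a single dispatcher that pops jobs in priority order, so short text requests can jump ahead of a long video render instead of relying on flock's arbitrary wakeup order. All names (`GpuJobQueue`, the priority values) are mine, just to illustrate the shape:

```python
import heapq
import itertools
import threading

class GpuJobQueue:
    """Single dispatcher executes one GPU job at a time,
    but in priority order (lower number = higher priority)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break for equal priorities
        self._cv = threading.Condition()

    def submit(self, priority, fn, *args):
        """Enqueue fn(*args); returns (done_event, result_box)."""
        done = threading.Event()
        box = {}

        def wrapped():
            box["result"] = fn(*args)
            done.set()

        with self._cv:
            heapq.heappush(self._heap, (priority, next(self._counter), wrapped))
            self._cv.notify()
        return done, box

    def run_forever(self):
        while True:
            with self._cv:
                while not self._heap:
                    self._cv.wait()
                _, _, job = heapq.heappop(self._heap)
            job()  # the actual GPU work runs here, strictly one at a time
```

Each worker script would then `submit()` instead of grabbing the lock directly, e.g. text at priority 0, images at 5, video at 10, and block on the returned event.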
My goal is simple: accelerate content generation throughput, even if each individual job becomes slightly slower.
Thanks in advance for any insight or recommendations.