Hi everyone,
I’m currently using an NVIDIA DGX Spark to run multiple workloads in parallel, but in practice everything behaves almost sequentially, and I’m trying to understand how to improve real parallelism and overall throughput.
My current setup
At the same time, I usually run:
- 2–3 Python scripts using Ollama with text models
- 2 Python scripts generating images
- 2 Python scripts generating videos
For video generation, I use Flux2 + ComfyUI, launched in low VRAM mode:
python main.py --listen 0.0.0.0 \
--reserve-vram 4.0 \
--disable-cuda-malloc \
--lowvram \
--use-pytorch-cross-attention
For image generation, I explicitly serialize GPU access using an inter-process GPU lock to avoid OOMs:
# =========================================================
# GPU LOCK (inter-process, shared across scripts)
# =========================================================
import fcntl
from contextlib import contextmanager

GPU_LOCK_FILE = "/tmp/comfyui_gpu.lock"

@contextmanager
def gpu_lock(tag: str):
    # Each process opens the same lock file; flock() serializes them.
    with open(GPU_LOCK_FILE, "w") as f:
        print(f"[GPU LOCK] waiting ({tag}) …")
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            print(f"[GPU LOCK] acquired ({tag})")
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
            print(f"[GPU LOCK] released ({tag})")
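One variant I've been experimenting with is a non-blocking version of the same lock: instead of sleeping inside `flock()`, the process polls with `LOCK_NB` so it can interleave CPU-side work (prompt preprocessing, saving outputs) while another process holds the GPU. This is just a rough sketch; `gpu_lock_nb` and the poll interval are my own naming, not anything from ComfyUI:

```python
import fcntl
import time
from contextlib import contextmanager

GPU_LOCK_FILE = "/tmp/comfyui_gpu.lock"

@contextmanager
def gpu_lock_nb(tag: str, poll_s: float = 1.0):
    """Poll for the lock instead of blocking, so the caller can
    decide to do CPU-side work between attempts."""
    f = open(GPU_LOCK_FILE, "w")
    try:
        while True:
            try:
                # LOCK_NB raises BlockingIOError if another process holds the lock
                fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                time.sleep(poll_s)  # a real script could do CPU work here instead
        yield
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)
        f.close()
```

It doesn't change the fundamental serialization, but it keeps the waiting processes productive instead of parked.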
The problem
Even though I launch multiple scripts in parallel, GPU workloads appear to execute mostly sequentially:
- GPU utilization oscillates instead of staying high
- Video and image jobs tend to block each other
- Text inference (Ollama) also seems impacted when vision workloads are running
I understand why this happens (single GPU, VRAM pressure, CUDA context contention), but I’m looking for practical ways to improve concurrency, not just theory.
What I’m looking for
I’d appreciate feedback or real-world experience on:
- Better GPU scheduling strategies for mixed text / image / video workloads
- Whether CUDA MPS, multiple CUDA streams, or process-level isolation actually help in this kind of setup
- ComfyUI / Flux2-specific optimizations for concurrent runs
- Smarter alternatives to coarse GPU locks (priority queues, job batching, async pipelines, etc.)
- Any Spark-specific tuning that helps with parallel inference
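On the priority-queue point, here is the kind of thing I have in mind as a replacement for the coarse flock: a single dispatcher that pops jobs in priority order, so short text requests can jump ahead of a long video render instead of relying on flock's arbitrary wakeup order. All names (`GpuJobQueue`, the priority values) are mine, just to illustrate the shape:

```python
import heapq
import itertools
import threading

class GpuJobQueue:
    """Single dispatcher executes one GPU job at a time,
    but in priority order (lower number = higher priority)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break for equal priorities
        self._cv = threading.Condition()

    def submit(self, priority, fn, *args):
        """Enqueue fn(*args); returns (done_event, result_box)."""
        done = threading.Event()
        box = {}

        def wrapped():
            box["result"] = fn(*args)
            done.set()

        with self._cv:
            heapq.heappush(self._heap, (priority, next(self._counter), wrapped))
            self._cv.notify()
        return done, box

    def run_forever(self):
        while True:
            with self._cv:
                while not self._heap:
                    self._cv.wait()
                _, _, job = heapq.heappop(self._heap)
            job()  # the actual GPU work runs here, strictly one at a time
```

Each worker script would then `submit()` instead of grabbing the lock directly, e.g. text at priority 0, images at 5, video at 10, and block on the returned event.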
My goal is simple: accelerate content generation throughput, even if each individual job becomes slightly slower.
Thanks in advance for any insight or recommendations.