Has anyone been able to get Ostris' AI Toolkit running on DGX Spark?

As per the title, has anyone managed to get ai-toolkit running on DGX Spark?

Since the onnxruntime version that ai-toolkit normally uses doesn’t support CUDA 13, I had to compile the latest version myself and install it from source.

I have got the ai-toolkit UI running, but when I try and start a training job, nothing happens.

I’m pretty sure I saw someone comment that they’ve been using ai-toolkit since they got their Spark, but how they actually got it running is a mystery to me, as I’ve been at it all day with no luck. The console just says it has started the job with this ID (followed by a UUID), but in the UI I get no output, nor do I see any GPU activity. It’s supposed to say something like “Running 1 job” and then kick it off, but for me it just sits there doing nothing.

I’ll be watching this thread as I’d like to run AI Toolkit on my DGX Spark, too.

If it helps anyone, here are the instructions I wrote down for compiling and installing onnxruntime, since they don’t maintain a version for ARM64. I was thinking about posting a full tutorial once I got ai-toolkit running, but like I said, no luck, and I’ve been at it for a while now.

$ git clone --recursive https://github.com/Microsoft/onnxruntime.git

$ pip install cmake ninja packaging numpy

$ sudo apt-get -y install cudnn9-cuda-13

$ sh build.sh --config Release --build_dir build/cuda13 --parallel 20 --nvcc_threads 20 \
            --use_cuda --cuda_version 13.0 --cuda_home /usr/local/cuda-13.0/ \
            --cudnn_home /usr/local/cuda-13.0/ \
            --build_wheel --skip_tests \
            --cmake_generator Ninja \
            --use_binskim_compliant_compile_flags \
            --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=121 onnxruntime_BUILD_UNIT_TESTS=OFF

$ pip install build/cuda13/Release/dist/*.whl

This will get onnxruntime built for the DGX Spark and the resulting wheel installed by pip. This is the dependency that fails to install when you try to install the requirements for ai-toolkit. There is another flag, --enable_training, that I haven’t been able to pass successfully to the compile command; as soon as I include it, the compile fails, so that might be related to it not working. The problem is there’s no output I can reference, so it’s hard to tell where to go from here: the training job just never kicks off, and I don’t see any errors or anything else.
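If you want to verify the wheel actually went in with CUDA support, something like this should tell you (listing providers is standard onnxruntime API; which providers show up depends on your build flags):

```shell
# Sanity-check the freshly built wheel from the active environment.
# 'CUDAExecutionProvider' should appear in the list if the CUDA build worked.
python3 - <<'PY'
try:
    import onnxruntime as ort
    print(ort.__version__)
    print(ort.get_available_providers())  # look for 'CUDAExecutionProvider'
except ImportError:
    print("onnxruntime not installed in this environment")
PY
```

If only `CPUExecutionProvider` shows up, the wheel built but without CUDA, which points back at the `--use_cuda` / `--cuda_home` flags.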

I also messed around with several different versions of PyTorch to see if that could explain the issues I’m seeing, but again, couldn’t find one that works. The exact versions mentioned in the instructions top out at CUDA 12.8, but I think CUDA is somewhat backwards compatible, so that should still work (correct me if I’m wrong there). I also tried installing the latest version of PyTorch for CUDA 13, and again, no luck.


I haven’t spent too much additional time trying to get AI Toolkit running; with no output, it’s hard to tell what’s going wrong from here. I did message Ostris about how to debug what’s happening, but didn’t get a response. So I’ve been trying some alternatives.

Both diffusion-pipe and musubi-tuner work well on the DGX Spark. Not having a GUI, they’re harder to set up, but they do work once you get them running. If you’re wondering which is easier, it’s hard to say; in terms of pure setup I’d probably lean towards diffusion-pipe, though musubi-tuner has better documentation and some extra capabilities, like being able to train FramePack LoRAs (though I haven’t personally tried that type of fine-tune myself). Also worth noting that FramePack Studio, which is probably the best version of FramePack, doesn’t work on the DGX Spark, though I think that may be down to some customisations that were made, so it might be easy to make a version that does work.

I’ll let people know if there are any further developments, but for now, I’d suggest using diffusion-pipe or musubi-tuner on the Spark.

I’ve got it running, as in I can browse to the web interface. However, the job I’m trying to run is stuck on “Starting Job…” / “running”. Not sure if this is further than you’ve got, as it’s my first time using AI Toolkit.

No, that’s where I’m at as well… The problem is, it leaves no logs when you try to start a job, at least none that I could find, and not being familiar with the codebase, it’s hard to debug further without diving into the weeds, which I don’t have time for at the moment.

EDIT: I got a response from Ostris, who suggested that we try running the training directly via the CLI; he says that if it crashes straight away, the UI probably won’t log anything. So for now I’m going to look into that.

I attempted to run via the CLI (running “python run.py config/whatever_you_want.yml”), however I believe I’m having issues with a dependent package, “onnxruntime-gpu”, which I think is pulled in by one of the packages in the ai-toolkit requirements.txt. onnxruntime-gpu only supports x86-64, so I’m not even sure the CLI is possible with AI Toolkit on ARM64 systems.

Therefore I’m interested in the two other options you said work on the Spark; do you have any resources you’d recommend to help get those up and running?

Ostris was really helpful in answering my questions yesterday, and I actually have it running now, but it’s a bit messy. I’ll try to document the steps later, and will probably need some help determining which dependencies actually need to be installed, as I had to apt install a few things and not all of them might be needed.

There are some quirks and things I’m not crazy about, though. I tested it with Qwen-Image-Edit-2509, and it takes around 35 minutes to generate one sample image; I have no idea why, but that is an unusually long time. The other thing I’m not a fan of is the amount of system RAM it uses while training. On a normal box with separate RAM and VRAM this is probably fine, but for me it’s using around 40GB of system RAM, which means you’ve basically lost a third of the VRAM on your Spark. That’s not something I’ve observed with diffusion-pipe or musubi-tuner, which use around 2.5GB from memory, so I hope it can be fixed at some point.

There’s also something to be said for the hoops you have to jump through to get ai-toolkit running on the Spark (it has a ton of dependencies which doesn’t help), whereas I don’t recall having to do anything special for diffusion-pipe or musubi-tuner, they just worked.

Basic tutorial to get Ostris’ AI Toolkit running on DGX Spark:

Here are some extra dependencies you MIGHT need. I’m unsure which of these are actually required, as I was experimenting quite a lot; I suspect none of them are needed, but I’d appreciate it if someone could confirm which, if any, are required, and then I’ll update these instructions:

gfortran:
$ sudo apt install gfortran

OpenBLAS:
$ git clone https://github.com/OpenMathLib/OpenBLAS.git
$ cd OpenBLAS
$ make
$ sudo make install

liblapack and libblas:
$ sudo apt install liblapack-dev libblas-dev

Rust + Cargo:
$ sudo apt install cargo

Here are the steps to install AI Toolkit:

1) Install node
I’m not going to go into a huge amount of detail with this, you basically want to get the ARM64 version for Linux:

https://nodejs.org/dist/v24.11.1/node-v24.11.1-linux-arm64.tar.xz

Extract that somewhere and add it to your PATH; I simply added the following to my ~/.bashrc file:
export PATH="/opt/node-v24.11.1-linux-arm64/bin:$PATH"

2) Get Python 3.11 (miniconda recommended for this)
There are packages that require at least 3.10, and other packages that require a version lower than 3.12, so through experimentation I’ve concluded that it really only works with Python 3.11.
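If you want to double-check whatever interpreter you end up with, here’s a trivial sanity check (the >=3.10, <3.12 window is just my conclusion from the trial and error above):

```python
import sys

# ai-toolkit's dependency set appears to need Python >= 3.10 but < 3.12,
# which leaves 3.11 as the only version that satisfies everything.
version = sys.version_info[:2]
suitable = (3, 10) <= version < (3, 12)
print(f"Python {version[0]}.{version[1]} suitable for ai-toolkit: {suitable}")
```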

The easiest way to do this is to install miniconda:

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh
$ chmod u+x Miniconda3-latest-Linux-aarch64.sh
$ ./Miniconda3-latest-Linux-aarch64.sh

If you want to disable it loading the base environment by default (which I recommend), run:

$ conda config --set auto_activate_base false

Now you can create a Python 3.11 environment for ai-toolkit:

$ conda create --name ai-toolkit python=3.11

And then activate the environment:

$ conda activate ai-toolkit

3) Install PyTorch (make sure your conda environment from step 2 is activated)
To be honest, I’m not sure you need this exact version of PyTorch, but it’s what’s mentioned on the official AI Toolkit page. That version doesn’t have a CUDA 13.0 build available, just a CUDA 12.8 one, so that’s what we’re going to install here; I suspect the latest CUDA 13 build would likely work too:

$ pip3 install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128

4) Tweak the requirements.txt file
For some reason, the DGX Spark seems to have trouble figuring out which versions of some of the dependencies it needs to install, so we need to pin some of them. To do this, add these entries to the requirements.txt file:

scipy==1.16.0
tifffile==2025.6.11
imageio==2.37.0
scikit_image==0.25.2
clean_fid==0.1.35
pywavelets==1.9.0
contourpy==1.3.3
opencv_python_headless==4.11.0.86

Now remove:

git+https://github.com/jaretburkett/easy_dwpose.git

Based on my discussions with Ostris, easy_dwpose was used for auto-generating pose estimations for Flex.2, but he says he can make it optional if the import fails. I’ve only tested training Qwen-Image-Edit-2509, and that worked fine without it. This dependency pulls in a bunch of troublesome libraries like onnxruntime, so my recommendation is to skip it.
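The step 4 edits can be scripted if you prefer. A sketch (shown on a scratch copy so it’s safe to dry-run; point REQ at ai-toolkit’s real requirements.txt when you mean it):

```shell
# Demonstrated on a scratch file; set REQ=requirements.txt inside your
# ai-toolkit checkout to apply it for real.
REQ=$(mktemp)
printf '%s\n' 'torch' 'git+https://github.com/jaretburkett/easy_dwpose.git' 'diffusers' > "$REQ"

# Remove easy_dwpose (it pulls in onnxruntime, which has no ARM64 wheel)
sed -i '/easy_dwpose/d' "$REQ"

# Pin the versions the Spark has trouble resolving on its own
cat >> "$REQ" <<'EOF'
scipy==1.16.0
tifffile==2025.6.11
imageio==2.37.0
scikit_image==0.25.2
clean_fid==0.1.35
pywavelets==1.9.0
contourpy==1.3.3
opencv_python_headless==4.11.0.86
EOF

cat "$REQ"
```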

5) Install the requirements.txt

$ pip3 install -r requirements.txt

6) Compile and run the node UI

Go to your ai-toolkit/ui folder and run:

$ npm run build_and_start

If all went well, you’ll be able to access the UI and kick off training jobs. If you’re not getting output when starting a job, it’s most likely crashing before the process has started. The best way to debug these issues is via the CLI; the UI calls the CLI anyway, it just does it in the background. Set something up in the UI, go to the advanced config screen, copy and paste the config into a file like train.yaml, and then, with the virtual environment active, run:

$ python run.py path/to/train.yaml

Can someone please try these instructions and advise whether you needed any of the extra dependencies? I’d like to submit a PR to the official GitHub repo to support the Spark, and I can’t do that until I know which dependencies are needed. Again, I don’t think any of them are needed, but I need to know this for certain.

Thank you for your detailed instructions. I have followed them (without the extras listed at the top, gfortran and so on, as you weren’t sure whether those steps are necessary) and AI Toolkit works fine, although training WAN 2.2 5B TI2V is very, very slow.

The most important thing is that it works. I hope there is some way to improve its performance on the DGX Spark.

Thanks for letting me know. I wanted to edit my instructions to remove those extra dependencies, but it looks like I can no longer edit my post.

About it being slow: sample generation is quite slow, I have no idea why, but it can be disabled. Training speed itself seems fine for Qwen-Image-Edit; that said, it hogs a ton of system RAM. Training Qwen-Image-Edit, I found it left me with only about 10GB of free RAM if I don’t use quantisation. With diffusion-pipe and musubi-tuner, I was able to comfortably train and test my LoRAs at the same time with the FP8 models in ComfyUI, something that can’t be done when you only have 10GB of VRAM left.

I think it’s great to have AI Toolkit running, but right now it’s hard to recommend, at least for Qwen-Image-Edit, as we’re losing almost 40GB of shared RAM, which doesn’t seem to be an issue with the other trainers.

Awesome to see this. Here’s hoping Ostris can work some magic to make things faster. I do have to wonder if we just don’t notice the system RAM allocation on a normal PC without shared memory, assuming lots of people have 64GB+ RAM and then 24GB+ VRAM. I have certainly seen AI Toolkit offload 60GB+ into system RAM.

There’s nothing special about it running on the Spark vs anywhere else, so yes, I’d expect it to allocate the same amount of system RAM; unfortunately, on the Spark that’s equivalent to reducing your VRAM by the same amount. I have mentioned this to Ostris, but he hasn’t really commented on the RAM usage. As for the speed, it’s hard to comment, but it seemed okay to me on the surface, aside from sample generation.

I preferred to install AI Toolkit in a Docker environment to preserve a successful installation.
Thanks a lot for the description in this forum; with it, I was able to create this setup.

I was not successful with the nvcr.io/nvidia/pytorch images, because the older versions of PyTorch they ship lack support for this GPU. Therefore I started from plain ubuntu:22.04.

# Use Ubuntu as base image 
FROM ubuntu:22.04

# Avoid prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive

# ---------------------------------------------------------------------
#                       Setup Python 3.11
# ---------------------------------------------------------------------

# Install system dependencies including Python 3.11
RUN apt-get update && \
    apt-get install -y wget xz-utils git build-essential python3.11 python3.11-venv python3.11-dev python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Make python3.11 default python
RUN ln -sf /usr/bin/python3.11 /usr/bin/python3 && \
    ln -sf /usr/bin/python3.11 /usr/bin/python
    
# Install Node.js ARM64 binary

WORKDIR /opt

RUN wget https://nodejs.org/dist/v24.11.1/node-v24.11.1-linux-arm64.tar.xz && \
    tar -xf node-v24.11.1-linux-arm64.tar.xz && \
    rm node-v24.11.1-linux-arm64.tar.xz

# Add Node.js to PATH
ENV PATH="/opt/node-v24.11.1-linux-arm64/bin:${PATH}"

# ---------------------------------------------------------------------
#                       Build AI-toolkit
# ---------------------------------------------------------------------

WORKDIR /app

RUN git clone https://github.com/ostris/ai-toolkit.git 

WORKDIR /app/ai-toolkit

RUN pip install --no-cache-dir torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128

RUN apt update && apt install -y libgl1-mesa-glx libglib2.0-0

# Patch requirements.txt: drop easy_dwpose and pin troublesome versions
RUN sed -i -e '$a\' requirements.txt && \
    sed -i '/git+https:\/\/github.com\/jaretburkett\/easy_dwpose.git/d' requirements.txt && \
    echo 'numpy<2.0' >> requirements.txt && \
    echo 'pandas>=2.2.0' >> requirements.txt && \
    echo 'scipy==1.16.0' >> requirements.txt && \
    echo 'tifffile==2025.6.11' >> requirements.txt && \
    echo 'imageio==2.37.0' >> requirements.txt && \
    echo 'scikit_image==0.25.2' >> requirements.txt && \
    echo 'clean_fid==0.1.35' >> requirements.txt && \
    echo 'pywavelets==1.9.0' >> requirements.txt && \
    echo 'contourpy==1.3.3' >> requirements.txt && \
    echo 'opencv_python_headless==4.11.0.86' >> requirements.txt

# Install the requirements
RUN pip install -r requirements.txt

# ---------------------------------------------------------------------
#                       Build UI
# ---------------------------------------------------------------------

WORKDIR /app/ai-toolkit/ui

# Install NPM deps
RUN npm install

# Apply Prisma migrations
RUN npx prisma migrate deploy
RUN npx prisma migrate dev --name add_queue_table

# Build frontend assets
RUN npm run build

# ---------------------------------------------------------------------

# System update and curl installation for healthcheck
RUN apt-get update && \
    apt-get install -y curl && \
    rm -rf /var/lib/apt/lists/*
    
EXPOSE 8675

LABEL maintainer="ps" \
      description="AI Toolkit"

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8675/ || exit 1

CMD ["npm", "start" ]

# ---------------------------------------------------------------------
# Command to create the Docker Image
# docker build -f Dockerfile.inference -t ai_toolkit_cu128:arm64 .
# ---------------------------------------------------------------------


With this I was able to create a LoRA for Flux Dev, Z-Image, and Qwen.
LoRA training for Flux2 fails, because it is too big with the default settings, and then the DGX GUI froze.
However, Flux2 inference with ComfyUI on the DGX Spark is pushing the limits but working fine.

root@psdgx:/app/ai-toolkit/ui# nvidia-smi
Wed Dec  3 05:55:04 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0  On |                  N/A |
| N/A   72C    P0             70W /  N/A  | Not Supported          |     96%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          529168      C   python                                49841MiB |
+-----------------------------------------------------------------------------------------+

I am now running a dataset of 17 images (1280x1280) with 3000 steps for Z-Image, generating 2 sample images every 250 steps.
It needs around 34 sec/iteration, so it will take around 28h in total.
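That estimate is simple arithmetic (ignoring the extra time the periodic sample images add):

```python
# 3000 steps at roughly 34 seconds per iteration
steps = 3000
sec_per_iter = 34
total_hours = steps * sec_per_iter / 3600
print(f"~{total_hours:.1f} hours")  # ~28.3 hours
```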

Regards,
Peter


I followed the instructions and managed to get the UI working (haven’t had a chance to try it yet). But it’s worth noting that you do need to follow the start of the instructions from the AI Toolkit repo too; I was blindly following these without cloning the ai-toolkit directory, and installed things from totally the wrong requirements.txt. Dumb, certainly, but I’m sure other people might make the same mistake (not to mention AI parsing the page). You might want to create a definitive set of instructions.

The instructions I posted are definitive.

The instructions cover installing node, getting a conda environment up with Python 3.11, installing PyTorch, updating requirements.txt, installing requirements.txt, and compiling and starting the node-based UI. That’s everything needed to get ai-toolkit running.

I don’t know what start instructions or wrong requirements.txt you’re referring to, but it sounds like you’re doing something wrong. There is only one requirements.txt in the ai-toolkit repository, and that’s the only one you need to edit and install: add the dependencies I listed, remove easy_dwpose, and then pip install it. You’re just customising it before installing, since DGX OS seems to have a problem identifying some of the dependency versions it needs to install.

I’ve been experimenting a bit more, and I think I may have made a breakthrough, but this definitely needs more testing.

I switched from the recommended PyTorch version to the latest CUDA13 version, so instead of the command used in the tutorial above, I did this instead:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

With this, it’s significantly faster, and it seems to fix the problem with long sample generation times: mine dropped from 30-35 minutes to around 5 minutes, which is still long but manageable. My training times dropped from 30+ seconds per iteration to around 10-12 seconds per iteration.
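For what it’s worth, the rough speedups from that single test work out like this (taking 11 s as the midpoint of my 10-12 s range):

```python
# Before/after switching to the cu130 PyTorch wheels (one test run only,
# so treat these as indicative rather than benchmarks)
old_iter_s, new_iter_s = 30, 11          # seconds per training iteration
old_sample_min, new_sample_min = 35, 5   # minutes per sample image
print(f"training ~{old_iter_s / new_iter_s:.1f}x faster")        # ~2.7x
print(f"sampling ~{old_sample_min / new_sample_min:.1f}x faster")  # ~7.0x
```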

I’ve only done a single test, so maybe I’m completely wrong, but would appreciate if others can say what they found.

One thing that is a bit worrying is that I’ve been getting an error when attempting to train on the older version of PyTorch. I’m not sure if NVIDIA has rolled out an update that broke it or something, but I’m now struggling to get things working with the older version, so if you do have a working conda environment, I’d suggest creating a new one rather than messing with the one that works.


WOW! @RazielAU, that’s amazing! I don’t have exact numbers, but after installing a fresh environment with cu130, this time I also see the improvements (I started training a WAN 2.2 14B T2V LoRA and sampling is 2x faster than before; training also seems faster)! Sampling now takes similar times to ComfyUI, and it makes sense: NVIDIA’s instructions for installing ComfyUI also used the cu130 versions of these libraries.

Thank you so much for your contribution! Now using AI Toolkit seems much more feasible on DGX Spark than before.


No, you simply missed out the following:

git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit

Nothing exceptionally wrong lol! Just that there are some who will come here (especially AI scrapers) who will miss the fact that you actually need to clone the repository first…

It’s awesome what you’ve done, btw.