Unsloth 2048 RL training - very slow

Hello everyone,

Has anyone tried the tutorial from Unsloth https://unsloth.ai/docs/basics/fine-tuning-llms-with-nvidia-dgx-spark-and-unsloth ?

I’m having trouble building the Docker image on my machine (the xformers build from source fails). However, I managed to run the tutorial with this setup script instead:

uv venv
source .venv/bin/activate
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
uv pip install transformers peft datasets
uv pip install --no-deps unsloth unsloth-zoo
uv pip install --no-deps bitsandbytes
uv pip install --upgrade torchao
uv pip install --upgrade unsloth unsloth-zoo transformers
uv pip install xformers
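Whichever install path you use, it helps to snapshot the resolved package versions so runs can be compared later (presumably how a dependencies.txt like the one shared further down this thread was produced). A minimal sketch using plain pip; with uv, "uv pip freeze" works the same way:

```shell
# Record the exact resolved versions of the environment for later comparison
python -m pip freeze > dependencies.txt
# Quick look at the packages that matter for this tutorial (may print nothing
# if they are not installed in the current environment)
grep -iE 'torch|xformers|unsloth|transformers' dependencies.txt || true
```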

The article mentions a training time of around 4 hours, but on my machine I was only able to reduce it from 72 hours to roughly 24 hours by tweaking generation_length and the batch size. That is still far from the 4-hour target.
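To see how large the remaining gap is, a back-of-the-envelope check is useful: total hours are roughly steps times seconds-per-step. The step counts and timings below are hypothetical, not taken from the tutorial; the point is only that going from 24 h to 4 h at a fixed step count needs about a 6x per-step speedup:

```shell
# hours = steps * sec_per_step / 3600 (integer arithmetic, illustrative only)
est_hours() { echo $(( $1 * $2 / 3600 )); }
est_hours 900 96   # about 24 h at 96 s/step
est_hours 900 16   # about 4 h would need roughly 16 s/step, i.e. ~6x faster
```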

Has anyone else tried this tutorial? What kind of performance or training times are you getting?

Thanks in advance!

Could you try Unsloth Finetuning Playbook - Fine-tuning GPT-OSS-20B with GB10 Forum Data - #4?

Thanks for your answer.

I downloaded the Dockerfile (Dockerfile.dgx_spark) from your repo.

Then I launched docker build, but I got the same error:

ERROR: failed to build: failed to solve: process "/bin/sh -c git clone --depth=1 https://github.com/facebookresearch/xformers --recursive && cd xformers && export TORCH_CUDA_ARCH_LIST=\"12.1\" && python setup.py install && cd .." did not complete successfully: exit code: 1

Could you try rebuilding your Docker image on your machine? I suspect the problem comes from the xformers library, which is no longer compatible.
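If the incompatibility really is xformers HEAD drifting away from the pinned torch build, one workaround is pinning the clone to a release tag in the Dockerfile instead of building from the default branch. A sketch of the changed RUN step; the tag name v0.0.28 is a placeholder assumption, so check which xformers release matches your torch version:

```dockerfile
# Pin xformers to a tagged release instead of HEAD (v0.0.28 is a placeholder)
RUN git clone --depth=1 --branch v0.0.28 https://github.com/facebookresearch/xformers --recursive \
    && cd xformers \
    && export TORCH_CUDA_ARCH_LIST="12.1" \
    && python setup.py install \
    && cd ..
```

Since the setup script in this thread got a working environment from the prebuilt xformers wheel, installing that wheel in the image instead of building from source may also sidestep the failure entirely.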

Ok, these were the versions the last time I ran it:
dependencies.txt (6.0 KB)