Hey everyone, hitting a wall fine-tuning on a DGX Spark (GB10 / Blackwell sm_121) and desperately looking for a working tech stack.
I’m trying to run a 4-bit QLoRA on Qwen2.5-7B with a large-context dataset (~20k tokens/example).
The Problem: Training is crawling at 0.01 it/s. nvidia-smi shows 95% SM utilization but 0% memory bandwidth and abnormally low power draw (~75W instead of 150W+). It looks like a severe kernel stall on the new sm_121 architecture, causing a massive CPU bottleneck.
What I’ve Tried:
I originally thought this was an Unsloth / Triton compatibility issue. However, I have now tested the official PyTorch container and LLaMA-Factory, and I am getting the exact same 0% memory bandwidth problem across the board.
The Ask:
Has anyone successfully fine-tuned on a GB10?
What base container, PyTorch version, or overall tech stack actually gets memory bandwidth flowing on sm_121 without stalling?
Are there specific unified memory flags I’m missing for this hardware?
Thank you so much! I will check them out right now! And I wanted to ask about the input for the LLM
I am currently using this format
role"system"
content"You are an expert predicting …"
role"user"
content"Your task: Predict ALL… "
role"assistant"
content"{“…]}”
And they are about 18000 tokens, and I have for training 40.000 and for valid 2000
They are too big for fine tune? or can be how the GPU memory is used? because even though I see 95% used, is very slow, for a 270M model (gemma) It takes around 40 hours 3 epochs
I think something is wrong. In one of these tutorials I’ve fine tuned a 20B parameter model with all the messages of this forum at the time. Are you doing QLORA/LORA or Full fine tuning?
And I have tried PyTorch Containers, Unsloth, Llama Factory
And now I was trying NeMo but here I got this error with the official playbook (step 7) - error: torch was declared as an extra build dependency with match-runtime = true, but was not found in the resolution
(Automodel) root@d63c3942d426:/workspace/Automodel# pip install --no-deps -e .
Can you help me debug this problem or to check why is so slow
I was checking also the GPU usage 95%, bandwidth 0% (and with nvidia-smi GPU usage showed around 20GB used) -
And I have tried llama 3.1 8B, Qwen 2.5 8B and 3B and nothing worked, that’s why I thing I do not use the unified memory correctly, I tried all version for 2 weeks now..
I attached also a version to see what I used for one of the tries.
I want a solution to be able to fine tune a model, because I see in many forums that must take way less time than I took me.
Thank you so much for reaching out!
(Attachment llama_3_1_8B_skeleton_support.ipynb is missing)
Please DM me your notebook file(s) or share it here if there’s nothing private, but first execute the steps I asked for debug purposes. You’ve mentioned Qwen2.5-7B, but is attaching llama. I don’t get it.