Hardware
- 3x NVIDIA DGX Spark GB10 (Grace Blackwell, 128.5GB LPDDR5x UMA per node)
- ConnectX-7 200Gb/s RoCE, MikroTik CRS812 switch
- CUDA 13.0, PyTorch 2.10.0+cu130, Accelerate 1.12.0, Transformers 5.2.0
Model
- Qwen3.5-35B-A3B (34.66B params, 256-expert MoE)
Problem
Training via FSDP FULL_SHARD across 3 nodes. Expert-level wrapping is confirmed working: sum(p.numel()) reports 11.55B params per rank, which is the correct 3-way shard of 34.66B. However, actual memory usage is 75GB per node, where a 3-way shard of a 69GB bf16 model should need roughly 23-33GB.
The issue: AutoModelForCausalLM.from_pretrained() loads the full 69GB model on EVERY rank before FSDP can shard. Because CPU and GPU share one memory pool on UMA, each node spends 69GB of its 128.5GB on the unsharded copy (69GB × 3 = 207GB across the cluster) before sharding even begins.
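For reference, the expected shard size can be reproduced with a few lines of arithmetic. This is only a back-of-envelope sketch using the figures quoted in this post; it assumes bf16 weights (2 bytes per parameter) and counts parameters only, not gradients, optimizer state, or activations.

```python
# Back-of-envelope shard size, using the figures quoted in this post.
PARAMS = 34.66e9        # total parameters
BYTES_PER_PARAM = 2     # bfloat16 (assumption: weights only)
WORLD_SIZE = 3          # one rank per DGX Spark node

def gb(n_bytes: float) -> float:
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9

full_model_gb = gb(PARAMS * BYTES_PER_PARAM)
shard_gb = full_model_gb / WORLD_SIZE

print(f"full bf16 model: {full_model_gb:.1f} GB")
print(f"per-rank shard:  {shard_gb:.1f} GB")
```

The lower bound of the 23-33GB range corresponds to the parameter shard alone; anything above that is headroom for buffers and allocator overhead.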
What we’ve tried
1. device_map="meta" in from_pretrained
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.bfloat16, trust_remote_code=True,
        low_cpu_mem_usage=True, device_map="meta")
Result: TypeError: Parameter.__new__() got an unexpected keyword argument '_is_hf_initialized' — version mismatch between accelerate 1.12.0 and transformers 5.2.0.
2. FSDP_CPU_RAM_EFFICIENT_LOADING=true env var
Set in both the YAML config and as an environment variable. Expected Accelerate to intercept from_pretrained and use meta device on non-rank-0. Result: All ranks still show “Loading weights: 693 tensors” — full materialization on every rank.
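One thing that may be worth ruling out is whether the variable is actually visible inside every launched process. The sketch below is a stdlib-only diagnostic to run under the same launcher on each node. The helper names are hypothetical. The list of variables reflects how Transformers 4.x gated its rank-0-only loading path (both ACCELERATE_USE_FSDP and FSDP_CPU_RAM_EFFICIENT_LOADING had to be truthy); whether 5.2.0 still uses the same gate is an assumption, not something verified here.

```python
import os

# Env vars that Transformers 4.x consulted before taking the rank-0-only
# loading path. ASSUMPTION: 5.2.0 still uses the same gate.
GATE_VARS = ("ACCELERATE_USE_FSDP", "FSDP_CPU_RAM_EFFICIENT_LOADING")
TRUTHY = {"1", "true", "yes", "y", "on", "t"}

def efficient_loading_env(environ=os.environ) -> dict:
    """Return {var: bool} for each gate variable as this process sees it."""
    return {v: environ.get(v, "").strip().lower() in TRUTHY for v in GATE_VARS}

def report(rank: str, environ=os.environ) -> str:
    """Format a one-line status string for the given rank label."""
    flags = efficient_loading_env(environ)
    missing = [v for v, ok in flags.items() if not ok]
    status = "ok" if not missing else "missing/false: " + ", ".join(missing)
    return f"[rank {rank}] {status}"

if __name__ == "__main__":
    # RANK is set by torchrun / accelerate launch; "?" if run standalone.
    print(report(os.environ.get("RANK", "?")))
```

In 4.x the same gate also required torch.distributed to be initialized at the moment from_pretrained was called, so the order of Accelerator() construction and model loading might matter as well. That part is untested on this setup.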
3. Manual init_empty_weights() + from_config()
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
Result: Weight keys don’t match checkpoint (model uses model.language_model.layers.* prefix from VLM wrapper, from_config produces model.layers.*).
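A possible workaround for the prefix mismatch is to rename checkpoint keys before loading them into the from_config model. The following is a minimal sketch of that renaming, not something we have validated on the real checkpoint. The function name remap_keys is hypothetical, the sample keys are toy stand-ins, and only the two prefixes come from the error described above.

```python
def remap_keys(state_dict: dict, old_prefix: str, new_prefix: str) -> dict:
    """Return a copy of state_dict with old_prefix swapped for new_prefix.

    Keys that do not start with old_prefix are passed through unchanged.
    """
    remapped = {}
    for key, value in state_dict.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
        remapped[key] = value
    return remapped

# Toy stand-in for a checkpoint; real values would be tensors.
ckpt = {
    "model.language_model.layers.0.mlp.gate.weight": "w0",
    "model.language_model.embed_tokens.weight": "emb",
    "lm_head.weight": "head",
}
fixed = remap_keys(ckpt, "model.language_model.", "model.")
print(sorted(fixed))
```

Any keys that belong only to the VLM wrapper (for example a vision tower, if the checkpoint has one) would pass through unchanged and would still need to be dropped or tolerated at load time.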
4. Asymmetric loading (rank 0 real, others meta)
    if accelerator.is_main_process:
        model = AutoModelForCausalLM.from_pretrained(...)
    else:
        with init_empty_weights():
            model = AutoModelForCausalLM.from_config(full_config, ...)
Result: FSDP sync_module_states hangs — module structure mismatch between ranks causes deadlock during broadcast.
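To catch this kind of mismatch before it turns into a hang, each rank could hash its parameter names and shapes and compare the result across ranks ahead of FSDP wrapping. The sketch below shows the comparison logic only, using stdlib code and toy data. The function names are hypothetical and the example keys are illustrative.

```python
import hashlib

def structure_fingerprint(named_shapes) -> str:
    """Hash (name, shape) pairs so ranks can compare module structure cheaply."""
    h = hashlib.sha256()
    for name, shape in sorted(named_shapes):
        h.update(f"{name}:{tuple(shape)}\n".encode())
    return h.hexdigest()

def mismatched_ranks(fingerprints: list) -> list:
    """Given one fingerprint per rank, return ranks that differ from rank 0."""
    return [r for r, fp in enumerate(fingerprints) if fp != fingerprints[0]]

# Toy example: rank 1 is missing the VLM wrapper prefix.
rank0 = [("model.language_model.layers.0.w", (4, 4))]
rank1 = [("model.layers.0.w", (4, 4))]
fps = [
    structure_fingerprint(rank0),
    structure_fingerprint(rank1),
    structure_fingerprint(rank0),
]
print(mismatched_ranks(fps))  # prints [1]
```

In a real run the input would be [(n, p.shape) for n, p in model.named_parameters()], which also works for meta tensors, and the per-rank fingerprints could be collected with torch.distributed.all_gather_object before calling accelerator.prepare.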
What works
- Expert-level FSDP wrapping: Adding Qwen3_5MoeExperts and Qwen3_5MoeSparseMoeBlock to the wrap policy correctly shards the MoE experts. Without this, FSDP wraps only at Qwen3_5MoeDecoderLayer and experts stay replicated.
- Network: MTU 9000 end-to-end, IPv6 disabled, single CX7 half at 11.73 GB/s all_reduce.
- FSDP prepare completes and reports correct param counts.
- Forward pass works (15.6GB for batch=1 seq=512).
The specific question
On GB10 UMA, how do you make from_pretrained NOT load the full model on every rank? The standard mechanism (fsdp_cpu_ram_efficient_loading) doesn’t seem to work with Accelerate 1.12.0 + Transformers 5.2.0 + this model (Qwen3.5-35B-A3B with trust_remote_code=True).
Is there a recommended pattern for FSDP on DGX Spark GB10 that handles the shared CPU/GPU memory constraint?
Memory trace data
Pre-training baseline: 75.75GB used per rank (11.55B sharded bf16 params should be ~23GB)
Forward pass: +15.6GB (52.7→37.1GB free)
Backward pass: +52GB spike → OOM at 128.5GB limit
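The trace can be turned into a simple budget to show how much the inflated baseline matters. This is a rough sketch that takes the numbers above at face value; the ~23.1GB expected shard assumes bf16 weights only, and the variable names are ours.

```python
TOTAL_GB = 128.5          # UMA pool per node
BASELINE_GB = 75.75       # used before the first step
FORWARD_GB = 15.6         # forward-pass increase
BACKWARD_GB = 52.0        # backward-pass spike
EXPECTED_SHARD_GB = 23.1  # 11.55B params x 2 bytes (assumption: bf16 weights only)

free_after_load = TOTAL_GB - BASELINE_GB
free_after_forward = free_after_load - FORWARD_GB
shortfall = BACKWARD_GB - free_after_forward       # how far backward overshoots
excess_baseline = BASELINE_GB - EXPECTED_SHARD_GB  # memory beyond the expected shard

print(f"free after load:    {free_after_load:.2f} GB")
print(f"free after forward: {free_after_forward:.2f} GB")
print(f"backward shortfall: {shortfall:.2f} GB")
print(f"baseline excess:    {excess_baseline:.2f} GB")
```

On these numbers the backward pass overshoots by roughly 15GB, while the baseline holds roughly 53GB more than the expected shard, so reclaiming the unsharded copy would more than cover the gap if the backward spike itself stays the same.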
Versions
torch: 2.10.0+cu130
accelerate: 1.12.0
transformers: 5.2.0
NCCL: 2.28.9+cuda13.0
CUDA: 13.0
Driver: 580.95.05