FSDP Training on DGX Spark GB10 UMA — from_pretrained loads full model on all ranks despite device_map="meta"

Hardware

  • 3x NVIDIA DGX Spark GB10 (Grace Blackwell, 128.5GB LPDDR5x UMA per node)
  • ConnectX-7 200Gb/s RoCE, MikroTik CRS812 switch
  • CUDA 13.0, PyTorch 2.10.0+cu130, Accelerate 1.12.0, Transformers 5.2.0

Model

  • Qwen3.5-35B-A3B (34.66B params, 256-expert MoE)

Problem

We are training with FSDP FULL_SHARD across 3 nodes. Expert-level wrapping is confirmed working: sum(p.numel()) shows 11.55B params per rank (the correct 3-way shard). However, actual memory usage is 75GB per node, where a 3-way shard of a 69GB model should need roughly 23-33GB.

The issue: AutoModelForCausalLM.from_pretrained() loads the full 69GB model on EVERY rank before FSDP can shard. On UMA, this means 69GB × 3 = 207GB consumed before sharding even begins.
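For reference, the expected figures follow from simple arithmetic. This is a minimal sketch assuming bf16 weights (2 bytes per parameter), decimal gigabytes, and an even split across ranks; the helper name shard_gb is illustrative only.

```python
def shard_gb(n_params: float, world_size: int, bytes_per_param: int = 2) -> float:
    """Per-rank size in GB (1 GB = 1e9 bytes) of an evenly sharded parameter set."""
    return n_params * bytes_per_param / world_size / 1e9

full_model_gb = shard_gb(34.66e9, world_size=1)  # unsharded bf16 weights
per_rank_gb = shard_gb(34.66e9, world_size=3)    # FULL_SHARD across 3 nodes

print(f"full model: {full_model_gb:.1f} GB")  # 69.3 GB
print(f"per rank:   {per_rank_gb:.1f} GB")    # 23.1 GB
```

The ~23GB lower bound covers parameters only. Gradients, optimizer state, and activations come on top, which is presumably where the upper end of the 23-33GB estimate comes from.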

What we’ve tried

1. device_map="meta" in from_pretrained

model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True,
    low_cpu_mem_usage=True, device_map="meta")

Result: TypeError: Parameter.__new__() got an unexpected keyword argument '_is_hf_initialized' — version mismatch between accelerate 1.12.0 and transformers 5.2.0.

2. FSDP_CPU_RAM_EFFICIENT_LOADING=true env var

Set in both the YAML config and as an environment variable. Expected Accelerate to intercept from_pretrained and use meta device on non-rank-0. Result: All ranks still show “Loading weights: 693 tensors” — full materialization on every rank.
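One thing worth ruling out here is whether the flag is actually visible inside every rank's process on every node at the moment from_pretrained runs. The sketch below only reports environment variables. It assumes, without having confirmed it for these exact versions, that ACCELERATE_USE_FSDP and FSDP_CPU_RAM_EFFICIENT_LOADING are the variables consulted, and that the check may also require the process group to be initialized (i.e. Accelerator() created) before the model is loaded. The helper names are hypothetical.

```python
import os

# Variables that Accelerate/Transformers are assumed to consult for
# RAM-efficient FSDP loading, plus the usual launcher rank variables.
FSDP_ENV_KEYS = (
    "ACCELERATE_USE_FSDP",
    "FSDP_CPU_RAM_EFFICIENT_LOADING",
    "FSDP_SYNC_MODULE_STATES",
    "RANK",
    "WORLD_SIZE",
)

def fsdp_env_report(environ=os.environ) -> dict:
    """Return the FSDP-related variables visible to this process."""
    return {key: environ.get(key, "<unset>") for key in FSDP_ENV_KEYS}

def looks_enabled(environ=os.environ) -> bool:
    """True if both loading flags hold a truthy string such as 'true' or '1'."""
    truthy = {"1", "true", "yes", "y", "on", "t"}
    return all(
        environ.get(key, "false").strip().lower() in truthy
        for key in ("ACCELERATE_USE_FSDP", "FSDP_CPU_RAM_EFFICIENT_LOADING")
    )

if __name__ == "__main__":
    print(fsdp_env_report(), "enabled:", looks_enabled())
```

Calling this at the top of the training script on each node, immediately before from_pretrained, shows whether a variable exported on the launching node reached the other two.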

3. Manual init_empty_weights() + from_config()

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

Result: Weight keys don’t match the checkpoint. The checkpoint uses the model.language_model.layers.* prefix from the VLM wrapper, whereas from_config produces model.layers.*.
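If the prefix is the only difference, the keys could in principle be renamed before loading. The following is a sketch under that assumption, which has not been verified against the real checkpoint; any vision-tower keys from the VLM wrapper would have no counterpart in the text-only model. The key names in the toy dictionary are illustrative, and remap_prefix is a hypothetical helper.

```python
def remap_prefix(state_dict: dict, old: str, new: str) -> dict:
    """Return a copy of state_dict with a leading `old` prefix replaced by `new`."""
    remapped = {}
    for key, value in state_dict.items():
        if key.startswith(old):
            key = new + key[len(old):]
        remapped[key] = value
    return remapped

# Toy stand-in for a checkpoint; real values would be tensors.
checkpoint = {
    "model.language_model.layers.0.self_attn.q_proj.weight": "w0",
    "model.language_model.norm.weight": "w1",
    "lm_head.weight": "w2",
}
remapped = remap_prefix(checkpoint, "model.language_model.", "model.")
print(sorted(remapped))
```

Keys that do not start with the old prefix pass through unchanged, so comparing set(remapped) against set(model.state_dict()) afterwards would show whether anything else differs.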

4. Asymmetric loading (rank 0 real, others meta)

if accelerator.is_main_process:
    model = AutoModelForCausalLM.from_pretrained(...)
else:
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(full_config, ...)

Result: FSDP sync_module_states hangs — module structure mismatch between ranks causes deadlock during broadcast.

What works

  • Expert-level FSDP wrapping: Adding Qwen3_5MoeExperts and Qwen3_5MoeSparseMoeBlock to the wrap policy correctly shards the MoE experts. Without this, FSDP wraps only at Qwen3_5MoeDecoderLayer and experts stay replicated.
  • Network: MTU 9000 end-to-end, IPv6 disabled, single CX7 half at 11.73 GB/s all_reduce.
  • FSDP prepare completes and reports correct param counts.
  • Forward pass works (15.6GB for batch=1 seq=512).

The specific question

On GB10 UMA, how do you make from_pretrained NOT load the full model on every rank? The standard mechanism (fsdp_cpu_ram_efficient_loading) doesn’t seem to work with Accelerate 1.12.0 + Transformers 5.2.0 + this model (Qwen3.5-35B-A3B with trust_remote_code=True).

Is there a recommended pattern for FSDP on DGX Spark GB10 that handles the shared CPU/GPU memory constraint?

Memory trace data

Pre-training baseline: 75.75GB used per rank (11.55B sharded params should be ~23GB)
Forward pass: +15.6GB (52.7→37.1GB free)
Backward pass: +52GB spike → OOM at 128.5GB limit
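
The trace can be replayed as a simple running budget. This sketch uses only the numbers above; the "ideal" case assumes the forward and backward deltas would stay the same with a ~23GB baseline, which is not guaranteed. The function name walk_budget is illustrative.

```python
TOTAL_GB = 128.5  # unified memory per node

# (phase, additional GB consumed), taken from the trace above
phases = [
    ("pre-training baseline", 75.75),
    ("forward pass", 15.6),
    ("backward pass", 52.0),
]

def walk_budget(total_gb, phases):
    """Yield (phase, used_gb, free_gb) after each phase is applied."""
    used = 0.0
    for name, delta in phases:
        used += delta
        yield name, used, total_gb - used

ledger = list(walk_budget(TOTAL_GB, phases))
for name, used, free in ledger:
    status = "OOM" if free < 0 else "ok"
    print(f"{name:<22} used={used:6.2f} GB  free={free:7.2f} GB  {status}")

# Same trace if the baseline matched the ~23 GB sharded-parameter estimate.
ideal = list(walk_budget(TOTAL_GB, [("baseline", 23.1)] + phases[1:]))
```

With the observed baseline the backward pass overshoots the limit by about 15GB; with a ~23GB baseline the same deltas would leave headroom. That is why the loading-phase overhead, rather than the backward spike itself, is the focus here.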

Versions

torch: 2.10.0+cu130
accelerate: 1.12.0
transformers: 5.2.0
NCCL: 2.28.9+cuda13.0
CUDA: 13.0
Driver: 580.95.05

UPDATE: Expert wrapping confirmed working, loading asymmetry remains

What we solved

  • Expert-level FSDP wrapping confirmed: adding Qwen3_5MoeExperts and Qwen3_5MoeSparseMoeBlock to the wrap policy gives correct 11.55B params per rank (3-way shard of 34.66B).
  • Without this, FSDP wraps only at Qwen3_5MoeDecoderLayer and the 256 experts stay replicated (76GB per node instead of 23GB).
  • Forward pass with sharding: 15.6GB consumed — correct for batch=1 seq=512.

What remains broken

Even with correct sharding, the pre-training baseline is still 75GB per node. The issue is the model loading phase: from_pretrained loads ~6GB per rank (memory-mapped safetensors), but after accelerator.prepare() (FSDP wrap + sync), each node shows 75GB used.

We believe FSDP’s sync_module_states broadcast is not properly freeing the pre-shard copy of the model. The sharded FlatParameters (11.55B params = ~23GB) are created, but the original tensors from the pre-wrap model remain in memory.
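
To test this hypothesis, node-wide memory can be sampled around each step. The sketch below parses /proc/meminfo, on the assumption that on UMA a single system-wide number is more informative than CUDA allocator statistics. Note that MemAvailable counts reclaimable page cache as available, so memory-mapped safetensors pages may not show up as "used" here. The helper names are hypothetical.

```python
import re

def meminfo_gb(text: str, field: str = "MemAvailable") -> float:
    """Extract one field (reported in kB, i.e. 1024 bytes) from /proc/meminfo text, in GB."""
    match = re.search(rf"^{re.escape(field)}:\s+(\d+)\s+kB", text, re.MULTILINE)
    if match is None:
        raise KeyError(field)
    return int(match.group(1)) * 1024 / 1e9

def read_available_gb(path: str = "/proc/meminfo") -> float:
    """Read the node-wide available memory; Linux only."""
    with open(path) as handle:
        return meminfo_gb(handle.read())

# Parsing a canned sample so the snippet runs anywhere.
sample = "MemTotal:       125000000 kB\nMemAvailable:    50000000 kB\n"
print(f"{meminfo_gb(sample):.1f} GB available")
```

Sampling read_available_gb() before from_pretrained, after it, after accelerator.prepare(), and once more after dropping every reference to the unwrapped model followed by gc.collect() and torch.cuda.empty_cache() would show at which step the ~50GB gap appears and whether it can be reclaimed.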

Monkey-patch for _is_hf_initialized

Transformers 5.x injects _is_hf_initialized as a kwarg into nn.Parameter.__new__ during init_empty_weights. Custom VLM models (trust_remote_code) reject this unexpected kwarg, causing Accelerate’s efficient loading to silently fall back to full loading on all ranks.

Monkey-patch applied:

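import torch
# Keep the original constructor so the wrapper can delegate to it.
_orig_param_new = torch.nn.Parameter.__new__
def _patched_param_new(cls, *args, **kwargs):
    # Drop the flag that nn.Parameter.__new__ does not accept.
    kwargs.pop('_is_hf_initialized', None)
    return _orig_param_new(cls, *args, **kwargs)
# Must run before any model is constructed.
torch.nn.Parameter.__new__ = staticmethod(_patched_param_new)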

We also tried device_map="meta" on non-rank-0 with the patch, but got RDMA errors because meta tensors can’t receive NCCL broadcasts.

Question for NVIDIA/community

On GB10 UMA with a custom VLM model (trust_remote_code=True, Qwen3.5-35B-A3B):

  1. Is there a recommended way to ensure only rank 0 loads weights and others get meta-initialized tensors for sync_module_states to broadcast into?
  2. Should we use torch.device("cpu") + from_config + p.data.zero_() on non-rank-0 (as suggested by some), or is there a cleaner path?
  3. After FSDP wrapping, what mechanism frees the pre-shard model tensors? On UMA, they stay in the shared pool and aren’t reclaimed.

Updated versions

  • accelerate: 1.13.0 (upgraded from 1.12.0)
  • transformers: 5.3.0 (upgraded from 5.2.0)
  • torch: 2.10.0+cu130
  • NCCL: 2.28.9+cuda13.0