Basic version info:
- python 3.12.0
- pytorch 2.5.0
- nvidia-modelopt 0.21.0
- cuda 12.6
I'm training a sparsified flux-dev model with accelerate.
This is the FSDP section of my accelerate config:
distributed_type: FSDP
fsdp_config:
fsdp_auto_wrap_policy: SIZE_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_forward_prefetch: true
fsdp_min_num_params: 1000000
fsdp_offload_params: true
fsdp_sharding_strategy: FULL_SHARD
fsdp_state_dict_type: SHARDED_STATE_DICT
fsdp_sync_module_states: true
fsdp_use_orig_params: true
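
For reference, this config should correspond roughly to the following programmatic setup (a sketch; `FullyShardedDataParallelPlugin` field names vary across accelerate versions, so the exact kwargs here are assumptions, not canonical):

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import BackwardPrefetch, ShardingStrategy, StateDictType

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    state_dict_type=StateDictType.SHARDED_STATE_DICT,
    auto_wrap_policy="size_based_wrap",  # wrap any submodule above min_num_params
    min_num_params=1_000_000,
    cpu_offload=True,
    use_orig_params=True,
    sync_module_states=True,
    forward_prefetch=True,
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```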
When I do the following (with `import modelopt.torch.opt as mto`):

flux = mto.restore(flux, sparse_ckpt)   # re-attach the sparsity state saved in the checkpoint
flux = accelerator.prepare_model(flux)  # FSDP-wrap the model with the config above
print(flux)
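
For context, `sparse_ckpt` was produced earlier with modelopt's sparsity API, roughly like this (a sketch; the actual mode and config may differ):

```python
import modelopt.torch.opt as mto
import modelopt.torch.sparsity as mts

# mts.sparsify converts Linear layers into dynamic modules carrying a
# _weight_mask buffer; mto.save records that modelopt state so that
# mto.restore can re-attach it to a freshly built model later.
flux = mts.sparsify(flux, mode="sparse_magnitude")
mto.save(flux, sparse_ckpt)
```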
### Error as follows:
[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/train_flux.py", line 447, in <module>
[rank1]:     main()
[rank1]:   File "/data/train_flux.py", line 165, in main
[rank1]:     print(dit)
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2943, in __repr__
[rank1]:     mod_str = repr(module)
[rank1]:               ^^^^^^^^^^^^
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2943, in __repr__
[rank1]:     mod_str = repr(module)
[rank1]:               ^^^^^^^^^^^^
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2937, in __repr__
[rank1]:     extra_repr = self.extra_repr()
[rank1]:                  ^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/modelopt/torch/opt/dynamic.py", line 861, in extra_repr
[rank1]:     val = getattr(self, name)
[rank1]:           ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/modelopt/torch/opt/dynamic.py", line 806, in __getattr__
[rank1]:     return manager.get_da_cb(name)(self, value)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/modelopt/torch/opt/dynamic.py", line 83, in __call__
[rank1]:     val = cb(self_module, val)
[rank1]:           ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/modelopt/torch/sparsity/module.py", line 35, in _get_weight
[rank1]:     masked_weight = weight * mod._weight_mask
[rank1]:                     ~~~~~~~~~~~~^
[rank1]: RuntimeError: The size of tensor a (0) must match the size of tensor b (64) at non-singleton dimension 1
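
The zero-sized tensor looks consistent with FSDP's FULL_SHARD behavior: outside of forward/backward the unsharded storage is freed, so `weight` can be a size-0 view while `_weight_mask` keeps its full 2-D shape. A standalone sketch of the same broadcasting failure (the 64x64 mask shape is an assumption for illustration):

```python
import torch

weight = torch.empty(0)    # a freed / sharded-away FSDP parameter view
mask = torch.ones(64, 64)  # a _weight_mask that kept its original 2-D shape
weight * mask              # RuntimeError: The size of tensor a (0) must match
                           # the size of tensor b (64) at non-singleton dimension 1
```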
### If I skip `print(flux)`, the error is as follows:
[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/train_flux.py", line 438, in <module>
[rank1]:     main()
[rank1]:   File "/data/train_flux.py", line 374, in main
[rank1]:     accelerator.backward(loss)
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/accelerate/accelerator.py", line 2196, in backward
[rank1]:     loss.backward(**kwargs)
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/_tensor.py", line 581, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 734, in _post_backward_hook
[rank1]:     handle._use_unsharded_grad_views()
[rank1]:   File "/root/miniforge3/envs/py312torch250/lib/python3.12/site-packages/torch/distributed/fsdp/_flat_param.py", line 1982, in _use_unsharded_grad_views
[rank1]:     hasattr(module, param_name),
[rank1]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/modelopt/torch/opt/dynamic.py", line 806, in __getattr__
[rank1]:     return manager.get_da_cb(name)(self, value)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/modelopt/torch/opt/dynamic.py", line 83, in __call__
[rank1]:     val = cb(self_module, val)
[rank1]:           ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/modelopt/torch/sparsity/module.py", line 34, in _get_weight
[rank1]:     masked_weight = weight * mod._weight_mask
[rank1]:                     ~~~~~~~~~~~~^
[rank1]: RuntimeError: The size of tensor a (2360064) must match the size of tensor b (3072) at non-singleton dimension 1
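
In this second trace, the failing size 2360064 equals 768 x 3072 + 768, which would be consistent with a 768x3072 weight plus a 768-element bias flattened into a single 1-D FlatParameter; `_get_weight` then multiplies that 1-D view by the original 2-D `_weight_mask`. A hedged way to check which shapes the dynamic modules actually see is to inspect them under `summon_full_params` (a public FSDP API), where the unsharded views should line up again:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Outside summon_full_params, the same attribute accesses reproduce the crash;
# inside it, weight and _weight_mask should both be full-size and 2-D.
with FSDP.summon_full_params(flux):
    for name, module in flux.named_modules():
        if hasattr(module, "_weight_mask"):
            print(name, module.weight.shape, module._weight_mask.shape)
```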