Cross validation fails in Clara 4.0 GA, unable to find model.pt

Hello,

We are using Clara 4.0 GA. Our environment.json is defined as:

{
    "DATA_ROOT": "/workspace/data/",
    "DATASET_JSON": "/workspace/data/datalist.json",
    "PROCESSING_TASK": "segmentation",
    "MMAR_EVAL_OUTPUT_PATH": "eval",
    "MMAR_CKPT_DIR": "models",
    "MMAR_VAL_CKPT": "models/model.pt"
}

Our config_validation.json loads the checkpoint through this MMAR_VAL_CKPT path:

  {
    "name": "CheckpointLoader",
    "args": {
      "load_path": "{MMAR_VAL_CKPT}",
      "load_dict": ["model"]
    }
  },

Below are the client logs,

 2021-05-31 09:03:19,001 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
    2021-05-31 09:03:19,001 - ignite.engine.engine.SupervisedEvaluator - ERROR - Engine run is terminating due to exception: [Errno 2] No such file or directory: '/tmp/tmpn7w00of7/mmar/models/model.pt'
    2021-05-31 09:03:19,001 - ignite.engine.engine.SupervisedEvaluator - ERROR - Exception: [Errno 2] No such file or directory: '/tmp/tmpn7w00of7/mmar/models/model.pt'
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 737, in _internal_run
        self._fire_event(Events.STARTED)
      File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 424, in _fire_event
        func(*first, *(event_args + others), **kwargs)
      File "/opt/monai/monai/handlers/checkpoint_loader.py", line 92, in __call__
        checkpoint = torch.load(self.load_path, map_location=self.map_location)
      File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 579, in load
        with _open_file_like(f, 'rb') as opened_file:
      File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
        return _open_file(name_or_buffer, mode)
      File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 211, in __init__
        super(_open_file, self).__init__(open(name, mode))
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpn7w00of7/mmar/models/model.pt'
    Exception in executing validation: 
    2021-05-31 09:03:19,010 - ClientValidator - INFO - Do validation ...
    2021-05-31 09:03:19,010 - CrossSiteValManager - INFO - Exception in validating FL_global_model's model: local variable 'metrics' referenced before assignment
    2021-05-31 09:03:19,013 - FederatedClient - INFO - Submitting cross validation results to server.
    2021-05-31 09:03:19,032 - Communicator - INFO - Received comments:  Received Cross Validation results from org1-b.. SubmitCrossSiteValidationResults time: 0.017667293548583984 seconds
    2021-05-31 09:03:19,032 - FederatedClient - INFO - More models available with server.
    2021-05-31 09:03:19,034 - FederatedClient - INFO - Getting other models from server for cross validation.
    2021-05-31 09:03:22,793 - Communicator - INFO - Received 1 models for validation. GetValidationModels time: 3.7578117847442627 seconds
    ========== Validate Config Result ===========
    Use GPU:  True
    Multi GPU:  False
    Automatic Mixed Precision:  Disabled
    Determinism Evaluation:  Disabled
    cuDNN BenchMark:  False
    CUDA Matmul Allow TF32:  True
    cuDNN Allow TF32:  True
    Model:  <class 'lesion-activity-clara-fl.NewLesionsUNet.model.NewLesionsUNet'>
    Dataset:  <class 'monai.data.dataset.Dataset'>
    DataLoader:  <class 'monai.data.dataloader.DataLoader'>
    Validate Transform #1: <class 'monai.transforms.io.dictionary.LoadImaged'>
    Validate Transform #2: <class 'lesion-activity-clara-fl.NewLesionsUNet.pre_transform.GetMask'>
    Validate Transform #3: <class 'monai.transforms.intensity.dictionary.NormalizeIntensityd'>
    Validate Transform #4: <class 'monai.transforms.utility.dictionary.ToTensord'>
    Validate Handler #1: <class 'monai.handlers.stats_handler.StatsHandler'>
    Validate Handler #2: <class 'monai.handlers.checkpoint_loader.CheckpointLoader'>
    Validate Handler #3: <class 'monai.handlers.segmentation_saver.SegmentationSaver'>
    Validate Post Transforms #1: <class 'monai.transforms.post.dictionary.Activationsd'>
    Validate Post Transforms #2: <class 'monai.transforms.post.dictionary.AsDiscreted'>
    Validate Inferer:  <class 'monai.inferers.inferer.SlidingWindowInferer'>
    Validate Key Metric:  <class 'monai.handlers.mean_dice.MeanDice'>
    Validate Additional Metric #My own validation mean dice loss: <class 'lesion-activity-clara-fl.NewLesionsUNet.model.MyValLoss'>
    ========== End of Validate Config Result ===========
    2021-05-31 09:03:23,181 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
    2021-05-31 09:03:23,181 - ignite.engine.engine.SupervisedEvaluator - ERROR - Engine run is terminating due to exception: [Errno 2] No such file or directory: '/tmp/tmp8h3r67ma/mmar/models/model.pt'
    2021-05-31 09:03:23,181 - ignite.engine.engine.SupervisedEvaluator - ERROR - Exception: [Errno 2] No such file or directory: '/tmp/tmp8h3r67ma/mmar/models/model.pt'
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 737, in _internal_run
        self._fire_event(Events.STARTED)
      File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 424, in _fire_event
        func(*first, *(event_args + others), **kwargs)
      File "/opt/monai/monai/handlers/checkpoint_loader.py", line 92, in __call__
        checkpoint = torch.load(self.load_path, map_location=self.map_location)
      File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 579, in load
        with _open_file_like(f, 'rb') as opened_file:
      File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
        return _open_file(name_or_buffer, mode)
      File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 211, in __init__
        super(_open_file, self).__init__(open(name, mode))
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp8h3r67ma/mmar/models/model.pt'

On the client machines, after running the deploy adminMMAR client command, we can see mmar_org1-a and mmar_org1-b under the run folder. Both folders contain a models folder with model.pt present.

  1. Where is this error being thrown from?
  2. Please guide us on how we can resolve this.
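For question 1, the traceback itself points at the source: MONAI's CheckpointLoader calls torch.load on load_path, and the path in the error is the relative value models/model.pt joined onto a temporary MMAR directory (/tmp/tmpn7w00of7/mmar) rather than onto the deployed mmar_org1-a folder. A minimal sketch of a pre-flight check follows. It assumes, from the log only, that a relative MMAR_VAL_CKPT is resolved against whichever MMAR root is in use; resolve_ckpt, check_ckpt and the directory names are hypothetical helpers for illustration, not part of Clara.

```python
import os
import tempfile


def resolve_ckpt(mmar_root, ckpt_value):
    """Join a relative checkpoint value onto the MMAR root (assumed behaviour)."""
    if os.path.isabs(ckpt_value):
        return ckpt_value
    return os.path.join(mmar_root, ckpt_value)


def check_ckpt(mmar_root, ckpt_value):
    """Return (resolved_path, exists) so a missing file is caught before torch.load."""
    path = resolve_ckpt(mmar_root, ckpt_value)
    return path, os.path.isfile(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for the deployed MMAR under the run folder: model.pt is present.
        deployed = os.path.join(tmp, "run", "mmar_org1-a")
        os.makedirs(os.path.join(deployed, "models"))
        open(os.path.join(deployed, "models", "model.pt"), "wb").close()

        # Stand-in for the temporary MMAR seen in the cross-site validation log.
        temp_mmar = os.path.join(tmp, "tmpdir", "mmar")
        os.makedirs(temp_mmar)

        for root in (deployed, temp_mmar):
            path, exists = check_ckpt(root, "models/model.pt")
            print(path, "->", "found" if exists else "missing")
```

The same relative value is found under the deployed folder and missing under the temporary one, which matches the situation described above: the file exists on the client, just not at the path that validation resolves.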

Hi
Thanks for your interest in Clara Train SDK.

In V4 GA we made some changes to FL. Please make sure your MMARs have been updated to follow our MMARs from NGC or the example in the notebooks: clara-train-examples/config_train.json at master · NVIDIA/clara-train-examples · GitHub

As you can see there, there are new handlers to load a checkpoint and to save the model to a checkpoint:

"handlers": [
      {
        "name": "CheckpointLoader",
        "disabled": "{dont_load_ckpt_model}",
        "args": {
          "load_path": "{MMAR_CKPT}",
          "load_dict": ["model"]
        }
      },
      {
        "name": "CheckpointSaver",
        "rank": 0,
        "args": {
          "save_dir": "{MMAR_CKPT_DIR}",
          "save_dict": ["model", "optimizer", "lr_scheduler"],
          "save_final": true,
          "save_interval": 5
        }
      },

Hope this helps

Thanks aharouni,

What should the value of MMAR_CKPT be in environment.json? The example does not provide one.

The notebook sets MMAR_CKPT=models/${CONFIG_FILE_NAME::-5}/model.pt (see clara-train-examples/validate.sh at 5967cbbd051566596b6a5c363f06002ac9f234c5 · NVIDIA/clara-train-examples · GitHub).
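The ${CONFIG_FILE_NAME::-5} in that script is a Bash substring expansion that drops the last five characters of the variable, that is, the .json suffix. A small Python sketch of the same computation; mmar_ckpt_for is a hypothetical helper and the file name passed in is only an example:

```python
def mmar_ckpt_for(config_file_name):
    """Mirror MMAR_CKPT=models/${CONFIG_FILE_NAME::-5}/model.pt from validate.sh."""
    # [:-5] drops the last five characters, like Bash's ${VAR::-5}.
    return "models/" + config_file_name[:-5] + "/model.pt"


if __name__ == "__main__":
    print(mmar_ckpt_for("config_train.json"))  # models/config_train/model.pt
```

So the notebook value depends on which config file was used, whereas the environment.json in this thread points at models/model.pt directly.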

With the value "MMAR_CKPT": "models/model.pt", we receive the following error:

========== Train Config Result ===========
Num Epochs: 10
Use GPU: True
Multi GPU: False
Automatic Mixed Precision: Disabled
Determinism Training: Enabled
cuDNN BenchMark: False
CUDA Matmul Allow TF32: True
cuDNN Allow TF32: True
Model: <class 'lesion-activity-clara-fl.NewLesionsUNet.model.NewLesionsUNet'>
Loss: <class 'lesion-activity-clara-fl.NewLesionsUNet.model.MyDiCELoss'>
Optimizer: <class 'torch.optim.adam.Adam'>
LR Scheduler: <class 'NoneType'>
Train Dataset: <class 'lesion-activity-clara-fl.NewLesionsUNet.dataset.LongitudinalCroppingDataset'>
Train DataLoader: <class 'monai.data.dataloader.DataLoader'>
Train Transform #1: <class 'monai.transforms.utility.dictionary.ToTensord'>
Validate Dataset: <class 'monai.data.dataset.Dataset'>
Validate DataLoader: <class 'monai.data.dataloader.DataLoader'>
Validate Transform #1: <class 'monai.transforms.io.dictionary.LoadImaged'>
Validate Transform #2: <class 'lesion-activity-clara-fl.NewLesionsUNet.pre_transform.GetMask'>
Validate Transform #3: <class 'monai.transforms.intensity.dictionary.NormalizeIntensityd'>
Validate Transform #4: <class 'monai.transforms.utility.dictionary.ToTensord'>
Train Handler #1: <class 'monai.handlers.validation_handler.ValidationHandler'>
Train Handler #2: <class 'monai.handlers.checkpoint_saver.CheckpointSaver'>
Train Handler #3: <class 'monai.handlers.stats_handler.StatsHandler'>
Train Handler #4: <class 'monai.handlers.tensorboard_handlers.TensorBoardStatsHandler'>
Validate Handler #1: <class 'monai.handlers.stats_handler.StatsHandler'>
Validate Handler #2: <class 'monai.handlers.checkpoint_loader.CheckpointLoader'>
Validate Handler #3: <class 'monai.handlers.tensorboard_handlers.TensorBoardStatsHandler'>
Validate Handler #4: <class 'monai.handlers.checkpoint_saver.CheckpointSaver'>
Validate Post Transforms #1: <class 'monai.transforms.post.dictionary.Activationsd'>
Validate Post Transforms #2: <class 'monai.transforms.post.dictionary.AsDiscreted'>
Validate Inferer: <class 'monai.inferers.inferer.SlidingWindowInferer'>
Validate Key Metric: <class 'monai.handlers.mean_dice.MeanDice'>
Validate Additional Metric #My own validation mean dice loss: <class 'lesion-activity-clara-fl.NewLesionsUNet.model.MyValLoss'>
Train Inferer: <class 'monai.inferers.inferer.SimpleInferer'>
========== End of Train Config Result ===========
2021-06-02 00:04:23,791 - FederatedClient - INFO - Starting to fetch global model.
2021-06-02 00:04:25,171 - Communicator - INFO - Received lesion_activity model at round 0 (19259350 Bytes). GetModel time: 1.372786283493042 seconds
Get global model for round: 0
pull_models completed. Status:True rank:0
2021-06-02 00:04:25,179 - ClientTrainer - INFO - ClientTrainer abort signal: False
2021-06-02 00:04:25,179 - AssignVariables - INFO - Vars 148 of 0 assigned.
2021-06-02 00:04:25,180 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2021-06-02 00:04:25,210 - ignite.engine.engine.SupervisedEvaluator - ERROR - Engine run is terminating due to exception: 'list' object has no attribute 'items'
2021-06-02 00:04:25,210 - ignite.engine.engine.SupervisedEvaluator - ERROR - Exception: 'list' object has no attribute 'items'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 737, in _internal_run
    self._fire_event(Events.STARTED)
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 424, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/opt/monai/monai/handlers/checkpoint_loader.py", line 110, in __call__
    Checkpoint.load_objects(to_load=self.load_dict, checkpoint=checkpoint, strict=self.strict)
  File "/opt/conda/lib/python3.8/site-packages/ignite/handlers/checkpoint.py", line 542, in load_objects
    Checkpoint._check_objects(to_load, "load_state_dict")
  File "/opt/conda/lib/python3.8/site-packages/ignite/handlers/checkpoint.py", line 497, in _check_objects
    for k, obj in objs.items():
AttributeError: 'list' object has no attribute 'items'
Send model to server.
2021-06-02 00:04:25,214 - FederatedClient - INFO - Starting to push model.
Traceback (most recent call last):
  File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client.py", line 127, in federated_step
  File "apps/fed_learn/trainers/client_trainer.py", line 108, in train
  File "apps/fed_learn/trainers/supervised_fitter.py", line 69, in fit
  File "/opt/monai/monai/engines/evaluator.py", line 120, in run
    super().run()
  File "/opt/monai/monai/engines/workflow.py", line 206, in run
    super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 702, in run
    return self._internal_run()
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 775, in _internal_run
    self._handle_exception(e)
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 424, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/opt/monai/monai/handlers/stats_handler.py", line 145, in exception_raised
    raise e
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 737, in _internal_run
    self._fire_event(Events.STARTED)
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 424, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/opt/monai/monai/handlers/checkpoint_loader.py", line 110, in __call__
    Checkpoint.load_objects(to_load=self.load_dict, checkpoint=checkpoint, strict=self.strict)
  File "/opt/conda/lib/python3.8/site-packages/ignite/handlers/checkpoint.py", line 542, in load_objects
    Checkpoint._check_objects(to_load, "load_state_dict")
  File "/opt/conda/lib/python3.8/site-packages/ignite/handlers/checkpoint.py", line 497, in _check_objects
    for k, obj in objs.items():
AttributeError: 'list' object has no attribute 'items'
Traceback (most recent call last):
  File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client.py", line 229, in admin_run
  File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client.py", line 178, in run_federated_steps
  File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client.py", line 135, in federated_step
  File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client_base.py", line 217, in push_models
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client_base.py", line 161, in push_remote_model
  File "<nvflare-0.1.4>/nvflare/private/fed/client/communicator.py", line 292, in submitUpdate
  File "<nvflare-0.1.4>/nvflare/private/fed/client/data_assembler.py", line 33, in get_contribution_data
  File "<nvflare-0.1.4>/nvflare/private/fed/client/client_model_manager.py", line 103, in read_current_model
TypeError: argument of type 'NoneType' is not iterable
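The last frames of the first traceback show why the message mentions a list: ignite's Checkpoint._check_objects iterates objs.items(), so it expects a mapping from names to objects, while here it apparently received the list from "load_dict": ["model"] unchanged. A minimal sketch that reproduces only that failure mode follows. check_objects is a hypothetical stand-in modelled on the single line visible in the traceback, not ignite's actual code, and DummyNet is a placeholder for the real network.

```python
def check_objects(objs, attr):
    """Stand-in for the loop shown in the traceback: every value must expose `attr`."""
    for name, obj in objs.items():
        if not hasattr(obj, attr):
            raise TypeError(f"Object {name!r} has no {attr} method")


class DummyNet:
    """Placeholder for the real network; only the method name matters here."""

    def load_state_dict(self, state):
        self.state = state


if __name__ == "__main__":
    # A mapping of name -> object passes the check.
    check_objects({"model": DummyNet()}, "load_state_dict")

    # A bare list of names fails in the same way as the log above.
    try:
        check_objects(["model"], "load_state_dict")
    except AttributeError as err:
        print(err)  # 'list' object has no attribute 'items'
```

In plain MONAI, CheckpointLoader takes load_dict as such a mapping (for example {"model": net}). How Clara turns the list from the JSON config into that mapping is not shown in this thread, so the sketch only illustrates what the handler ended up receiving.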

Hi

The MMAR structure should be models/${CONFIG_FILE_NAME::-5}/model.pt.
The notebooks change this to show many different cases within the same MMAR. However, FL may need slightly different settings. Could you try "load_path": "{MMAR_VAL_CKPT}", as in:

{
  "name": "CheckpointLoader",
  "disabled": "{dont_load_ckpt_model}",
  "args": {
    "load_path": "{MMAR_VAL_CKPT}",
    "load_dict": ["model"]
  }
},

Please let us know if you still get an error

We still receive the following error: Exception: 'list' object has no attribute 'items'

 training child process ID: 389
    starting the client .....
    token is: 09ea2154-b2ba-46c3-aba3-aa8f81396b96 run_number is: 918 uid: org1-b listen_port: 58599
    2021-06-02 23:33:55,247 - matplotlib - WARNING - Matplotlib created a temporary config/cache directory at /tmp/matplotlib-1bsafq55 because the default path (/home/siddharth/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
    2021-06-02 23:33:55,292 - matplotlib.font_manager - INFO - Generating new fontManager, this may take some time...
    2021-06-02 23:33:55,386 - ProcessExecutor - INFO - waiting for process to finish
    Created the listener on port: 58599
    2021-06-02 23:33:58,542 - torch.distributed.nn.jit.instantiator - INFO - Created a temporary directory at /tmp/tmpv1klv4lo
    2021-06-02 23:33:58,542 - torch.distributed.nn.jit.instantiator - INFO - Writing /tmp/tmpv1klv4lo/_remote_module_non_sriptable.py
    no max_epochs specified for SupervisedTrainer args, use global var 'epochs: 10'.
    ========== Train Config Result ===========
    Num Epochs:  10
    Use GPU:  True
    Multi GPU:  False
    Automatic Mixed Precision:  Disabled
    Determinism Training:  Enabled
    cuDNN BenchMark:  False
    CUDA Matmul Allow TF32:  True
    cuDNN Allow TF32:  True
    Model:  <class 'lesion-activity-clara-fl.NewLesionsUNet.model.NewLesionsUNet'>
    Loss:  <class 'lesion-activity-clara-fl.NewLesionsUNet.model.MyDiCELoss'>
    Optimizer:  <class 'torch.optim.adam.Adam'>
    LR Scheduler:  <class 'NoneType'>
    Train Dataset:  <class 'lesion-activity-clara-fl.NewLesionsUNet.dataset.LongitudinalCroppingDataset'>
    Train DataLoader:  <class 'monai.data.dataloader.DataLoader'>
    Train Transform #1: <class 'monai.transforms.utility.dictionary.ToTensord'>
    Validate Dataset:  <class 'monai.data.dataset.Dataset'>
    Validate DataLoader:  <class 'monai.data.dataloader.DataLoader'>
    Validate Transform #1: <class 'monai.transforms.io.dictionary.LoadImaged'>
    Validate Transform #2: <class 'lesion-activity-clara-fl.NewLesionsUNet.pre_transform.GetMask'>
    Validate Transform #3: <class 'monai.transforms.intensity.dictionary.NormalizeIntensityd'>
    Validate Transform #4: <class 'monai.transforms.utility.dictionary.ToTensord'>
    Train Handler #1: <class 'monai.handlers.validation_handler.ValidationHandler'>
    Train Handler #2: <class 'monai.handlers.checkpoint_saver.CheckpointSaver'>
    Train Handler #3: <class 'monai.handlers.stats_handler.StatsHandler'>
    Train Handler #4: <class 'monai.handlers.tensorboard_handlers.TensorBoardStatsHandler'>
    Validate Handler #1: <class 'monai.handlers.stats_handler.StatsHandler'>
    Validate Handler #2: <class 'monai.handlers.checkpoint_loader.CheckpointLoader'>
    Validate Handler #3: <class 'monai.handlers.tensorboard_handlers.TensorBoardStatsHandler'>
    Validate Handler #4: <class 'monai.handlers.checkpoint_saver.CheckpointSaver'>
    Validate Post Transforms #1: <class 'monai.transforms.post.dictionary.Activationsd'>
    Validate Post Transforms #2: <class 'monai.transforms.post.dictionary.AsDiscreted'>
    Validate Inferer:  <class 'monai.inferers.inferer.SlidingWindowInferer'>
    Validate Key Metric:  <class 'monai.handlers.mean_dice.MeanDice'>
    Validate Additional Metric #My own validation mean dice loss: <class 'lesion-activity-clara-fl.NewLesionsUNet.model.MyValLoss'>
    Train Inferer:  <class 'monai.inferers.inferer.SimpleInferer'>
    ========== End of Train Config Result ===========
    2021-06-02 23:35:20,093 - FederatedClient - INFO - Starting to fetch global model.
    2021-06-02 23:35:23,006 - Communicator - INFO - Received lesion_activity model at round 0 (19259350 Bytes). GetModel time: 2.9035820960998535 seconds
    Get global model for round: 0
    pull_models completed. Status:True rank:0
    2021-06-02 23:35:23,023 - ClientTrainer - INFO - ClientTrainer abort signal: False
    2021-06-02 23:35:23,024 - AssignVariables - INFO - Vars 148 of 0 assigned.
    2021-06-02 23:35:23,026 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
    2021-06-02 23:35:23,159 - ignite.engine.engine.SupervisedEvaluator - ERROR - Engine run is terminating due to exception: 'list' object has no attribute 'items'
    2021-06-02 23:35:23,159 - ignite.engine.engine.SupervisedEvaluator - ERROR - Exception: 'list' object has no attribute 'items'
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 737, in _internal_run
    self._fire_event(Events.STARTED)
      File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 424, in _fire_event
    func(*first, *(event_args + others), **kwargs)
      File "/opt/monai/monai/handlers/checkpoint_loader.py", line 110, in __call__
    Checkpoint.load_objects(to_load=self.load_dict, checkpoint=checkpoint, strict=self.strict)
      File "/opt/conda/lib/python3.8/site-packages/ignite/handlers/checkpoint.py", line 542, in load_objects
    Checkpoint._check_objects(to_load, "load_state_dict")
      File "/opt/conda/lib/python3.8/site-packages/ignite/handlers/checkpoint.py", line 497, in _check_objects
    for k, obj in objs.items():
    AttributeError: 'list' object has no attribute 'items'
    Send model to server.
    2021-06-02 23:35:23,162 - FederatedClient - INFO - Starting to push model.
    Traceback (most recent call last):
      File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client.py", line 127, in federated_step
      File "apps/fed_learn/trainers/client_trainer.py", line 108, in train
      File "apps/fed_learn/trainers/supervised_fitter.py", line 69, in fit
      File "/opt/monai/monai/engines/evaluator.py", line 120, in run
    super().run()
      File "/opt/monai/monai/engines/workflow.py", line 206, in run
    super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
      File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 702, in run
    return self._internal_run()
      File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 775, in _internal_run
    self._handle_exception(e)
      File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
      File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 424, in _fire_event
    func(*first, *(event_args + others), **kwargs)
      File "/opt/monai/monai/handlers/stats_handler.py", line 145, in exception_raised
    raise e
      File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 737, in _internal_run
    self._fire_event(Events.STARTED)
      File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 424, in _fire_event
    func(*first, *(event_args + others), **kwargs)
      File "/opt/monai/monai/handlers/checkpoint_loader.py", line 110, in __call__
    Checkpoint.load_objects(to_load=self.load_dict, checkpoint=checkpoint, strict=self.strict)
      File "/opt/conda/lib/python3.8/site-packages/ignite/handlers/checkpoint.py", line 542, in load_objects
    Checkpoint._check_objects(to_load, "load_state_dict")
      File "/opt/conda/lib/python3.8/site-packages/ignite/handlers/checkpoint.py", line 497, in _check_objects
    for k, obj in objs.items():
    AttributeError: 'list' object has no attribute 'items'
    Traceback (most recent call last):
      File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client.py", line 229, in admin_run
      File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client.py", line 178, in run_federated_steps
      File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client.py", line 135, in federated_step
      File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client_base.py", line 217, in push_models
      File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
      File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
      File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
      File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
      File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client_base.py", line 161, in push_remote_model
      File "<nvflare-0.1.4>/nvflare/private/fed/client/communicator.py", line 292, in submitUpdate
      File "<nvflare-0.1.4>/nvflare/private/fed/client/data_assembler.py", line 33, in get_contribution_data
      File "<nvflare-0.1.4>/nvflare/private/fed/client/client_model_manager.py", line 103, in read_current_model
    TypeError: argument of type 'NoneType' is not iterable
    2021-06-02 23:35:26,156 - ProcessExecutor - INFO - process finished with return code 0

environment.json:

{
    "DATA_ROOT": "/workspace/data/",
    "DATASET_JSON": "/workspace/data/datalist.json",
    "PROCESSING_TASK": "segmentation",
    "MMAR_EVAL_OUTPUT_PATH": "eval",
    "MMAR_CKPT_DIR": "models",
    "MMAR_VAL_CKPT": "models/model.pt"
}

config_train.json (handlers excerpt):

  {
    "name": "CheckpointLoader",
    "disabled": "{dont_load_ckpt_model}",
    "args": {
      "load_path": "{MMAR_VAL_CKPT}",
      "load_dict": ["model"]
    }
  },
  {
    "name": "CheckpointSaver",
    "rank": 0,
    "args": {
      "save_dir": "{MMAR_CKPT_DIR}",
      "save_dict": ["model"],
      "save_key_metric": true
    }
  }
]

Please help!

Hi
Thanks for continuing to test this. It seems that FL uses a special config.

Please remove the

"MMAR_VAL_CKPT": "models/model.pt"

from the environment.json

Please let us know how it goes

We cannot remove MMAR_VAL_CKPT because it is referenced in config_train.json under the CheckpointLoader handler:

{
        "name": "CheckpointLoader",
        "disabled": "{dont_load_ckpt_model}",
        "args": {
          "load_path": "{MMAR_VAL_CKPT}",
          "load_dict": ["model"]
        }
      }

Removing it throws an error before training starts: AssertionError: must provide a clear path to load checkpoint.

Please note that this value is also used in config_validation.json.
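Since the same placeholder is consumed by both config_train.json and config_validation.json, it can help to list which {NAME} placeholders a config references and which of them environment.json does not define before removing a key. A rough sketch using only the standard library follows. It assumes placeholders are written as {NAME} inside string values, as in the snippets in this thread, and it cannot know about variables defined elsewhere (dont_load_ckpt_model, for instance, is not in the environment.json shown). The helper names are hypothetical.

```python
import json
import re

# Matches {NAME}; JSON object braces do not match because a quote or newline follows them.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def referenced_placeholders(config_text):
    """Collect every {NAME} placeholder that appears in the raw config text."""
    return set(PLACEHOLDER.findall(config_text))


def undefined_placeholders(config_text, environment):
    """Placeholders used by the config but absent from the environment mapping."""
    return referenced_placeholders(config_text) - set(environment)


if __name__ == "__main__":
    environment = json.loads("""
    {
        "MMAR_CKPT_DIR": "models",
        "MMAR_VAL_CKPT": "models/model.pt"
    }
    """)
    config_text = """
    {
        "name": "CheckpointLoader",
        "disabled": "{dont_load_ckpt_model}",
        "args": {"load_path": "{MMAR_VAL_CKPT}", "load_dict": ["model"]}
    }
    """
    print(sorted(undefined_placeholders(config_text, environment)))
    # ['dont_load_ckpt_model']
```

Running the same check with MMAR_VAL_CKPT removed from the environment would add it to the reported set, which corresponds to the situation described above.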