Pre-trained model not working on CLARA GA 4.0

We are trying to load our pre-trained model using CheckpointLoader, as shown below:

config_train.json


     "handlers": [
      {
        "name": "CheckpointLoader",
        "args": {
          "load_path": "{MMAR_CKPT}",
          "load_dict": ["model"]
        }
      },

environment.json

{
    "DATA_ROOT": "/workspace/data/",
    "DATASET_JSON": "/workspace/data/CLARA/datalist.json",
    "PROCESSING_TASK": "segmentation",
    "MMAR_EVAL_OUTPUT_PATH": "eval",
    "MMAR_CKPT_DIR": "models",
    "MMAR_CKPT": "models/unet_weights.pt"
}
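
For readers unfamiliar with the placeholder syntax: the `{MMAR_CKPT}` entry in config_train.json is expected to be filled in from environment.json. The sketch below only imitates that substitution with plain string replacement so the resulting path is visible. The `resolve` helper is hypothetical and written for illustration; it is not Clara's actual implementation.

```python
import json

# Values copied from the environment.json shown above.
environment = json.loads("""
{
    "DATA_ROOT": "/workspace/data/",
    "DATASET_JSON": "/workspace/data/CLARA/datalist.json",
    "PROCESSING_TASK": "segmentation",
    "MMAR_EVAL_OUTPUT_PATH": "eval",
    "MMAR_CKPT_DIR": "models",
    "MMAR_CKPT": "models/unet_weights.pt"
}
""")

# Handler entry copied from the config_train.json fragment shown above.
handler = {
    "name": "CheckpointLoader",
    "args": {"load_path": "{MMAR_CKPT}", "load_dict": ["model"]},
}


def resolve(value, env):
    """Hypothetical helper: recursively replace {NAME} placeholders with env values."""
    if isinstance(value, str):
        for key, replacement in env.items():
            value = value.replace("{" + key + "}", replacement)
        return value
    if isinstance(value, dict):
        return {k: resolve(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, env) for v in value]
    return value


resolved = resolve(handler, environment)
print(resolved["args"]["load_path"])  # models/unet_weights.pt
```

Note that the path stays relative after substitution, so whether the file is found depends on the working directory the trainer resolves it against.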

However, on starting client training we receive the following error:

siddharth@FLS-1:/workspace/startup$ 2021-07-06 19:44:59,736 - ClientAdminInterface - INFO - Starting client training. rank: 0
training child process ID: 746
starting the client .....
token is: 50dd9a0f-d8cd-4d09-aff3-ba982acf0170 run_number is: 7072021 uid: org1-a listen_port: 35877
2021-07-06 19:45:00,396 - matplotlib - WARNING - Matplotlib created a temporary config/cache directory at /tmp/matplotlib-mxg0cc__ because the default path (/home/siddharth/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2021-07-06 19:45:00,420 - matplotlib.font_manager - INFO - Generating new fontManager, this may take some time...
2021-07-06 19:45:00,771 - ProcessExecutor - INFO - waiting for process to finish
Created the listener on port: 35877
2021-07-06 19:45:02,282 - torch.distributed.nn.jit.instantiator - INFO - Created a temporary directory at /tmp/tmpb0zxaot6
2021-07-06 19:45:02,282 - torch.distributed.nn.jit.instantiator - INFO - Writing /tmp/tmpb0zxaot6/_remote_module_non_sriptable.py
siddharth@FLS-1:/workspace/startup$ no max_epochs specified for SupervisedTrainer args, use global var 'epochs: 10'.
========== Train Config Result ===========
Num Epochs:  10
Use GPU:  True
Multi GPU:  False
Automatic Mixed Precision:  Disabled
Determinism Training:  Enabled
cuDNN BenchMark:  False
CUDA Matmul Allow TF32:  True
cuDNN Allow TF32:  True
Model:  <class 'lesion-activity-clara-fl.NewLesionsUNet.model.NewLesionsUNet'>
Loss:  <class 'lesion-activity-clara-fl.NewLesionsUNet.model.MyDiCELoss'>
Optimizer:  <class 'torch.optim.adam.Adam'>
LR Scheduler:  <class 'NoneType'>
Train Dataset:  <class 'lesion-activity-clara-fl.NewLesionsUNet.dataset.LongitudinalCroppingDataset'>
Train DataLoader:  <class 'monai.data.dataloader.DataLoader'>
Train Transform #1: <class 'monai.transforms.utility.dictionary.ToTensord'>
Validate Dataset:  <class 'monai.data.dataset.Dataset'>
Validate DataLoader:  <class 'monai.data.dataloader.DataLoader'>
Validate Transform #1: <class 'monai.transforms.io.dictionary.LoadImaged'>
Validate Transform #2: <class 'lesion-activity-clara-fl.NewLesionsUNet.pre_transform.GetMask'>
Validate Transform #3: <class 'monai.transforms.intensity.dictionary.NormalizeIntensityd'>
Validate Transform #4: <class 'monai.transforms.utility.dictionary.ToTensord'>
Train Handler #1: <class 'monai.handlers.checkpoint_loader.CheckpointLoader'>
Train Handler #2: <class 'monai.handlers.validation_handler.ValidationHandler'>
Train Handler #3: <class 'monai.handlers.checkpoint_saver.CheckpointSaver'>
Train Handler #4: <class 'monai.handlers.stats_handler.StatsHandler'>
Train Handler #5: <class 'monai.handlers.tensorboard_handlers.TensorBoardStatsHandler'>
Validate Handler #1: <class 'monai.handlers.checkpoint_loader.CheckpointLoader'>
Validate Handler #2: <class 'monai.handlers.stats_handler.StatsHandler'>
Validate Handler #3: <class 'monai.handlers.tensorboard_handlers.TensorBoardStatsHandler'>
Validate Handler #4: <class 'monai.handlers.checkpoint_saver.CheckpointSaver'>
Validate Post Transforms #1: <class 'monai.transforms.post.dictionary.Activationsd'>
Validate Post Transforms #2: <class 'monai.transforms.post.dictionary.AsDiscreted'>
Validate Inferer:  <class 'monai.inferers.inferer.SlidingWindowInferer'>
Validate Key Metric:  <class 'monai.handlers.mean_dice.MeanDice'>
Train Inferer:  <class 'monai.inferers.inferer.SimpleInferer'>
========== End of Train Config Result ===========
2021-07-06 19:45:23,240 - FederatedClient - INFO - Starting to fetch global model.
2021-07-06 19:45:30,199 - Communicator - INFO - Received lesion_activity model at round 0 (48777665 Bytes). GetModel time: 6.951814413070679 seconds
Get global model for round: 0
pull_models completed. Status:True rank:0
2021-07-06 19:45:30,231 - ClientTrainer - INFO - ClientTrainer abort signal: False
2021-07-06 19:45:30,231 - AssignVariables - INFO - Vars 56 of 56 assigned.
2021-07-06 19:45:30,260 - ModelShareableManager - INFO - Setting global federated model data (12193313 elements)
2021-07-06 19:45:30,260 - ModelShareableManager - INFO - Round 0: local model updated
2021-07-06 19:45:30,260 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
new
['model']
2021-07-06 19:45:30,283 - ignite.engine.engine.SupervisedEvaluator - ERROR - Engine run is terminating due to exception: 'list' object has no attribute 'items'
2021-07-06 19:45:30,283 - ignite.engine.engine.SupervisedEvaluator - ERROR - Exception: 'list' object has no attribute 'items'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 737, in _internal_run
    self._fire_event(Events.STARTED)
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 424, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/opt/monai/monai/handlers/checkpoint_loader.py", line 119, in __call__
    print(list(self.load_dict.items())[0])
AttributeError: 'list' object has no attribute 'items'
Send model to server.
2021-07-06 19:45:30,299 - FederatedClient - INFO - Starting to push model.
Traceback (most recent call last):
  File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client.py", line 127, in federated_step
  File "apps/fed_learn/trainers/client_trainer.py", line 108, in train
  File "apps/fed_learn/trainers/supervised_fitter.py", line 69, in fit
  File "/opt/monai/monai/engines/evaluator.py", line 120, in run
    super().run()
  File "/opt/monai/monai/engines/workflow.py", line 206, in run
    super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 702, in run
    return self._internal_run()
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 775, in _internal_run
    self._handle_exception(e)
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 424, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/opt/monai/monai/handlers/stats_handler.py", line 145, in exception_raised
    raise e
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 737, in _internal_run
    self._fire_event(Events.STARTED)
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 424, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/opt/monai/monai/handlers/checkpoint_loader.py", line 119, in __call__
    print(list(self.load_dict.items())[0])
AttributeError: 'list' object has no attribute 'items'
Traceback (most recent call last):
  File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client.py", line 229, in admin_run
  File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client.py", line 178, in run_federated_steps
  File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client.py", line 135, in federated_step
  File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client_base.py", line 217, in push_models
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "<nvflare-0.1.4>/nvflare/private/fed/client/fed_client_base.py", line 161, in push_remote_model
  File "<nvflare-0.1.4>/nvflare/private/fed/client/communicator.py", line 292, in submitUpdate
  File "<nvflare-0.1.4>/nvflare/private/fed/client/data_assembler.py", line 33, in get_contribution_data
  File "<nvflare-0.1.4>/nvflare/private/fed/client/client_model_manager.py", line 103, in read_current_model
TypeError: argument of type 'NoneType' is not iterable
2021-07-06 19:45:31,939 - ProcessExecutor - INFO - process finished with return code 0
cross_validation child process ID: 884
starting the client .....
token is: 50dd9a0f-d8cd-4d09-aff3-ba982acf0170 run_number is: 7072021 uid: org1-a
2021-07-06 19:45:32,559 - matplotlib - WARNING - Matplotlib created a temporary config/cache directory at /tmp/matplotlib-y8uoj6bc because the default path (/home/siddharth/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2021-07-06 19:45:32,584 - matplotlib.font_manager - INFO - Generating new fontManager, this may take some time...
2021-07-06 19:45:34,204 - CrossSiteValManager - INFO - Cross validation timeout is set to 10 minutes.
2021-07-06 19:45:34,204 - CrossSiteValManager - INFO - File to load model registry doesn't exist at /workspace/startup/../run_7072021/mmar_org1-a/cross_validation/model_registry.pkl.Using empty registry.
2021-07-06 19:45:34,206 - CrossSiteValManager - INFO - Exception in loading model: 'NoneType' object has no attribute 'model_key'
2021-07-06 19:45:34,224 - Communicator - INFO - Server reply to SubmitBestLocalModel:  Received best model from org1-a.. SubmitBestLocalModel time: 0.01444864273071289 seconds
2021-07-06 19:45:34,227 - FederatedClient - INFO - Getting other models from server for cross validation.
2021-07-06 19:45:34,246 - Communicator - INFO - Received 0 models for validation. GetValidationModels time: 0.018537282943725586 seconds
2021-07-06 19:45:34,248 - FederatedClient - INFO - Server has no models available currently. Waiting 60 secs before asking again.
2021-07-06 19:46:34,316 - FederatedClient - INFO - Getting other models from server for cross validation.
2021-07-06 19:46:34,339 - Communicator - INFO - Received 0 models for validation. GetValidationModels time: 0.021426677703857422 seconds
2021-07-06 19:46:34,341 - FederatedClient - INFO - Finished cross site validation.

Please advise, as this is a blocker for us.

This has been resolved now. Thank you.
Key points:

  • Remove CheckpointLoader from the validate section of config_train.json

  • Introduce a new environment variable for the CheckpointLoader in the train section of config_train.json
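
For anyone hitting the same traceback: the failing line in checkpoint_loader.py calls `.items()` on `load_dict`, which only works when `load_dict` is a dictionary. MONAI documents `load_dict` for CheckpointLoader as a mapping from checkpoint keys to the objects to load into. The snippet below reproduces the AttributeError without MONAI. `first_entry` and `placeholder_model` are hypothetical names used only for illustration, and how Clara converts the JSON `load_dict` value before passing it to the handler is not shown here.

```python
def first_entry(load_dict):
    """Mimic the expression from the traceback: list(load_dict.items())[0]."""
    return list(load_dict.items())[0]


# Stand-in for a real network object; any object works for this demonstration.
placeholder_model = object()

# A dict mapping the checkpoint key to the target object works.
key, target = first_entry({"model": placeholder_model})
print(key)  # model

# A plain list, as written in the config above, fails the same way as in the log.
try:
    first_entry(["model"])
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)  # 'list' object has no attribute 'items'
```

The later `TypeError: argument of type 'NoneType' is not iterable` in the log appears right after this first failure, when the client tries to push a model that was never trained, so it is most likely a follow-on error rather than a separate problem.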