Unable to load checkpoint loader during cross site validation


During cross site validation, we reference the CheckpointLoader in config_validation.json as follows:

    {
      "name": "CheckpointLoader",
      "args": {
        "load_path": "{MMAR_VAL_CKPT}",
        "load_dict": ["model"]
      }
    }

Note that after deploying the MMAR to the client, model.pt is present under the models folder. However, we still receive the following error during cross site validation:

2021-06-08 17:39:27,829 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2021-06-08 17:39:27,829 - ignite.engine.engine.SupervisedEvaluator - ERROR - Engine run is terminating due to exception: [Errno 2] No such file or directory: '/tmp/tmp5k77tgdg/mmar/models/model.pt'
2021-06-08 17:39:27,829 - ignite.engine.engine.SupervisedEvaluator - ERROR - Exception: [Errno 2] No such file or directory: '/tmp/tmp5k77tgdg/mmar/models/model.pt'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 737, in _internal_run
  File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 424, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/opt/monai/monai/handlers/checkpoint_loader.py", line 92, in __call__
    checkpoint = torch.load(self.load_path, map_location=self.map_location)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 579, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp5k77tgdg/mmar/models/model.pt'

Please find the detailed logs attached.
We have previously raised a similar issue but haven’t received any useful input - please help as this is a blocker.

lesion-activity-cross-validation-ClientA(true) (145.4 KB)
lesion-activity-cross-validation-ClientB(true) (100.6 KB)
lesion-activity-server-cross-validation(true) (13.6 KB)

Hi Siddharth,

Can you possibly share your MMAR? It would allow us to investigate the issues with configuration.


Please find the MMAR here: adminMMAR.zip (Google Drive)

The issue is the CheckpointLoader configuration.

When running cross site validation, validation is launched with the following args:
mmar: temp_mmar_dir, config: config/config_validation.json, MMAR_CKPT_DIR: models, MMAR_CKPT: models/{model_name}

The last argument is important. This is the name of the model copied from the shareable to the temporary directory.
However, in your configuration, you are using:

    {
      "name": "CheckpointLoader",
      "args": {
        "load_path": "{MMAR_VAL_CKPT}",
        "load_dict": ["model"]
      }
    }

So the CheckpointLoader in config_validation.json ignores MMAR_CKPT entirely and uses MMAR_VAL_CKPT instead. That variable resolves to models/model.pt, which does not exist in the temporary directory, and this causes the error.
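
To make the mismatch concrete, here is a minimal Python sketch of placeholder substitution. It is only an illustration, not Clara's actual implementation. The `resolve` helper, the file name `shared_model.pt` (standing in for `{model_name}`), and the assumption that MMAR_VAL_CKPT defaults to `models/model.pt` (inferred from the path in the error log) are all hypothetical.

```python
def resolve(template: str, variables: dict) -> str:
    """Replace each {NAME} placeholder in template with its value (illustrative only)."""
    result = template
    for name, value in variables.items():
        result = result.replace("{" + name + "}", value)
    return result


variables = {
    # Passed by the cross site validation launcher, per the thread.
    "MMAR_CKPT_DIR": "models",
    # Hypothetical file name standing in for models/{model_name}.
    "MMAR_CKPT": "models/shared_model.pt",
    # Assumed default, consistent with the path in the error log.
    "MMAR_VAL_CKPT": "models/model.pt",
}

# The original config asks for a file that was never copied to the temp MMAR.
print(resolve("{MMAR_VAL_CKPT}", variables))  # models/model.pt

# The corrected config asks for the file the launcher actually provides.
print(resolve("{MMAR_CKPT}", variables))  # models/shared_model.pt
```

Under these assumptions, only the second path refers to a file that exists in the temporary directory.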

The fix is to use:

    {
      "name": "CheckpointLoader",
      "disabled": "{dont_load_ckpt_model}",
      "args": {
        "load_path": "{MMAR_CKPT}",
        "load_dict": ["model"]
      }
    }

This is a current limitation that we will remove in the future. For now, the CheckpointLoader used in cross site validation must use MMAR_CKPT as its load_path. We will check our documentation and update it if this is not covered.
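
If it helps, the rule above can be checked before launching a run. The following is a rough sketch, not part of Clara or MONAI: the helper names are hypothetical, and the sample config layout (a top-level `handlers` list) is an assumption. The search is recursive, so it does not depend on where the handler sits in the file.

```python
import json

EXPECTED_LOAD_PATH = "{MMAR_CKPT}"


def find_checkpoint_loaders(node):
    """Yield every dict whose "name" is "CheckpointLoader", at any nesting depth."""
    if isinstance(node, dict):
        if node.get("name") == "CheckpointLoader":
            yield node
        for value in node.values():
            yield from find_checkpoint_loaders(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_checkpoint_loaders(item)


def mismatched_load_paths(config):
    """Return the load_path of each CheckpointLoader that does not use {MMAR_CKPT}."""
    bad = []
    for handler in find_checkpoint_loaders(config):
        load_path = handler.get("args", {}).get("load_path")
        if load_path != EXPECTED_LOAD_PATH:
            bad.append(load_path)
    return bad


# Hypothetical sample; the surrounding layout is assumed for illustration.
sample = json.loads("""
{
  "handlers": [
    {
      "name": "CheckpointLoader",
      "args": {"load_path": "{MMAR_VAL_CKPT}", "load_dict": ["model"]}
    }
  ]
}
""")

print(mismatched_load_paths(sample))  # ['{MMAR_VAL_CKPT}']
```

To check a real MMAR, load config/config_validation.json with `json.load` in place of the sample. An empty list means every CheckpointLoader uses MMAR_CKPT.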

Can you please try changing the CheckpointLoader configuration in config_validation.json and give this another try?

Thank you, this fixes it. Much appreciated!