Federated Learning - Error on Validation

Hello,

We are trying to implement a Federated Learning approach on our data but could not complete a full training cycle. The model details and configuration are below:

  • Model Input: 512x768 grayscale image (3 channels)
  • Model Output: 512x768 black&white image (1 channel)
  • Model Type: Segmentation Network for 2 classes

Config_Train.config

{
  "epochs": 1250,
  "num_training_epoch_per_valid": 20,
  "learning_rate": 1e-4,
  "multi_gpu": false,
  "dynamic_input_shape": false,
  "use_amp": false,
  "train": {
	"loss": {
	  "name": "Dice"
	},
	"optimizer": {
	  "name": "Adam"
	},
	"model": {
	  "name": "SegAhnet",
	  "args": {
		"num_classes": 1,
		"if_use_psp": false,
		"pretrain_weight_name": "{PRETRAIN_WEIGHTS_FILE}",
		"plane": "z",
		"final_activation": "softmax",
		"n_spatial_dim": 2
	  }
	},
	"pre_transforms": [
	  {
		"name": "LoadPng",
		"args": {
		  "fields": [
			"image",
			"label"
		  ]
		}
	  },
	  {
		"name": "ConvertToChannelsFirst",
		"args": {
		  "fields": ["image", "label"]
		}
	  }
	],
	"image_pipeline": {
	  "name": "SegmentationImagePipelineWithCache",
	  "args": {
		"data_list_file_path": "{DATASET_JSON}",
		"data_file_base_dir": "{DATA_ROOT}",
		"data_list_key": "training",
		"output_crop_size": [512, 768],
		"output_image_channels": 3,
		"output_label_channels": 1,
		"output_batch_size": 1,
		"num_workers": 2,
		"prefetch_size": 0,
		"num_cache_objects": 20,
		"replace_percent": 0.25,
		"output_data_dims": 2,
		"batched_by_transforms": false
	  }
	}
  },
  "validate": {
	"metrics": [
	  {
		"name": "ComputeAverageDice",
		"args": {
		  "name": "mean_dice",
		  "is_key_metric": true,
		  "field": "model",
		  "label_field": "label"
		}
	  }
	],
	"pre_transforms": [
	  {
		"ref": "LoadPng"
	  },
	  {
		"name": "ConvertToChannelsFirst",
		"args": {
		  "fields": ["image", "label"]
		}
	  }
	],
	"image_pipeline": {
	  "name": "SegmentationImagePipeline",
	  "args": {
		"data_list_file_path": "{DATASET_JSON}",
		"data_file_base_dir": "{DATA_ROOT}",
		"data_list_key": "validation",
		"output_crop_size": [512, 768],
		"output_image_channels": 3,
		"output_label_channels": 1,
		"output_batch_size": 1,
		"output_data_dims": 2,
		"num_workers": 2,
		"prefetch_size": 0,
		"batched_by_transforms": true
	  }
	},
	"inferer": {
	  "name": "TFSimpleInferer"
	}
  }
}

Error on FL Client:

2021-03-29 15:57:03.462117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1320] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10146 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:65:00.0, compute capability: 7.0)
data_list_file: /data/dataset1.json
Number of samples: 100
Data Property: {'task': 'segmentation', 'num_channels': 3, 'num_label_channels': 1, 'data_format': 'channels_first', 'label_format': None, 'crop_size': [512, 768], 'num_data_dims': 2}
deterministic transforms: 4; non-deterministic transforms: 0
data_list_file: /data/dataset1.json
Number of samples: 20
Data Property: {'task': 'segmentation', 'num_channels': 3, 'num_label_channels': 1, 'data_format': 'channels_first', 'label_format': None, 'crop_size': [512, 768], 'num_data_dims': 2}
transpose to channels_last. input shape:  (?, 512, 768, 3)
input is channels_last!
Stage 1-0 (?, 256, 384, 64)
Stage 1 (?, 128, 192, 64)
Stage 2 (?, 128, 192, 256)
Stage 3 (?, 64, 96, 512)
Stage 4 (?, 32, 48, 1024)
Stage 5 (?, 16, 24, 2048)
inx (?, 32, 48, 1024)
inputs (?, 32, 48, 20)
->inputs (?, 32, 48, 1044)
inx (?, 32, 48, 1044)
inputs (?, 32, 48, 20)
->inputs (?, 32, 48, 1064)
inx (?, 32, 48, 1064)
inputs (?, 32, 48, 20)
->inputs (?, 32, 48, 1084)
Stage 6 (?, 32, 48, 1084)
inx (?, 64, 96, 512)
inputs (?, 64, 96, 20)
->inputs (?, 64, 96, 532)
inx (?, 64, 96, 532)
inputs (?, 64, 96, 20)
->inputs (?, 64, 96, 552)
inx (?, 64, 96, 552)
inputs (?, 64, 96, 20)
->inputs (?, 64, 96, 572)
Stage 7 (?, 64, 96, 572)
inx (?, 128, 192, 256)
inputs (?, 128, 192, 20)
->inputs (?, 128, 192, 276)
inx (?, 128, 192, 276)
inputs (?, 128, 192, 20)
->inputs (?, 128, 192, 296)
inx (?, 128, 192, 296)
inputs (?, 128, 192, 20)
->inputs (?, 128, 192, 316)
Stage 8 (?, 128, 192, 316)
inx (?, 128, 192, 64)
inputs (?, 128, 192, 20)
->inputs (?, 128, 192, 84)
inx (?, 128, 192, 84)
inputs (?, 128, 192, 20)
->inputs (?, 128, 192, 104)
inx (?, 128, 192, 104)
inputs (?, 128, 192, 20)
->inputs (?, 128, 192, 124)
Stage 9 (?, 128, 192, 124)
inx (?, 256, 384, 64)
inputs (?, 256, 384, 20)
->inputs (?, 256, 384, 84)
inx (?, 256, 384, 84)
inputs (?, 256, 384, 20)
->inputs (?, 256, 384, 104)
inx (?, 256, 384, 104)
inputs (?, 256, 384, 20)
->inputs (?, 256, 384, 124)
Stage 10 (?, 256, 384, 124)
Final activation softmax
Final (?, 512, 768, 1)
transpose back to channels_first
dice_loss targets [None, 1, 512, 768] predictions [None, 1, 512, 768] targets.dtype <dtype: 'float32'> predictions.dtype <dtype: 'float32'>
dice_loss is_channels_first: True skip_background: False is_onehot_targets False
Fitting with single gpu
2021-03-29 15:57:17,276 - SupervisedFitter - INFO - CLEAN START (global_variables_initializer)
Requested train epochs: 5; iterations: 20
2021-03-29 15:57:38,002 - FederatedClient - INFO - Starting to fetch global model.
2021-03-29 15:57:48,647 - Communicator - INFO - Received endpointing_project model at round 0 (373502766 Bytes). GetModel time: 10.64217758178711 seconds
Get global model for round: 0
2021-03-29 15:57:56,152 - AssignVariables - INFO - Vars from remote 1474, Vars from local 1474, vars matched 1474 of 1474 local
2021-03-29 15:57:56,214 - ClientModelManager - INFO - Setting global federated model data (93299139 elements)
2021-03-29 15:57:56,214 - ClientModelManager - INFO - Round 0: local model updated
pull_models completed. Status:True rank:0
2021-03-29 15:57:59,840 - SupervisedFitter - INFO - Winding down training ...
2021-03-29 15:58:05,039 - SupervisedFitter - INFO - Saved final model checkpoint at: /workspace/startup/../run_7/mmar_org1/models/model_final.ckpt
2021-03-29 15:58:07,606 - SupervisedFitter - INFO - Saved model checkpoint at: /workspace/startup/../run_7/mmar_org1/models/model.ckpt
2021-03-29 15:58:07,608 - SupervisedFitter - INFO - Total time for fitting: 11.29s
2021-03-29 15:58:07,609 - SupervisedFitter - INFO - Best validation metric: -1000000 at epoch 0
Traceback (most recent call last):
  File "workflows/fitters/supervised_fitter.py", line 245, in fit
  File "workflows/fitters/supervised_fitter.py", line 535, in _do_fit
  File "workflows/fitters/supervised_fitter.py", line 796, in validate_and_log_tensorboard
  File "workflows/fitters/supervised_fitter.py", line 955, in do_validation
  File "components/inferers/simple_inferer.py", line 20, in infer
  File "components/inferers/tf_predictor.py", line 36, in predict
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
	run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1156, in _run
	(np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 512, 768) for Tensor 'NV_LABEL_INPUT:0', which has shape '(?, 1, 512, 768)'
2021-03-29 15:58:07,614 - FederatedClient - INFO - Starting to fetch global model.
No token for this client in current round. Waiting for server new round.
No token for this client in current round. Waiting for server new round.
No token for this client in current round. Waiting for server new round.
No token for this client in current round. Waiting for server new round.
Client training has been aborted.
2021-03-29 15:58:25,273 - FLClientConfiger - INFO - DETERMINISM IS OFF
2021-03-29 15:58:25,280 - ValidationScheduler - INFO - Starting validation scheduler.
2021-03-29 15:58:25,282 - FederatedClient - INFO - Cross site validation disabled. Submitting empty best model to indicate opt-out to server.
2021-03-29 15:58:25,306 - Communicator - INFO - Server reply to SubmitBestLocalModel:  Received best model from org1.. SubmitBestLocalModel time: 0.02283501625061035 seconds
2021-03-29 15:58:25,382 - FederatedClient - INFO - Not participating in cross site validation.

I tried various configuration options to fix the dimension mismatch error below but have not been able to resolve it so far.

**ValueError: Cannot feed value of shape (1, 512, 768) for Tensor 'NV_LABEL_INPUT:0', which has shape '(?, 1, 512, 768)'**

Could you advise how to change the config to resolve this issue?

Thanks
Faruk
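
For reference, the error above reports a rank mismatch: the fed label array has three axes, (1, 512, 768), while the placeholder expects four, (batch, channel, height, width). The following is a minimal NumPy sketch of that compatibility rule. `is_feedable` is a hypothetical helper written for illustration only; it is not part of TensorFlow or Clara Train, and it only mimics the shape check rather than suggesting a fix.

```python
import numpy as np

def is_feedable(value_shape, placeholder_shape):
    """Return True if value_shape fits placeholder_shape (None means any size)."""
    # The ranks must match before individual axes are compared.
    if len(value_shape) != len(placeholder_shape):
        return False
    return all(p is None or p == v for v, p in zip(value_shape, placeholder_shape))

# Shape of 'NV_LABEL_INPUT:0' as reported in the traceback: (?, 1, 512, 768)
placeholder = (None, 1, 512, 768)

# Shape of the label value that was fed during validation
label = np.zeros((1, 512, 768), dtype=np.float32)

print(is_feedable(label.shape, placeholder))              # False: 3 axes vs 4
print(is_feedable(label[np.newaxis].shape, placeholder))  # True: (1, 1, 512, 768)
```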

Hi,
Thanks for your interest in the Clara Train SDK.
This looks like a network/data error. To simplify the problem, put FL aside and run regular training first; you should get the same error. Once you fix it, you can go back to running FL.

Some observations:

  • A model output of a 512x768 black & white image (1 channel) is incorrect. You have 2 classes here, black and white, so the output should be 2x512x768.
  • Relatedly, change "num_classes": 1 to "num_classes": 2 in the model section, and in the image pipeline change "output_label_channels": 1 to 2.
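
To illustrate the two-channel layout described above, here is a small NumPy sketch. It rests on assumptions: `to_two_class_onehot` and `softmax` are hypothetical helpers, and how Clara Train builds the label tensor internally is not shown in this thread. The sketch also shows one reason a single output channel is a poor match for "final_activation": "softmax": a softmax over an axis of size 1 always returns 1.0.

```python
import numpy as np

def to_two_class_onehot(mask):
    """Convert an (H, W) binary mask into a (2, H, W) channels-first one-hot array."""
    foreground = (mask > 0).astype(np.float32)
    background = 1.0 - foreground
    return np.stack([background, foreground], axis=0)

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# A synthetic black & white label with one white rectangle
mask = np.zeros((512, 768), dtype=np.uint8)
mask[100:200, 300:400] = 255

onehot = to_two_class_onehot(mask)
print(onehot.shape)  # (2, 512, 768)

# Softmax over a single channel carries no information: every value is 1.0
logits = np.random.default_rng(0).normal(size=(1, 4, 4))
single = softmax(logits, axis=0)
print(np.allclose(single, 1.0))  # True
```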

Thanks for the response.
As you pointed out, changing the "num_classes" parameter of the model to 2 resolved the issue.
Additionally, testing and validating config_train.json and the other configuration files with regular Clara training before setting up FL is very good practice and should be the very first step. Thanks for this tip as well.
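
In the same spirit of checking the configuration before setting up FL, a quick consistency check can be sketched in Python. This is an assumption-laden illustration: `check_channel_consistency` is a hypothetical helper, not a Clara Train tool, and it only compares "num_classes" with "output_label_channels" as suggested in the reply above. The thread does not confirm that the two must always match, so the results are treated as warnings rather than errors.

```python
import json

def check_channel_consistency(config):
    """Return warnings where output_label_channels disagrees with num_classes."""
    num_classes = config["train"]["model"]["args"]["num_classes"]
    warnings = []
    for section in ("train", "validate"):
        args = config.get(section, {}).get("image_pipeline", {}).get("args", {})
        channels = args.get("output_label_channels")
        if channels is not None and channels != num_classes:
            warnings.append(
                f"{section}: output_label_channels={channels} but num_classes={num_classes}"
            )
    return warnings

# A trimmed-down stand-in for the training config, using the same key layout
example = json.loads("""
{
  "train": {
    "model": {"name": "SegAhnet", "args": {"num_classes": 2}},
    "image_pipeline": {"args": {"output_label_channels": 2}}
  },
  "validate": {
    "image_pipeline": {"args": {"output_label_channels": 1}}
  }
}
""")

for warning in check_channel_consistency(example):
    print(warning)
```

In practice the dictionary would come from `json.load` on the actual config file rather than from an inline string.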