Difference in FL results: CLARA v4.0-EA2 vs. CLARA v4.0 GA


We have an interesting observation and would like your input on it.

Using the same dataset (on both clients) in both scenarios, we observe the following:

When performing an FL run on CLARA v4.0-EA2:

{'org1-a': {'org1-a': {'validation': {'monai_mean_dice': 0.569320023059845}}}}
{'org1-b': {'org1-b': {'validation': {'monai_mean_dice': 0.5226138830184937}}}}
{'org1-a': {'org1-b': {'validation': {'monai_mean_dice': 0.5226138830184937}}}}
{'org1-b': {'org1-a': {'validation': {'monai_mean_dice': 0.569320023059845}}}}
{'org1-a': {'server': {'validation': {'monai_mean_dice': 0.5898884534835815}}}}
{'org1-b': {'server': {'validation': {'monai_mean_dice': 0.5898662209510803}}}}

When performing an FL run on CLARA v4.0:

{'org1-a': {'FL_global_model': {'validation': {"Monai's validation mean dice loss": 0.5254678130149841}}}}
{'org1-b': {'FL_global_model': {'validation': {"Monai's validation mean dice loss": 0.5254336595535278}}}}
{'org1-a': {'org1-b': {'validation': {"Monai's validation mean dice loss": 0.5733405947685242}}}}
{'org1-b': {'org1-b': {'validation': {"Monai's validation mean dice loss": 0.573359489440918}}}}
{'org1-b': {'org1-a': {'validation': {"Monai's validation mean dice loss": 0.573359489440918}}}}
{'org1-a': {'org1-a': {'validation': {"Monai's validation mean dice loss": 0.5733405947685242}}}}

When performing an FL run using IntimeModelSelectionHandler on CLARA v4.0:

{'org1-a': {'FL_global_model': {'validation': {"Monai's validation mean dice loss": 0.5254678130149841}}}}
{'org1-b': {'FL_global_model': {'validation': {"Monai's validation mean dice loss": 0.5254336595535278}}}}
{'org1-b': {'best_FL_global_model': {'validation': {"Monai's validation mean dice loss": 0.573359489440918}}}}
{'org1-a': {'best_FL_global_model': {'validation': {"Monai's validation mean dice loss": 0.5733405947685242}}}}
{'org1-b': {'org1-b': {'validation': {"Monai's validation mean dice loss": 0.573359489440918}}}}
{'org1-a': {'org1-b': {'validation': {"Monai's validation mean dice loss": 0.5733405947685242}}}}
{'org1-b': {'org1-a': {'validation': {"Monai's validation mean dice loss": 0.573359489440918}}}}
{'org1-a': {'org1-a': {'validation': {"Monai's validation mean dice loss": 0.5733405947685242}}}}


  • Why is there a reduction in the performance of the aggregated server model in CLARA v4.0 compared to CLARA v4.0-EA2?

  • Is there a difference in the way the aggregator is implemented between these two versions?

  • This is only an assumption, but does EAv2 by default select the best FL model, i.e. the global checkpoint that achieved the highest average validation score across all clients? Kindly clarify.
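For context on the aggregator question: standard FedAvg-style aggregation averages the clients' model weights, typically weighted by each client's training-set size. The sketch below shows this general scheme only; it is illustrative and is not Clara's actual implementation.

```python
import numpy as np

def fedavg(client_weights, client_num_samples):
    """FedAvg-style aggregation: weighted average of client weights.

    client_weights: list of dicts mapping layer name -> np.ndarray
    client_num_samples: list of ints, training-set size per client
    """
    total = sum(client_num_samples)
    aggregated = {}
    for name in client_weights[0]:
        aggregated[name] = sum(
            w[name] * (n / total)
            for w, n in zip(client_weights, client_num_samples)
        )
    return aggregated

# Two toy clients with a single one-layer "model" each.
a = {"layer": np.array([1.0, 2.0])}
b = {"layer": np.array([3.0, 4.0])}
print(fedavg([a, b], [1, 1])["layer"])  # -> [2. 3.]
```

With equal sample counts this reduces to a plain mean of the client weights; with unequal counts the larger client dominates proportionally.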

Note: In both versions, num_rounds and local_epoch are the same, i.e. the hyperparameters are identical.

The “FL_global_model” is the global model being trained in the current round; it may not hold the best global model. The “best_FL_global_model” holds the best global model based on the clients' validation results from each round, if server-side model selection is used.
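The round-wise selection described above can be sketched as follows. The class and method names here are illustrative, not Clara's API: after each round the server averages the clients' validation metrics and keeps the checkpoint with the best average seen so far.

```python
class BestModelSelector:
    """Keep the global checkpoint with the highest average
    client validation metric (higher is better, e.g. mean Dice)."""

    def __init__(self):
        self.best_score = float("-inf")
        self.best_round = None
        self.best_weights = None

    def update(self, round_num, global_weights, client_val_scores):
        """Record this round's checkpoint if its average score is a new best."""
        avg = sum(client_val_scores) / len(client_val_scores)
        if avg > self.best_score:
            self.best_score = avg
            self.best_round = round_num
            # Copy so later rounds don't mutate the saved checkpoint.
            self.best_weights = dict(global_weights)
        return avg

selector = BestModelSelector()
selector.update(1, {"w": [0.1]}, [0.52, 0.53])
selector.update(2, {"w": [0.2]}, [0.57, 0.57])
selector.update(3, {"w": [0.3]}, [0.55, 0.56])
print(selector.best_round)  # -> 2
```

The key point for the question above: “FL_global_model” would correspond to the latest checkpoint, while “best_FL_global_model” corresponds to whatever this selector last saved.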

On the client side, each client keeps its own best local model throughout the whole FL training.

FL is not fully deterministic, so these differences might be due to chance. Longer training might give better results; we would need to look at each client's convergence curves to confirm that. I would also recommend using learning-rate decay to achieve more consistent results.
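One concrete low-level contributor to such run-to-run differences, beyond GPU kernels and data-loader ordering: floating-point addition is not associative, so even changing the order in which client updates are reduced during aggregation can perturb the result in the last bits, and those perturbations compound over rounds. A minimal illustration:

```python
# Floating-point addition is not associative: summing the same
# contributions in a different order can give a different result.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
print(left == right)  # -> False
```

Scaled up to millions of parameters summed across clients every round, tiny order-dependent differences like this can nudge training onto slightly different trajectories even when each client's local pipeline is seeded.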

Thank you for your comment, hroth3hm8y. Given that our model uses a fixed seed and is deterministic, can we confirm that the non-determinism comes purely from the FL side, and if so, which part of FL introduces it?