Clients left after the first round

Hi all,
I was trying to run the federated learning repo clara_mri_fed_learning_seg_brain_tumors_br16_t1c2tc_no_amp with 2 clients, but it never aggregates the local models: training always stops when the first round ends.
How can I fix this?

Thanks in advance.

The server log looks like this:

2020-10-27 16:58:03,020 - ServerModelManager - INFO - CLEAN START (global_variables_initializer)
2020-10-27 16:58:05,405 - FederatedServer - INFO - Round time: less than a second(s).
2020-10-27 16:58:05,412 - FederatedServer - INFO - starting secure server at fedserver:8008
2020-10-27 16:58:33,281 - FederatedServer - INFO - Client: New client @192.168.211.17 joined. Sent token: 184daf2d-88d0-468d-abb2-32f33380b01a. Total clients: 1
2020-10-27 16:58:33,283 - FederatedServer - INFO - Client: New client @192.168.211.20 joined. Sent token: 6e8c0dcc-b366-4b20-9143-f359e9418855. Total clients: 2
2020-10-27 16:59:27,346 - FederatedServer - INFO - Client: 184daf2d-88d0-468d-abb2-32f33380b01a left. Total clients: 1
2020-10-27 16:59:27,347 - FederatedServer - INFO - Client: 6e8c0dcc-b366-4b20-9143-f359e9418855 left. Total clients: 0

The client logs look like this:
Epoch: 1/1, Iter: 120/122 [=================== ] train_dice_tc: 0.1917 train_loss: 0.8694 time: 1.11s
Epoch: 1/1, Iter: 121/122 [=================== ] train_dice_tc: 0.1963 train_loss: 0.8661 time: 1.11s
Epoch: 1/1, Iter: 122/122 [====================] train_dice_tc: 0.1970 train_loss: 0.8659 time: 1.11s
This epoch: 158.04s; per epoch: 158.04s; elapsed: 158.04s; remaining: 0.00s; best metric: 0.029600229114294052 at epoch 0
Saved final model checkpoint at: /workspace/fed/commands/…/models/model_final.ckpt
Total time for fitting: 177.27s
Best validation metric: 0.029600229114294052 at epoch 0
2020-10-27 17:04:48,767 - FederatedClient - INFO - Shutting down client
2020-10-27 17:04:48,769 - FederatedClient - INFO - Quitting server: brats_segmentation
2020-10-27 17:04:58,787 - FederatedClient - INFO - Received comment from server: Removed client
2020-10-27 17:04:58,850 - main - INFO - Total Training Time 205.05036044120789

The config_fed_server.json:
{
    "servers": [
        {
            "name": "brats_segmentation",
            "service": {
                "target": "fedserver:8008",
                "options": [
                    ["grpc.max_send_message_length", 1000000000],
                    ["grpc.max_receive_message_length", 1000000000]
                ]
            },
            "ssl_private_key": "resources/certs/server.key",
            "ssl_cert": "resources/certs/server.crt",
            "ssl_root_cert": "resources/certs/rootCA.pem",
            "min_num_clients": 2,
            "max_num_clients": 5,
            "wait_after_min_clients": 10,
            "heart_beat_timeout": 600,
            "start_round": 0,
            "num_rounds": 100,
            "exclude_vars": "dummy",
            "num_server_workers": 5
        }
    ],
    "aggregator": {
        "name": "ModelAggregator",
        "args": {
            "exclude_vars": "dummy",
            "aggregation_weights": {
                "client0": 1,
                "client1": 1.5,
                "client2": 0.8
            }
        }
    }
}

The config_fed_client1.json:
{
    "servers": [
        {
            "name": "brats_segmentation",
            "service": {
                "target": "fedserver:8008",
                "options": [
                    ["grpc.max_send_message_length", 1000000000],
                    ["grpc.max_receive_message_length", 1000000000]
                ]
            }
        }
    ],
    "client": {
        "local_epochs": 1,
        "steps_aggregation": 10,
        "exclude_vars": "dummy",
        "privacy": {
            "name": "PercentileProtocol",
            "dp_type": "none",
            "args": {
                "percentile": 75,
                "gamma": 1
            }
        },
        "retry_timeout": 300,
        "ssl_private_key": "resources/certs/client1.key",
        "ssl_cert": "resources/certs/client1.crt",
        "ssl_root_cert": "resources/certs/rootCA.pem"
    }
}

Hi
Thanks for your interest in Clara Train SDK. I am glad you managed to get FL up with 2 clients.

It looks like you have set num_rounds to 100, but it is not being respected. I also see you are missing num_rounds_per_valid.
I would change it to something like:

            "start_round": 0,
            "num_rounds": 2,
            "num_rounds_per_valid": 1,   <---- add this 

Also, on the client side, could you try changing "local_epochs" to 5 to see if the client and server are actually reading these files.
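
For example, the "client" section of config_fed_client1.json would start like this (only local_epochs changes; the other keys are taken from the config you posted):

    "client": {
        "local_epochs": 5,      <---- changed from 1
        "steps_aggregation": 10,
        "exclude_vars": "dummy",
        ...
    }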

On the bright side, we are about to release Clara Train V3.1 with multiple sample notebooks that simplify the FL workflow. Please stay tuned for this release; it mainly focuses on FL and we would love to hear your feedback.

Hi,
Thank you for answering so soon. I tried adding "num_rounds_per_valid" and changing "local_epochs" to 5, but the training still terminates right after the first round ends.
Can anyone kindly help me fix it?

Thank you…

The logs/configs are as follows:

server_train.sh:

#!/usr/bin/env bash

my_dir="$(dirname "$0")"
. $my_dir/set_env.sh
echo "MMAR_ROOT set to $MMAR_ROOT"

CONFIG_FILE=config/config_train.json
SERVER_FILE=config/config_fed_server.json
ENVIRONMENT_FILE=config/env_server.json

TF_ENABLE_AUTO_MIXED_PRECISION=0
python3 -u -m nvmidl.apps.fed_learn.server.fed_aggregate \
    -m $MMAR_ROOT \
    -c $CONFIG_FILE \
    -e $ENVIRONMENT_FILE \
    -s $SERVER_FILE \
    --set \
    secure_train=true

Logs on the server:

2020-10-30 16:42:07,878 - ServerModelManager - INFO - CLEAN START (global_variables_initializer)
2020-10-30 16:42:10,041 - FederatedServer - INFO - Round time: less than a second(s).
2020-10-30 16:42:10,048 - FederatedServer - INFO - starting secure server at fedserver:8008
2020-10-30 16:42:32,758 - FederatedServer - INFO - Client: New client @192.168.211.17 joined. Sent token: 9b7655f7-3e9b-4109-9534-f5978d2dab12. Total clients: 1
2020-10-30 16:43:20,067 - FederatedServer - INFO - Client: New client @192.168.211.17 joined. Sent token: 800245e0-6d97-482d-8c4e-614730d11911. Total clients: 2
2020-10-30 16:43:46,987 - FederatedServer - INFO - Client: 9b7655f7-3e9b-4109-9534-f5978d2dab12 left. Total clients: 1
2020-10-30 16:44:33,765 - FederatedServer - INFO - Client: 800245e0-6d97-482d-8c4e-614730d11911 left. Total clients: 0

Logs on the clients:

Requested train iterations: 122
2020-10-30 16:42:32,758 - FederatedClient - INFO - Successfully registered client: for brats_segmentation. Got token:9b7655f7-3e9b-4109-9534-f5978d2dab12
2020-10-30 16:42:33,139 - FederatedClient - INFO - Received brats_segmentation model at round 0 (18808374 Bytes)
2020-10-30 16:42:33,527 - AssignVariables - INFO - Vars from remote 83, Vars from local 252, vars matched 83 of 252 local
2020-10-30 16:42:33,531 - ClientModelManager - INFO - Setting graph with global federated model data (4700914 elements)
2020-10-30 16:42:33,532 - ClientModelManager - INFO - Round 0: local model updated
2020-10-30 16:42:36.327256: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
Epoch: 1/5, mean_dice_tc: 0.0056 val_time: 17.49s
2020-10-30 16:42:51,025 - SupervisedFitter - INFO - New best val metric: 0.005568110849708319
2020-10-30 16:42:51,025 - SupervisedFitter - INFO - Saving model checkpoint at: /workspace/fed/commands/…/models/model.ckpt
Epoch: 1, Iter: 1/122 [ ] train_dice_tc: 0.0161 train_loss: 0.9530 time: 23.78s
Epoch: 1, Iter: 2/122 [ ] train_dice_tc: 0.0298 train_loss: 0.9546 time: 1.11s
Epoch: 1, Iter: 3/122 [ ] train_dice_tc: 0.0372 train_loss: 0.9510 time: 1.11s
Epoch: 1, Iter: 4/122 [ ] train_dice_tc: 0.0307 train_loss: 0.9606 time: 1.11s
Epoch: 1, Iter: 5/122 [ ] train_dice_tc: 0.0317 train_loss: 0.9604 time: 1.11s
Epoch: 1, Iter: 6/122 [ ] train_dice_tc: 0.0319 train_loss: 0.9608 time: 1.11s
Epoch: 1, Iter: 7/122 [= ] train_dice_tc: 0.0304 train_loss: 0.9637 time: 1.11s
Epoch: 1, Iter: 8/122 [= ] train_dice_tc: 0.0277 train_loss: 0.9672 time: 1.11s
Epoch: 1, Iter: 9/122 [= ] train_dice_tc: 0.0404 train_loss: 0.9546 time: 1.11s
Epoch: 1, Iter: 10/122 [= ] train_dice_tc: 0.0365 train_loss: 0.9591 time: 1.11s
Epoch: 1, Iter: 11/122 [= ] train_dice_tc: 0.0338 train_loss: 0.9619 time: 1.11s
Epoch: 1, Iter: 12/122 [= ] train_dice_tc: 0.0326 train_loss: 0.9632 time: 1.11s
Epoch: 1, Iter: 13/122 [== ] train_dice_tc: 0.0329 train_loss: 0.9629 time: 1.11s
Epoch: 1, Iter: 14/122 [== ] train_dice_tc: 0.0318 train_loss: 0.9646 time: 1.11s
Epoch: 1, Iter: 15/122 [== ] train_dice_tc: 0.0308 train_loss: 0.9655 time: 1.11s
Epoch: 1, Iter: 16/122 [== ] train_dice_tc: 0.0297 train_loss: 0.9666 time: 1.12s
Epoch: 1, Iter: 17/122 [== ] train_dice_tc: 0.0310 train_loss: 0.9647 time: 1.11s
Epoch: 1, Iter: 18/122 [== ] train_dice_tc: 0.0306 train_loss: 0.9648 time: 1.11s
Epoch: 1, Iter: 19/122 [=== ] train_dice_tc: 0.0367 train_loss: 0.9593 time: 1.11s
Epoch: 1, Iter: 20/122 [=== ] train_dice_tc: 0.0381 train_loss: 0.9560 time: 1.11s
2020-10-30 16:43:36,967 - FederatedClient - INFO - Shutting down client
2020-10-30 16:43:36,968 - FederatedClient - INFO - Quitting server: brats_segmentation
2020-10-30 16:43:46,988 - FederatedClient - INFO - Received comment from server: Removed client
2020-10-30 16:43:47,014 - main - INFO - Total Training Time 92.51741528511047

config_train.json:

{
    "epochs": 1250,
    "num_training_epoch_per_valid": 20,
    "train_summary_recording_interval": 10,
    "use_scanning_window": false,
    "multi_gpu": false,
    "learning_rate": 1e-4,
    "use_amp": false,
    "train": {
        "loss": {
            "name": "Dice",
            "args": {
                "squared_pred": true,
                "is_onehot_targets": false,
                "skip_background": true
            }
        },
        "optimizer": {
            "name": "Adam"
        },
        …omitted…

}

config_fed_server.json:

"servers": [
    {
        "name": "brats_segmentation",
        "service": {
            "target": "fedserver:8008",
            "options": [
                ["grpc.max_send_message_length",    1000000000],
                ["grpc.max_receive_message_length", 1000000000]
            ]
        },
        "ssl_private_key": "resources/certs/server.key",
        "ssl_cert": "resources/certs/server.crt",
        "ssl_root_cert": "resources/certs/rootCA.pem",
        "min_num_clients": 2,
        "max_num_clients": 5,
    "wait_after_min_clients": 100,
    "heart_beat_timeout": 600,
        "start_round": 0,
        "num_rounds": 2,
    "num_rounds_per_valid": 1,
        "exclude_vars": "(Adam|beta.*power|global_step)",
        "num_server_workers": 5 
    }
],
    "aggregator": {
    "name": "ModelAggregator",
    "args": {
        "exclude_vars": "dummy",
        "aggregation_weights": {
            "client0": 1,
            "client1": 1.5,
            "client2": 0.8
        }
    }
}

}

The config_fed_client1.json:

"servers": [
    {
        "name": "brats_segmentation",
        "service": {
            "target": "fedserver:8008",
            "options": [
                ["grpc.max_send_message_length",    1000000000],
                ["grpc.max_receive_message_length", 1000000000]
            ]
        }
    }
],
"client": {
    "local_epochs": 5,
    "steps_aggregation": 20,
    "exclude_vars": "dummy",
    "privacy": {
        "name": "PercentileProtocol",
        "dp_type": "none",
        "args":{
        "percentile": 75,
        "gamma": 1
        }
    },
    "retry_timeout": 300,
    "ssl_private_key": "resources/certs/client2.key",
    "ssl_cert": "resources/certs/client2.crt",
    "ssl_root_cert": "resources/certs/rootCA.pem"
}

}

Hi

It seems that when "steps_aggregation" is set to a non-zero value, it overrides "local_epochs", so that change did not help.
The log shows that the FL client finished the first round of training but did not submit the trained model to the server for aggregation as the next step; it went straight to shutdown. Unfortunately, the log does not contain enough information for further troubleshooting, and this is for V3.0.
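
If you want "local_epochs" to drive the length of each local round, one thing worth trying (based on the override behavior described above; I have not verified this on your setup) is setting "steps_aggregation" to 0 in config_fed_client1.json so it no longer takes precedence:

    "client": {
        "local_epochs": 5,
        "steps_aggregation": 0,   <---- 0 so local_epochs is used
        ...
    }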

We hope this is fixed in V3.1. If you can wait until the end of next week, that is when we expect to have V3.1 out. Otherwise, for V3.0, you could compare your setup with the one available in the notebooks at https://ngc.nvidia.com/catalog/resources/nvidia:med:federated_learning; maybe there is something different that is causing this.

Please stay tuned for Clara train V3.1

Hi all,
I found that after adding

    "data_assembler": {
        "name": "DataAssembler"
    },

to config_fed_client.json, the problem was solved.
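
For anyone running into the same thing, here is roughly where the entry sits in my config_fed_client.json (I put it inside the "client" section next to the existing keys; I am not sure whether the exact position matters):

    "client": {
        ...
        "data_assembler": {
            "name": "DataAssembler"
        },
        ...
    }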

However, I notice that the global models received by the 2 clients at the beginning of each round (except round 0) are different. In the federated learning workflow, the weights should be the same at the beginning of each round. Is this normal in federated learning, or is my understanding wrong?

Thanks…