Federated learning - Client unable to send model to server


Training on the client completed successfully for all epochs. However, once training finishes on the client's end, the client is unable to push the model to the server.

Send model to server.
2021-04-12 07:35:23,680 - FederatedClient - INFO - Starting to push model.
setup local_model_dict
local_model_dict 48
2021-04-12 07:35:23,729 - Communicator - INFO - Send example_project at round 0
2021-04-12 07:35:23,749 - Communicator - INFO - Action: submitUpdate grpc communication error. retry: 6, First start till now: 0.020232439041137695 seconds.
Could not connect to server: flc1:8002 Setting flag for stopping training. failed to connect to all addresses
2021-04-12 07:35:23,750 - Communicator - INFO - <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1618212923.749492913","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4089,"referenced_errors":[{"created":"@1618212923.749488888","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}"

It seems like the server times out. When we decreased the number of epochs, the client was able to share the model with the server.
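To separate a server-side timeout from a plain network problem, a minimal sketch like the one below could be run from the client machine right after training ends. It only checks whether the address from the log (flc1:8002) still accepts TCP connections. The helper name `can_connect` and the 3-second timeout are assumptions for illustration and are not part of the federated learning SDK.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call, using the server address reported in the log above:
#   can_connect("flc1", 8002)
```

If this returns False once training is done but True before training starts, that would be consistent with the server process having stopped or become unreachable in the meantime. A True result only shows the port is open, not that the gRPC service behind it is healthy.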

  • Can we increase the server timeout value so that the server stays alive for any given number of epochs?

  • The client tries to find model.pt in the client run folder and fails with:

details = "Exception calling application: [Errno 2] No such file or directory: '/workspace/startup/…/run_102/mmar_org1-a/models/model.pt'"

I do see that a model checkpoint, final_model.pt, has been created in the client run folder. How do we avoid the above exception so that final_model.pt is used instead?
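For reference, this is roughly how the checkpoint files in the models folder can be listed to confirm which ones exist. It is a small sketch: the function name `list_checkpoints` is hypothetical, and the directory argument is a placeholder to be replaced with the actual models folder of the run.

```python
from pathlib import Path
from typing import List


def list_checkpoints(models_dir: str) -> List[str]:
    """Return the sorted names of all .pt files directly inside models_dir."""
    folder = Path(models_dir)
    if not folder.is_dir():
        # A missing folder is reported as "no checkpoints" rather than an error.
        return []
    return sorted(p.name for p in folder.glob("*.pt"))


# Example call (placeholder path, replace with the real models folder):
#   list_checkpoints("/path/to/run/models")
```

In our case this shows final_model.pt but no model.pt.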

Any suggestions on the above would be appreciated.