Hello, everyone,
I have just set up a federated learning cluster using Clara SDK 4.0. It runs on a single machine (localhost) and includes one server and two clients (ikang-a and ikang-b). Everything worked well across many experiments. However, the admin side suddenly can no longer reach the two clients, for a reason I cannot identify. I have checked the config files and they are unchanged. This is very strange, because:
- Both clients report that they registered with the server successfully.
Successfully registered client:ikang-a for XXX. Got token:9dd0a987-3ad7-4175-b07e-c45fcbe1c7fa
Successfully registered client:ikang-b for XXX. Got token:97c4e45c-a555-43ec-979b-325978c01736
- The server reports that two clients have joined.
New client ikang-a@xxx.xxx.xxx.xxx joined. Sent token: 9dd0a987-3ad7-4175-b07e-c45fcbe1c7fa. Total clients: 1
New client ikang-b@xxx.xxx.xxx.xxx joined. Sent token: 97c4e45c-a555-43ec-979b-325978c01736. Total clients: 2
- The “check_status server” command on the admin side returns normally.
check_status server
FL run number has not been set.
FL server status: training not started
Registered clients: 2
| CLIENT NAME | TOKEN | LAST ACCEPTED ROUND | CONTRIBUTION COUNT |
| ikang-a | 9dd0a987-3ad7-4175-b07e-c45fcbe1c7fa | | 0 |
| ikang-b | 97c4e45c-a555-43ec-979b-325978c01736 | | 0 |
- But “check_status client” returns “No replies” for both clients.
check_status client
instance:ikang-a : No replies
instance:ikang-b : No replies
As a result, starting training on the clients fails.
I have no idea how to fix this problem.
The version of nvflare is: 0.1.4.
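In case it helps anyone reproduce this, here is a minimal sketch of how the installed version can be confirmed from Python. It assumes the package is installed under the distribution name `nvflare`; the helper name `installed_version` is only for illustration and is not part of nvflare.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this Python environment.
        return None


if __name__ == "__main__":
    # "nvflare" is the assumed distribution name.
    print("nvflare:", installed_version("nvflare") or "not installed")
```

Running it in the same Python environment that starts the server, the clients, and the admin client would show whether all three are using the same nvflare version.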
By the way, I tried to read the nvflare code to locate the problem, but it is only available in .pyc format. The current version of nvflare on GitHub is 2.0.0+. Maybe I should upgrade nvflare?
Thanks for your help!
Steven