Clara Train federated learning - server launches correctly but client cannot connect

Hi all,

I’m trying to launch a federated training run using Clara Train v4, but I cannot establish a server-client connection.

I generate the server and client packages using the provision tool and the following project.yml file:

project.yml (2.3 KB)

I can launch the server correctly (as shown by the terminal output):

server.txt (441 Bytes)

However, when I try to launch the client and connect it to the server, it fails:

client_error.txt (938 Bytes)

I launch the client and server Docker containers with the following files:

server_docker.txt (583 Bytes)
client_docker.txt (753 Bytes)

Do you have any clue what could be going wrong?

Thanks in advance,

Gonzalo Quintana

Hi

It seems the client cannot reach MyServer. I see your Docker files use the host network. Is MyServer the actual name of your machine? I guess not, so FL doesn’t know how to resolve this name. You could add it to /etc/hosts on the client machine as:

<your server's real IP, e.g. 123.32.2.4, not 127.0.0.1>  MyServer
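To confirm whether the name resolves and the server port is reachable from the client machine, something like the following could help. This is only a minimal sketch using the Python standard library, not part of Clara Train; `check_server` is a hypothetical helper name, and the port you pass must be whatever FL port is set in your project.yml.

```python
import socket


def check_server(hostname, port, timeout=3.0):
    """Return (ip, reachable).

    ip is the IPv4 address the hostname resolves to, or None if it does
    not resolve. reachable is True only if a TCP connection to that
    address and port succeeds within the timeout.
    """
    try:
        # Uses the system resolver, which consults /etc/hosts as well as DNS.
        ip = socket.gethostbyname(hostname)
    except OSError:
        # Name could not be resolved (socket.gaierror is an OSError).
        return None, False
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        # Resolved, but nothing accepted the connection (refused, timed out,
        # or blocked by a firewall).
        return ip, False
```

Calling `check_server("MyServer", <your FL port>)` from the client side should return the server's real IP and `True`. A result of `(None, False)` points to name resolution, while an IP starting with `127.` means the name still maps to the loopback address rather than the real server.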

However, before doing that, I strongly recommend going through all the FL notebooks. They walk you through FL step by step, from running the clients and server in the same Docker container all the way to running clients and servers on different machines.

All the FL notebooks are in the NVIDIA/clara-train-examples repository on GitHub, under PyTorch/NoteBooks/FL (master branch).

Hope this helps