Client certificate verification handshake failure & admin connectivity

Hello all,

We are using CLARA Train v4.0-EA1 to provision packages for the client, server & admin. We generated the provisioning packages from the /opt/nvidia/medical/tools/packages folder.

The server started successfully, as shown below:

root@flc1:/workspace/startup# ./start.sh
WORKSPACE set to /workspace/startup/..
root@flc1:/workspace/startup# WORKSPACE set to /workspace/startup/..
2021-02-03 23:59:16,088 - torch.distributed.nn.jit.instantiator - INFO - Created a temporary directory at /tmp/tmped5vzdki
2021-02-03 23:59:16,088 - torch.distributed.nn.jit.instantiator - INFO - Writing /tmp/tmped5vzdki/_remote_module_non_sriptable.py
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /root/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth
100%|██████████| 97.8M/97.8M [00:03<00:00, 29.6MB/s]
2021-02-03 23:59:20,797 - BaseServer - INFO - Round time: 1612396760 second(s).
2021-02-03 23:59:20,803 - BaseServer - INFO - starting secure server at flc1:8002
deployed FL server trainer.
Starting Admin Server flc1 on Port 8003
Server has been started.

Issue #1:
Client fails certificate verification

siddharth@FLS-1:/workspace/startup$ ./start.sh
WORKSPACE set to /workspace/startup/..
siddharth@FLS-1:/workspace/startup$ WORKSPACE set to /workspace/startup/..
PYTHONPATH is /local/custom::/opt/nvidia/medical
start fl because of no pid.fl
new pid 363
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-nkdfbv_a because the default path (/home/siddharth/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2021-02-04 00:07:40,158 - torch.distributed.nn.jit.instantiator - INFO - Created a temporary directory at /tmp/tmpd1fa8b53
2021-02-04 00:07:40,159 - torch.distributed.nn.jit.instantiator - INFO - Writing /tmp/tmpd1fa8b53/_remote_module_non_sriptable.py
2021-02-04 00:07:40,205 - ClientModelManager - INFO - privacy module disabled.
E0204 00:07:40.231634660 411 ssl_transport_security.cc:1439] Handshake failed with fatal error SSL_ERROR_SSL: error:1000007d:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED.
2021-02-04 00:07:40,232 - Communicator - INFO - Action: client_registration grpc communication error. retry: 1500, First start till now: 0.015082359313964844 seconds.
Could not connect to server: flc1:8002 Setting flag for stopping training. failed to connect to all addresses
2021-02-04 00:07:40,232 - Communicator - INFO - <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1612397260.231747538","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4089,"referenced_errors":[{"created":"@1612397260.231742662","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}"

Issue #2:
Admin connectivity issue:
clara_hci-4.0.0-py3-none-any.whl client.crt client.key docker.sh fl_admin.sh readme.txt rootCA.pem signature.pkl
root@flc1:/workspace/startup# ./fl_admin.sh
Admin Server: flc1 on port 8003
User Name: admin@nvidia.com
Communication Error - please try later

docker.sh uses the host network:

#!/usr/bin/env bash
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
# docker run script for FL admin
# to use host network, use line below
NETARG="--net=host"
# Admin clients do not need to open ports, so the following line is not needed.
#NETARG="-p 8003:8003"
DOCKER_IMAGE=nvcr.io/ea-nvidia-clara-train/clara-train-sdk:v4.0-EA1
echo "Starting docker with $DOCKER_IMAGE"
docker run --rm -it --name=fladmin -v $DIR/..:/workspace/ -w /workspace/ --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 $NETARG $DOCKER_IMAGE /bin/bash

Kindly suggest how to resolve both issues.

Thanks,
Siddharth

You are using “flc1” as the host CN name for the server. Both the FL client and the admin client failed to establish a connection to the server. Can you verify that your FL client and admin machines can reach the FL server machine through the host name “flc1”? Also, can you verify that ports 8002 and 8003 are open on the server?

Hi yuhongw,

After starting the server, running netstat -tulpn | grep LISTEN shows:

tcp        0      0 10.65.199.142:8003      0.0.0.0:*               LISTEN      -   
tcp6       0      0 10.65.199.142:8002      :::*                    LISTEN      -

Also, after starting the server, running telnet from the client gives:

telnet flc1 8002
Trying 10.65.199.142...
Connected to flc1.

telnet flc1 8003
Trying 10.65.199.142...
Connected to flc1.
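For anyone repeating the telnet checks above from a script, here is a minimal Python sketch of the same TCP reachability test. The function name port_is_open is hypothetical and only for illustration; the host flc1 and ports 8002 and 8003 come from this thread. Note that a successful TCP connection only shows the port is reachable. It says nothing about whether the TLS handshake will succeed.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds.

    Hypothetical helper: roughly what `telnet host port` tells you,
    without any TLS or application-level exchange.
    """
    try:
        # create_connection resolves the name and connects; the context
        # manager closes the socket again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False
```

On the client machine this could be called as port_is_open("flc1", 8002) for the FL server port and port_is_open("flc1", 8003) for the admin port.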

It looks like the network and port setup are okay. Do you have “--net=host” in your FL server start docker script? Otherwise, let’s try regenerating the startup kits with the provisioning tool, because the FL client is also complaining about an SSL certificate handshake error.

Please also check whether you can “ping flc1” from inside the FL server docker container. If not, you can add an entry “10.65.199.142 flc1” to /etc/hosts.
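Since the ports are reachable but the client reports CERTIFICATE_VERIFY_FAILED, it can help to test the TLS handshake separately from the FL client. The sketch below is only an illustration under some assumptions: check_tls_handshake is a hypothetical helper, it uses the rootCA.pem from the startup kit as the only trusted CA, and it does not present a client certificate. If the FL server requires a client certificate, the connection may still be rejected for that reason, so only a "certificate verification failed" result is really conclusive here.

```python
import socket
import ssl


def check_tls_handshake(host, port, ca_file, timeout=5.0):
    """Attempt a TLS handshake with host:port, trusting only ca_file.

    Hypothetical helper. Returns (ok, detail). Only the server
    certificate chain and host name are verified; no client
    certificate is sent.
    """
    try:
        # With cafile given, the system CA store is not loaded.
        context = ssl.create_default_context(cafile=ca_file)
    except (OSError, ssl.SSLError) as exc:
        return False, f"could not load CA file {ca_file}: {exc}"
    # gRPC runs over HTTP/2, so advertise it via ALPN.
    context.set_alpn_protocols(["h2"])
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # server_hostname is checked against the certificate names.
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return True, f"handshake ok ({tls.version()})"
    except ssl.SSLCertVerificationError as exc:
        return False, f"certificate verification failed: {exc.verify_message}"
    except (ssl.SSLError, OSError) as exc:
        return False, f"connection or TLS error: {exc}"
```

From the client's startup folder this could be called as check_tls_handshake("flc1", 8002, "rootCA.pem"). A verification failure would suggest that the server certificate was not signed by the CA in that kit, or that its name does not match flc1, which is consistent with regenerating all the startup kits together.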

Hi yuhongw,

Yes, --net=host is present in the server script. Hence, we are able to connect to the server as shown.

Please note, flc1 is our server machine. This entry has been added to the /etc/hosts file as below:

/etc/hosts on server
10.65.199.142 flc1

/etc/hosts on client
10.65.199.142 flc1
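The /etc/hosts entries above can also be checked with a few lines of Python, which is handy when comparing several machines. This is a minimal sketch: lookup_in_hosts is a hypothetical helper that works on the text of a hosts file and does not replace the system resolver.

```python
def lookup_in_hosts(hosts_text, hostname):
    """Return every IP address that hosts_text maps to hostname.

    Hypothetical helper. Comments and blank lines are ignored and
    host names are compared case-insensitively.
    """
    wanted = hostname.lower()
    matches = []
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], [name.lower() for name in fields[1:]]
        if wanted in names:
            matches.append(ip)
    return matches
```

For example, passing the contents of /etc/hosts (read with open("/etc/hosts").read()) and the name "flc1" should return a list containing 10.65.199.142 on both machines in this setup. An empty list means the name is missing or misspelled in the file.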

Pinging flc1 on server (loopback)
ping flc1

PING flc1 (10.65.199.142) 56(84) bytes of data.
64 bytes from flc1 (10.65.199.142): icmp_seq=1 ttl=64 time=0.045 ms
64 bytes from flc1 (10.65.199.142): icmp_seq=2 ttl=64 time=0.022 ms
64 bytes from flc1 (10.65.199.142): icmp_seq=3 ttl=64 time=0.022 ms
64 bytes from flc1 (10.65.199.142): icmp_seq=4 ttl=64 time=0.022 ms

--- flc1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3072ms

Pinging flc1 on client

ping flc1
PING flc1 (10.65.199.142) 56(84) bytes of data.
64 bytes from AMD-4U-8GPU (10.65.199.142): icmp_seq=1 ttl=64 time=0.284 ms
64 bytes from AMD-4U-8GPU (10.65.199.142): icmp_seq=2 ttl=64 time=0.256 ms
64 bytes from AMD-4U-8GPU (10.65.199.142): icmp_seq=3 ttl=64 time=0.245 ms

--- flc1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2026ms

I will retry the provisioning tool and will post an update if I see any difference. However, the issue doesn’t seem to be with the networking itself, as we have tested this setup previously with CLARA FL 3.1.

Hi yuhongw,

Have you had a chance to look into this? Please advise if you have any thoughts.

Thank you

Hi Siddharth,

Have you tried regenerating the server and client packages using the provisioning tool? What are your results with the new packages?

Hi yuhongw,

It seems to be working well after regeneration. I believe this was an intermittent error. Thank you for your help!