Hello all,
We are using Clara Train v4.0-EA1 to provision packages for the client, server and admin. We generated the provisioning packages from the /opt/nvidia/medical/tools/packages folder.
The server started successfully, as shown below:
root@flc1:/workspace/startup# ./start.sh
WORKSPACE set to /workspace/startup/..
root@flc1:/workspace/startup# WORKSPACE set to /workspace/startup/..
2021-02-03 23:59:16,088 - torch.distributed.nn.jit.instantiator - INFO - Created a temporary directory at /tmp/tmped5vzdki
2021-02-03 23:59:16,088 - torch.distributed.nn.jit.instantiator - INFO - Writing /tmp/tmped5vzdki/_remote_module_non_sriptable.py
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /root/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth
100%|██████████| 97.8M/97.8M [00:03<00:00, 29.6MB/s]
2021-02-03 23:59:20,797 - BaseServer - INFO - Round time: 1612396760 second(s).
2021-02-03 23:59:20,803 - BaseServer - INFO - starting secure server at flc1:8002
deployed FL server trainer.
Starting Admin Server flc1 on Port 8003
Server has been started.
Issue #1:
The client fails certificate verification when it tries to register with the server:
siddharth@FLS-1:/workspace/startup$ ./start.sh
WORKSPACE set to /workspace/startup/..
siddharth@FLS-1:/workspace/startup$ WORKSPACE set to /workspace/startup/..
PYTHONPATH is /local/custom::/opt/nvidia/medical
start fl because of no pid.fl
new pid 363
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-nkdfbv_a because the default path (/home/siddharth/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2021-02-04 00:07:40,158 - torch.distributed.nn.jit.instantiator - INFO - Created a temporary directory at /tmp/tmpd1fa8b53
2021-02-04 00:07:40,159 - torch.distributed.nn.jit.instantiator - INFO - Writing /tmp/tmpd1fa8b53/_remote_module_non_sriptable.py
2021-02-04 00:07:40,205 - ClientModelManager - INFO - privacy module disabled.
E0204 00:07:40.231634660 411 ssl_transport_security.cc:1439] Handshake failed with fatal error SSL_ERROR_SSL: error:1000007d:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED.
2021-02-04 00:07:40,232 - Communicator - INFO - Action: client_registration grpc communication error. retry: 1500, First start till now: 0.015082359313964844 seconds.
Could not connect to server: flc1:8002 Setting flag for stopping training. failed to connect to all addresses
2021-02-04 00:07:40,232 - Communicator - INFO - <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1612397260.231747538","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4089,"referenced_errors":[{"created":"@1612397260.231742662","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}"
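To narrow this down, the handshake could be reproduced outside of gRPC with a short script. This is only a minimal sketch: check_tls is a hypothetical helper written for this post, not part of Clara Train, and it assumes that the client's startup folder contains rootCA.pem, client.crt and client.key under the same names as in the admin folder listing under Issue #2.

```python
import socket
import ssl


def check_tls(host, port, ca_file, cert_file=None, key_file=None, timeout=5.0):
    """Attempt a TLS handshake with host:port and return (ok, detail).

    Only the CA in ca_file is trusted, which mirrors a setup where every
    certificate is signed by the provisioning root CA.
    """
    try:
        # Trust only the given root CA and verify the server hostname.
        ctx = ssl.create_default_context(cafile=ca_file)
        if cert_file and key_file:
            # Present the client certificate in case the server requires one.
            ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except ssl.SSLCertVerificationError as exc:
        return False, f"certificate verification failed: {exc.verify_message}"
    except (ssl.SSLError, OSError) as exc:
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    # Host and port are taken from the server log; the file names are assumed.
    ok, detail = check_tls("flc1", 8002, "rootCA.pem", "client.crt", "client.key")
    print("handshake ok" if ok else "handshake failed", "-", detail)
```

If this also reports a verification failure, the likely candidates would be a rootCA.pem on the client that does not belong to the CA that signed the server certificate (for example, packages taken from different provisioning runs), a server certificate issued for a name other than flc1, or a clock difference between the two machines.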
Issue #2:
Admin connectivity issue. The admin startup folder contains the following files:
clara_hci-4.0.0-py3-none-any.whl client.crt client.key docker.sh fl_admin.sh readme.txt rootCA.pem signature.pkl
root@flc1:/workspace/startup# ./fl_admin.sh
Admin Server: flc1 on port 8003
User Name: admin@nvidia.com
Communication Error - please try later
docker.sh uses the host network:
#!/usr/bin/env bash
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
# docker run script for FL admin
# to use host network, use line below
NETARG="--net=host"
# Admin clients do not need to open ports, so the following line is not needed.
#NETARG="-p 8003:8003"
DOCKER_IMAGE=nvcr.io/ea-nvidia-clara-train/clara-train-sdk:v4.0-EA1
echo "Starting docker with $DOCKER_IMAGE"
docker run --rm -it --name=fladmin -v $DIR/..:/workspace/ -w /workspace/ --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 $NETARG $DOCKER_IMAGE /bin/bash
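To separate a plain network problem from a certificate problem, a basic TCP check could be run from inside the admin container. This is a rough sketch under the assumption that the admin client only needs to reach the ports printed in the server log (8002 for FL, 8003 for admin); port_reachable is a hypothetical helper name.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    # flc1, 8002 and 8003 are the host and ports shown in the server log.
    for port in (8002, 8003):
        state = "reachable" if port_reachable("flc1", port) else "NOT reachable"
        print(f"flc1:{port} is {state}")
```

If port 8003 is reachable and fl_admin.sh still reports a communication error, the cause is more likely on the TLS side than in the Docker network settings.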
Kindly suggest how to resolve both issues.
Thanks,
Siddharth