Federated learning - brats segmentation

Hello,

We are trying to implement federated learning on the brats_segmentation model, as shown here: https://ngc.nvidia.com/models/ea-nvidia-clara-train:clara_pt_brain_mri_segmentation

The MMAR does not include config_fed_client.json or config_fed_server.json.

We adapted these files from the Clara Train 3.1 MMARs:
config_fed_server.json (1.5 KB) config_fed_client.json (719 Bytes)

We receive the following error when starting the server:

Starting Admin Server flc1 on Port 8003
Server has been started.
2021-02-21 16:06:00,581 - ClientManager - INFO - Client: New client org1-a@10.65.199.147 joined. Sent token: 8489aace-7fa6-48a8-bd30-4cf780fa7200.  Total clients: 1
2021-02-21 16:06:19,481 - ClientManager - INFO - Client: New client org1-b@10.65.199.147 joined. Sent token: a71751ee-d132-4df9-a43c-0520bf50b6ce.  Total clients: 2
Check server status.
Error processing config /workspace/startup/../run_20/mmar_server/config/config_train.json: local variable 'trainer' referenced before assignment
Traceback (most recent call last):
  File "server/sai.py", line 368, in start_server_training
  File "utils/wfconf.py", line 163, in configure
  File "utils/wfconf.py", line 158, in configure
  File "utils/wfconf.py", line 154, in _do_configure
  File "apps/fed_learn/fl_conf.py", line 197, in finalize_config
UnboundLocalError: local variable 'trainer' referenced before assignment
FL server execution exception: local variable 'trainer' referenced before assignment
2021-02-21 16:07:53,145 - BaseServer - INFO - Stopping server training...
2021-02-21 16:07:53,146 - ServerModelManager - INFO - closing the model manager
2021-02-21 16:07:53,146 - BaseServer - INFO - Round time: 158 second(s).

A couple of questions:

Q1. What are the valid configurations for config_fed_server.json and config_fed_client.json for this brats_segmentation MMAR? How do we resolve the above error?

Q2. How do we decide which pre_processors and post_processors have to be used for a particular MMAR?

Thanks,
Siddharth

Any clue on this?

Hi,
Thanks for your interest in the Clara Train SDK.
The error seems to be related to config_train.json. Can you attach the full MMAR you are using?
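
For readers unfamiliar with this class of error, here is a minimal sketch of how Python raises UnboundLocalError. It is not the Clara Train source: the function body, the "components" list and the "role" key are hypothetical, chosen only to show a local variable that is assigned on one code path and read on all of them. If the configuration never triggers the assignment, the read fails, which is why a config file is a plausible first suspect.

```python
def finalize_config(config: dict) -> dict:
    """Hypothetical stand-in for a config finalizer (not the real fl_conf.py)."""
    for component in config.get("components", []):
        if component.get("role") == "trainer":
            # 'trainer' is only bound when a matching entry exists.
            trainer = component
    # If the loop never assigned it, this read raises UnboundLocalError.
    return trainer


try:
    finalize_config({"components": []})
except UnboundLocalError as exc:
    print(f"reproduced: {exc}")
```

The exact message text varies between Python versions, but the exception type is the same one shown in the traceback above.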

We also recommend going through the notebook examples at clara-train-examples/NoteBooks/FL at master · NVIDIA/clara-train-examples · GitHub.
They provide an easy setup for testing FL in a single Docker container.

For Q2 (pre_processors and post_processors): unless you want to write your own handler, you should stick with the out-of-the-box ones. Is there anything special you want to do?

Also, please make sure you are using the latest SDK, v3.1.01.

We are using clara-train/clara-train-sdk:v4.0-EA2

config_train.json (6.7 KB)

Issue is still there. Please help.

We receive the same error when using the config_train.json from clara-train-examples/config_train.json at 7b522adcf1fab9380cd77fbbbe0cc958fa197a0e · NVIDIA/clara-train-examples · GitHub
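
Since the error persists across different config_train.json files, one quick check is to confirm which config files the server run folder actually contains and whether each one parses. The sketch below is a generic helper, not part of the Clara Train SDK: the function name summarize_configs is hypothetical, and it only reports whether each file exists, parses as JSON, and what its top-level keys are. It makes no claim about which keys Clara expects.

```python
import json
from pathlib import Path

# File names taken from this thread.
CONFIG_NAMES = (
    "config_train.json",
    "config_fed_server.json",
    "config_fed_client.json",
)


def summarize_configs(config_dir, names=CONFIG_NAMES):
    """Return {file name: 'missing' | 'invalid JSON: ...' | sorted top-level keys}."""
    report = {}
    for name in names:
        path = Path(config_dir) / name
        if not path.is_file():
            report[name] = "missing"
            continue
        try:
            with path.open(encoding="utf-8") as fh:
                data = json.load(fh)
        except json.JSONDecodeError as exc:
            report[name] = f"invalid JSON: {exc}"
            continue
        # Dicts are summarized by their keys; anything else by its type name.
        report[name] = sorted(data) if isinstance(data, dict) else type(data).__name__
    return report
```

For the run in the log above, this could be called as summarize_configs("/workspace/run_20/mmar_server/config") and the result compared between the server and each client.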

Hi,
So you are using v4.0-EA2 to test FL.
Do the notebooks give an error out of the box on the spleen example, or did you change the config to the BraTS train config?
I am trying to replicate this on my end but don't get an error.

Hi aharouni,

Thank you for your response.

It seems this was an intermittent error that was resolved by regenerating the provisioning sites.