We run both the server and a client on the same instance, with only one client at the moment; the client's Docker container gets killed after running out of memory.
A graph of CPU and RAM usage can be found here.
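For reproducing a usage graph like the one above, memory can be sampled periodically from inside the process. A minimal stdlib-only sketch (not the tool actually used here; `peak_rss_mb` and `log_memory` are illustrative names):

```python
import resource
import time


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MiB.

    Note: ru_maxrss is reported in kilobytes on Linux but in
    bytes on macOS; this conversion assumes Linux.
    """
    rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss_kib / 1024


def log_memory(interval_s: float = 5.0, samples: int = 3) -> list:
    """Sample peak RSS a few times, printing each reading."""
    readings = []
    for _ in range(samples):
        mb = peak_rss_mb()
        readings.append(mb)
        print(f"peak RSS: {mb:.1f} MiB")
        time.sleep(interval_s)
    return readings
```

Running this alongside the FL server (or client) for a few rounds would show whether memory grows monotonically per round, which is the symptom discussed below.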
Thanks for your interest in Clara Train Federated Learning. The memory utilization appears to be abnormally high; would you please share the client.log file (logs/client_0_log.tmp) from this run?
Thanks. Sorry, would you also share the server log, and perhaps the client MMAR (the config/command etc… directory under clara_seg_ct_spleen_fl_demo)? Since that directory also contains all the logs, perhaps send the entire archive?
I left the models folder out because it is large (around 1.6 GB); the rest should be in the zip file.
clara_seg_ct_spleen_fl_demo.zip (55.8 KB)
We solved the NaN/INF error by turning off AMP. As for the out-of-memory issue, it seems my config file doesn't use smart cache for training.
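For illustration, disabling AMP might look like the following in the MMAR's training config; the key name and surrounding structure are assumptions based on a typical Clara Train `config_train.json`, not taken from the attached archive:

```json
{
  "use_amp": false
}
```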
The out-of-memory issue is due to gRPC's allocations for handling concurrent communications with the client/worker nodes. Once a communication completes, gRPC does not release the memory, which results in excessive memory utilization. To that end, would you please try reducing "num_server_workers" to 5 or so? I believe it's currently set to 100.
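As a sketch, the change might look like this in the server-side FL config; only the `num_server_workers` key and the value 5 come from this thread, while the file layout and other names are assumptions about a typical Clara Train FL server config:

```json
{
  "servers": [
    {
      "name": "clara_seg_ct_spleen_fl_demo",
      "num_server_workers": 5
    }
  ]
}
```

Fewer workers means fewer concurrent gRPC channels holding on to buffers, at the cost of serving fewer clients in parallel; with a single client, 5 should be more than enough.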
Hope this helps.
Reducing num_server_workers seems to help with the amount of memory used. Memory usage still increases every federated learning round, but clearly less than before, when num_server_workers was set to 100. We can now run federated learning for more rounds; thanks for all the help.