Out of memory replicating spleen segmentation demo

I run both the server and the client on the same instance, and currently run only one client. The client's Docker container was killed after running out of memory.
The CPU and RAM usage graphs can be found here.

Thanks for your interest in Clara Train Federated Learning. The memory utilization appears to be abnormally high; would you please share the client.log file (logs/client_0_log.tmp) from this run?

client_0_log.txt (85.7 KB)

Thanks! Sorry, would you also share the server log, and perhaps the client MMAR (the config/commands etc… directories under clara_seg_ct_spleen_fl_demo)? Since that directory also contains all the logs, perhaps send the entire archive?

I left the models folder out because it is large (around 1.6 GB); the rest should be in the zip file.
clara_seg_ct_spleen_fl_demo.zip (55.8 KB)

  1. I suspect the Smart Cache feature is causing the out-of-memory problem: https://docs.nvidia.com/clara/tlt-mi/clara-train-sdk-v3.0/nvmidl/additional_features/smart_cache.html
  2. Please also try turning off the AMP feature during training. With a large training set, a small batch size, and a large learning rate, the computed values can become numerically unstable and underflow or overflow (NaN or INF).
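For reference, this is a sketch of how AMP is typically toggled in the MMAR's training config. The file name (config/config_train.json) and the use_amp key are assumptions based on the standard Clara Train MMAR layout; check your own config for the exact key:

```json
{
  "use_amp": false,
  "epochs": 1250,
  "learning_rate": 1e-4
}
```

(Only the use_amp line needs to change; the other keys shown are placeholders for the rest of your existing config_train.json.)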

Turning off AMP solved the NaN/INF error. As for the out-of-memory issue, it seems my config file doesn't use Smart Cache for training.

Hello Phuchit,

The out-of-memory issue is due to gRPC's allocation for handling concurrent communication with the client/worker nodes. Once communication is complete, gRPC doesn't release the memory, which results in excessive memory utilization. To that end, would you please try reducing "num_server_workers" to 5 or so? I believe it's currently set to 100.
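If it helps, the change is a one-line edit in the server-side federated learning config. A minimal sketch, assuming the file is config/config_fed_server.json with a top-level num_server_workers key as in the FL demo (verify the exact file and key against your MMAR):

```json
{
  "num_server_workers": 5
}
```

Leave the rest of the server config unchanged and restart the server for the new worker count to take effect.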

Hope this helps.


Reducing num_server_workers does seem to help with the amount of memory used. Memory usage still increases every federated learning round, but clearly less than before, when num_server_workers was set to 100. We can now run more federated learning rounds. Thanks for all the help.