I would like to simulate federated learning with the Clara Train SDK to run some experiments. Because I only have one GPU available, the clients need to share it. This means the individual clients have to train one after the other, and each client has to free all of its reserved GPU memory once its train_step is done.
I am not sure whether this is achievable with Clara Train. Which modifications would I need to make?
fed_client.py and client_model_manager.py look like promising candidates, but before trying to implement it I would love to get some tips from someone who knows the full code base on whether this can work.
Thank you for any tips.
Thank you for your interest in Clara Train. The current Federated Learning implementation doesn’t support this setup as the GPU resources don’t get released between training rounds.
That being said, if your GPU has enough memory and your network and dataset are small, the clients can share the same GPU.
Thank you very much for your help @aquraini !
I understand that the framework does not provide this capability out of the box, but are you sure that it is not possible to implement on top of it?
I imagine it could be possible by adapting the code so that the clients each do a local epoch/step using the full GPU in a round-robin fashion and upload their weights/deltas to the server. After each round the server computes the average and delivers the new model to the clients. To make sure that the GPU is only used by one client at a time, one would need the following:
- some waiting mechanism has to be implemented
- the GPU has to be freed by each client after each local epoch
- each client must restore the model on the GPU once it has received the new common model from the server
Looking at the code, it is not completely clear to me why such an approach cannot be implemented. Could you elaborate?
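To make the idea concrete, here is a minimal sketch of the round-robin scheme in plain Python. It is not Clara Train code: `gpu_lock`, `local_step`, `client_round`, `fed_avg` and `simulate` are hypothetical names, the "model" is just a list of floats, and the "training" is a toy update. It only illustrates the control flow (wait, restore, train, free, average).

```python
import threading
from typing import List

Weights = List[float]

# Hypothetical stand-in for "only one client may use the GPU at a time".
# It only matters if the clients run in separate threads.
gpu_lock = threading.Lock()


def local_step(weights: Weights, data_mean: float, lr: float = 0.5) -> Weights:
    """Toy local training: move every weight toward the client's data mean."""
    return [w + lr * (data_mean - w) for w in weights]


def client_round(global_weights: Weights, data_mean: float) -> Weights:
    """One local step of one client while it holds the GPU exclusively."""
    with gpu_lock:  # waiting mechanism: block until the GPU is free
        # Restore the common model (here: simply copy the weights).
        local = list(global_weights)
        local = local_step(local, data_mean)
        # Assumption: in a real setup the GPU memory would have to be
        # released here, before the lock is given up.
    return local


def fed_avg(client_weights: List[Weights]) -> Weights:
    """Server side: unweighted average of the clients' weights."""
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]


def simulate(rounds: int, client_data: List[float], init: Weights) -> Weights:
    """Run the clients one after the other, then average, for several rounds."""
    weights = list(init)
    for _ in range(rounds):
        updates = [client_round(weights, d) for d in client_data]
        weights = fed_avg(updates)
    return weights


if __name__ == "__main__":
    print(simulate(rounds=2, client_data=[0.0, 2.0], init=[0.0]))
```

The open question is the commented step inside `client_round`: whether the client and fitter can actually give the GPU memory back at that point.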
Thanks for your follow-up. However, the fundamental issue is that the FL client and fitter do not release the GPU resources between the different rounds of FL training. You are welcome to navigate the code and experiment to see if you can come up with a workable solution. We would love to hear if you have any success.