Creating Robust and Generalizable AI Models with NVIDIA FLARE

Originally published at: Creating Robust and Generalizable AI Models with NVIDIA FLARE | NVIDIA Technical Blog

NVIDIA FLARE v2.0 is an open-source federated learning SDK that is making it easier for data scientists to collaborate on developing more generalizable, robust AI models by sharing model weights rather than private data.

Hello everyone, I recently came across the NVIDIA FLARE framework, and I like it. Unfortunately, when I tried to simulate an experiment with NVIDIA FLARE on Kubernetes (k8s), I ran into a “CrashLoopBackOff” error from k8s, and I could not find any promising results on Google.

Additional Info: I was trying this on my private cluster.

I suspect that the “CrashLoopBackOff” is caused by the FL process being in a sleeping state instead of a running state. To confirm this, I deployed a hello-world Flask app alongside fl_server in the same pod and exposed an extra port for it. This time there was no “CrashLoopBackOff” error, and I deduce that this is because at least one process (the Flask app) is in a running state. I applied the same workaround to the client pods, and it worked there too.

Now the clients authenticate to the server, but I still could not run a federation using FLAdminAPI. The upload_app method returned this response:

{"details": {"message": "Exception: Failed to communicate with Admin Server admin on 3002: [Errno 2] No such file or directory"}}

All of this has been pure trial and error, and I could not find any promising resources that explain how to use NVIDIA FLARE with k8s, so I am hoping someone can point me to some.
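For reference, here is a minimal sketch of the keep-alive Flask app I mean. The file name and port are just placeholders I picked; there is nothing FLARE-specific about them:

```python
# keepalive.py - minimal hello-world Flask app run in the same pod as
# fl_server so that at least one process stays in the running state.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def health():
    # Trivial endpoint; its only job is to keep a foreground process alive.
    return "ok"

if __name__ == "__main__":
    # Bind to all interfaces so the extra containerPort exposed on the pod
    # can reach it; port 5000 is just an example.
    app.run(host="0.0.0.0", port=5000)
```

And this is roughly how I was driving FLAdminAPI when upload_app failed. Please treat it as a sketch only: the constructor arguments, certificate paths, login call, and app name are assumptions I reconstructed from the example notebooks and may not match the exact 2.0 signatures.

```python
# admin_upload.py - rough sketch of the FLAdminAPI calls I was making.
from nvflare.fuel.hci.client.fl_admin_api import FLAdminAPI

api = FLAdminAPI(
    host="fl-server",                 # k8s service name of the FL server (placeholder)
    port=3002,                        # admin port from my provisioning
    ca_cert="startup/rootCA.pem",     # assumed paths inside the admin startup kit
    client_cert="startup/client.crt",
    client_key="startup/client.key",
    upload_dir="transfer",            # folder that should contain the app to upload
    download_dir="download",
)

api.login(username="admin@nvidia.com")  # assumed cert-based admin login
api.set_run_number(1)

# This is the call that returns the "Failed to communicate with Admin Server
# admin on 3002: [Errno 2] No such file or directory" response for me.
reply = api.upload_app(app="hello-pt")
print(reply)
```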