Federated Learning - Training for Sample App Crashing

Hi Team,

Thank you for providing this forum.

Our team has been trying out the Federated Learning framework provided by NVIDIA. We are trying to run the sample spleen segmentation notebook on a GCP compute instance. However, our training crashes after 1 federated round. The instance uses the n1-standard-32 machine type with 2 GPUs, both NVIDIA Tesla P4. We have the following questions:

  • Is this a common problem?

  • Is there a machine type or GPU size that you would recommend?

We have been struggling with this for a week. Your response would be highly appreciated.

Thank you

Hi
Thanks for your interest in Clara Train SDK.

Could you share the error and crash output with us so we can better help you? I assume you have seen the GTC videos for Clara Train FL and the notebooks. If not, please see the links below.

For FL you would need at least 2 clients, each with a GPU with Compute Capability 6.0 or higher. The P4 is good enough. Please see details here. The server doesn’t require any GPU.
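As a rough illustration of the requirement above, here is a minimal sketch that checks a planned set of clients against it. The function name and constants are hypothetical and are not part of the Clara Train SDK; the compute capability of each client GPU is entered by hand as a (major, minor) pair, for example (6, 1) for the Tesla P4.

```python
from typing import List, Tuple

# Thresholds taken from the requirement stated above; names are hypothetical.
MIN_COMPUTE_CAPABILITY = (6, 0)
MIN_CLIENTS = 2


def meets_fl_requirements(client_gpus: List[Tuple[int, int]]) -> bool:
    """Return True if there are at least MIN_CLIENTS clients and every
    client GPU has compute capability >= MIN_COMPUTE_CAPABILITY.

    client_gpus holds one (major, minor) compute capability per client.
    """
    if len(client_gpus) < MIN_CLIENTS:
        return False
    # Tuples compare element by element, so (6, 1) >= (6, 0) is True.
    return all(cc >= MIN_COMPUTE_CAPABILITY for cc in client_gpus)


if __name__ == "__main__":
    # Two clients, each with a Tesla P4 (compute capability 6.1).
    print(meets_fl_requirements([(6, 1), (6, 1)]))
```

Note that this only covers the GPU side. It says nothing about host memory, which has to be checked separately on the instance.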

GTC 2020 Digital talks about the Clara Train SDK

  • S22563 Clara Train Getting Started: Core concepts, Bring your own components (BYOC), AI assisted annotation (AIAA), AutoML
  • S22717 Clara Train Performance: Different aspects of acceleration in Train V3
  • S22564 Clara Developer Day: Federated Learning using Clara Train SDK

Notebooks

Hope that helps

Hi,

I increased the memory size of my GPU instance to “high memory” and it worked fine. Thank you for your help. The videos were really helpful.