Federated learning - Client training fails due to CUDA_ERROR_OUT_OF_MEMORY: out of memory

Hi all,

We’ve used the following MMAR from NGC to set up Clara FL 3.1 with 1 server and 2 clients:

(https://ngc.nvidia.com/catalog/models/nvidia:med:clara_mri_fed_learning_seg_brain_tumors_br16_t1c2tc_no_amp)

Admin - Sequence of events:

> check_status server
FL run number has not been set.
FL server status: training not started
Registered clients: 2 
------------------------------------------------------------------------
| CLIENT NAME | TOKEN                                | SUBMITTED MODEL |
------------------------------------------------------------------------
| org1-a      | 53ec6b98-fb45-4a3a-ad41-e3231c95474c |                 |
| org1-b      | e5888563-e087-40d8-b0e6-223a69f0d7e9 |                 |
------------------------------------------------------------------------
Done [8611 usecs] 2020-12-15 05:38:52.741508

> check_status client
instance:org1-a : client name: org1-a	token: 53ec6b98-fb45-4a3a-ad41-e3231c95474c	status: training not started
instance:org1-b : client name: org1-b	token: e5888563-e087-40d8-b0e6-223a69f0d7e9	status: training not started

> upload_folder ../../../adminMMAR

Created folder /workspace/startup/../transfer/adminMMAR

Done [1317294 usecs] 2020-12-15 05:38:54.651317

> set_run_number 16

Create a new run folder: run_16

Done [5730 usecs] 2020-12-15 05:38:54.670892

> deploy adminMMAR server

mmar_server has been deployed.

Done [25831 usecs] 2020-12-15 05:38:54.697064

> deploy adminMMAR client

instance:org1-a : MMAR deployed.

instance:org1-b : MMAR deployed.

Done [620151 usecs] 2020-12-15 05:38:55.317637

> start server

Server training is starting....

Done [10200 usecs] 2020-12-15 05:38:56.651029

> start client

instance:org1-a : Start the client...

instance:org1-b : Start the client...

Server - Sequence of events:

* 2 clients have joined the session
* Server Training has been started
* Get Model is requesting models from each client

Client - Sequence of events:

On running Epoch 1/5:

Epoch: 1/5, mean_dice_tc: 0.0000 val_time: 0.00s

SupervisedFitter - INFO - New best val metric: 0
SupervisedFitter - INFO - Saving model checkpoint at: /workspace/startup/../run_16/mmar_org1-b/models/model.ckpt
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
I tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to allocate 4.71G (5063421440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to allocate 4.24G (4557079040 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

config_fed_server.json

{
    "servers": [
        {
            "name": "brats_segmentation",
            "service": {
                "target": "localhost:8002",
                "options": [
                    ["grpc.max_send_message_length",    1000000000],
                    ["grpc.max_receive_message_length", 1000000000]
                ]
            },
            "ssl_private_key": "resources/certs/server.key",
            "ssl_cert": "resources/certs/server.crt",
            "ssl_root_cert": "resources/certs/rootCA.pem",
            "min_num_clients": 2,
            "max_num_clients": 100,
            "wait_after_min_clients": 10,
            "heart_beat_timeout": 600,
            "start_round": 0,
            "num_rounds": 2,
            "exclude_vars": "dummy",
            "num_server_workers": 100,
            "validate_on_server": false,
            "num_rounds_per_valid": 1,
          "compression": "Gzip"
        }
    ],
    "aggregator":
      {
        "name": "ModelAggregator",
        "args": {
          "exclude_vars": "dummy",
          "aggregation_weights":
              {
                "client0": 1,
                "client1": 1.5,
                "client2": 0.8
              }
        }
      },
    "pre_processors": [
      {
        "name": "ModelEncryptor",
        "args": {}
      },
      {
        "name": "DataCompressor",
        "args": {}
      }
    ],
    "post_processors": [
      {
        "name": "DataDeCompressor",
        "args": {}
      },
      {
        "name": "ModelDecryptor",
        "args": {}
      }
    ],
    "model_saver":
    {
        "name": "TFModelSaver",
        "args": {
          "exclude_vars": "dummy"
        }
    }
}

Given this behaviour, we have the following questions:

  1. What changes have to be made to the config_fed_server.json, as the one provided by default causes client dropout?

We swapped in the spleen model’s config_fed_server configuration file (which is used as a demo case in the documentation), and that did solve the “client dropout” issue. However, the brain model configuration caused CUDA out-of-memory errors on two 16 GB GPUs, probably because the pre- and post-processing modules specified in the spleen model were not meant to be used with the brain tumor model. As the brain tumor model is closer to our use case, it would be good to know how its MMAR should be modified to resolve the client dropout issue.

  2. Is there any documentation on all the pre- and post-processing modules that users can include in the MMAR configuration? For example, in this document, where can we find details of ModelEncryptor, TrainingCommandModule, ValidateResultProcessor, and potentially other modules for transforming and scaling the input?

  3. Is there a way to do GPU memory management with Clara FL? For example, with TensorFlow we can constrain training or inference to use only a certain percentage of the GPU memory capacity.

  4. After each training run, does Clara FL store both the “locally trained model” and the “globally improved model” (i.e. the model after aggregation of the clients’ weights)? If yes, which files are they?

Thank you for your help and assistance.

Best Regards,
Siddharth

  1. What changes have to be made to the config_fed_server.json, as the one provided by default causes client dropout?

It looks like you are running the 2 clients on the same GPU, which caused the OOM error. You can set the CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES environment variables to control which client runs on which GPU.
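
As a minimal sketch (not an official Clara FL snippet; the GPU indices and the use of os.environ here are my own example), this is how a client process can be pinned to a single GPU by setting CUDA_VISIBLE_DEVICES before any CUDA library is initialized. In a real Clara FL deployment you would typically export the same variables in the client’s environment (e.g. in docker run or the client start script) before launching it.

import os

# Example values: this client sees only GPU 0; the second client would use
# "1" instead, so the two trainings do not share one 16 GB card.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["NVIDIA_VISIBLE_DEVICES"] = "0"

# Import TensorFlow only after the variables are set, so the process
# enumerates just the visible GPU.
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))  # expect a single GPU entry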

  2. Is there any documentation on all the pre- and post-processing modules that users can include in the MMAR configuration? For example, in this document, where can we find details of ModelEncryptor, TrainingCommandModule, ValidateResultProcessor, and potentially other modules for transforming and scaling the input?

Here’s the DataProcessor API. All processors just need to implement this API, and you can then plug them in as BYOC (bring your own components).

class DataProcessor(object):

    def process(self, data_ctx, app_ctx):
        """
        Called to perform data processing
        :param data_ctx: the context that contains the data points transformed/generated
        :param app_ctx: the overall app context (e.g. current phase, round, task, ...)
        :return:
        """
        pass

    def abort(self, app_ctx):
        """
        Called to abort its work immediately
        :param app_ctx:
        :return:
        """
        pass
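
For illustration, here is a hypothetical processor written against the process()/abort() interface above (the LoggingProcessor name and its behaviour are my own example, not part of any MMAR); a real BYOC processor would transform or scale the data carried in data_ctx rather than just logging it.

class LoggingProcessor(DataProcessor):
    # Assumes the DataProcessor base class shown above is in scope.

    def process(self, data_ctx, app_ctx):
        # Inspect the incoming data context without modifying it.
        print("LoggingProcessor: data_ctx=%r, app_ctx=%r" % (data_ctx, app_ctx))

    def abort(self, app_ctx):
        # Nothing to clean up for a stateless processor.
        pass
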
  3. Is there a way to do GPU memory management with Clara FL? For example, with TensorFlow we can constrain training or inference to use only a certain percentage of the GPU memory capacity.

We don’t have a way to limit the percentage of GPU memory that can be used by a particular FL client. However, you can use the CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES environment variables to choose which client runs on which GPU.
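
That said, if you can modify the client training code, the generic TensorFlow 2.x APIs can bound GPU memory inside the training process. This is a hedged sketch of standard TensorFlow behaviour, not a Clara FL configuration option, and the 8 GB cap is just an example value.

import tensorflow as tf

# Both options must be applied before any GPU has been initialized.
gpus = tf.config.list_physical_devices("GPU")

# Option 1: allocate GPU memory on demand instead of reserving the whole card.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Option 2 (alternative to option 1): hard-cap the process, e.g. at 8 GB on a
# 16 GB card, by creating a logical device with a memory limit (recent TF 2.x;
# older releases use the tf.config.experimental variant of the same call).
# if gpus:
#     tf.config.set_logical_device_configuration(
#         gpus[0],
#         [tf.config.LogicalDeviceConfiguration(memory_limit=8192)])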
