TAO5 - Detectnet_v2 - MultiGPU TAO API Stuck

Got it. Thanks for the finding. We will check this parameter.

I tried to include the “wandb” information. The login to the platform works, but at the beginning of the training the result is the same.
Could there be any problem with “tensorboard” inside the pod? Is it installed?
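
A quick way to check that from outside the pod (the pod name is the same placeholder as in the command further below; these are plain kubectl/pip calls, nothing TAO-specific, assuming pip and python3 are on PATH inside the container):

$ kubectl exec -it the-training-pod-name -- pip show tensorboard
$ kubectl exec -it the-training-pod-name -- python3 -c "import tensorboard; print('tensorboard import OK')"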

Reading the instructions for “wandb” usage, the docker container needs to be launched with some parametrization.
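
For reference, a minimal sketch of that kind of parametrization for a standalone docker launch (the API key value is a placeholder; wandb only needs WANDB_API_KEY to be present in the container environment, and for the TAO API the key has to reach the training pod in an equivalent way):

$ docker run --runtime=nvidia -it --rm \
    -e WANDB_API_KEY=<your-wandb-api-key> \
    nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash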

Here I include the pod information (kubectl get pods the-training-pod-name -o yaml > training.yaml) for the pod using wandb:
training.yaml (5.2 KB)

And this is the information included in the Train spec:

        "visualizer": {
            "enabled": true,
            "wandb_config": {
                "entity": "alejandro-granda",
                "name": "TAO_test",
                "notes": "short description of experiment",
                "project": "TAO_test",
                "tags": "training_Test_wandb"

And this is the log of the POD:
dd219215-f027-46cf-95cf-db300331633f.txt (141.7 KB)

Thanks for the yaml file.

Should be the same issue. We will also check it.

Could you help to check in the debug pod whether the issue happens when this parameter is ON?
Thanks.

The issue is reproduced; it is the same as with the API.

Attached is the config used:

  visualizer {
    enabled: True
    infrequent_logging_frequency: 5
    num_images: 3
  }

Attached log:
log_tao5_train_visualizer.txt (129.7 KB)

Could you please share the training spec file? Thanks.

detectnet_v2_train_peoplenet_kitti_multi_kubernetes.txt (9.2 KB)

Thanks.

I used a machine (2 GPUs) but still cannot reproduce this issue with the debug pod. I share my experiment file.
api_notebook_spec.txt (8.7 KB)

For your case, could you please check whether the steps below fix the hang issue?
Please run them in the debug pod. The yaml file is the same as the previous one; I just paste it here again.

$ cat debug.yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug
spec:
  restartPolicy: OnFailure
  containers:
  - name: "detectnetv2"
    image: "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5"
    command: ["sleep"]
    args: ["3600000"]
    resources:
      limits:
         nvidia.com/gpu: 2
    volumeMounts:
    - name: "my-workdir"
      mountPath: "/my-workdir"
  volumes:
  - name: my-workdir
    hostPath:
      path: /localhome/local-morganh/

Then, trigger the debug pod.
$ kubectl apply -f debug.yaml

Then enter the pod.

$ kubectl exec -it debug -- /bin/bash

Steps:

  1. Please make sure you can reproduce the issue.
  2. Then, back up the code and modify it as shown below.

$ cp /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py.bak
$ vim /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py

 651     #if not is_master:
 652     #    visualizer_config.enabled = False

$ cp /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/tfhooks/utils.py /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/tfhooks/utils.py.bak
$ vim /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/tfhooks/utils.py

 86     # Save checkpoints only on master to prevent other workers from corrupting them.
 87     #if distribution.get_distributor().is_master():
 88     if True:
 89         step_counter_hook = tf.estimator.StepCounterHook(
 90             every_n_steps=summary_every_n_steps,
 91             output_dir=checkpoint_dir
 92         )
 93         hooks.append(step_counter_hook)
 94         #Debug
 95         rank=distribution.get_distributor().rank()
 96         checkpoint_dir=checkpoint_dir+"_"+str(rank)
 97         print(">>> create new folder for {} as {}".format(rank, checkpoint_dir))
 98         import os
 99         if not os.path.exists(checkpoint_dir):
100             os.makedirs(checkpoint_dir)
101         for l in listeners:
102             l._checkpoint_dir = checkpoint_dir
103
104         if checkpoint_dir is not None:
105             if listeners is None:
106                 listeners = []

$ detectnet_v2 train -e /my-workdir/experiment_spec.txt -r /my-workdir/result -k key --gpus 2

I’m on that.

Can you try with a resnet34 and a custom pretrained network (in my case peoplenet)?

All the initial problems that we had were related to differences in the format of the base pretrained network.

When I have the results, I will ping you.

Same result with both, i.e. without and with the modification:

Confirm the failure:
tao5_visualizer_confirm_failure.log (131.5 KB)

Added modifications:
tao5_visualizer_modifications.log (138.6 KB)

Files in the experiment_dir (screenshots attached for the 0_ and 1_ folders).

Thanks for your result. I will check the .tlt pretrained model as well.

On your side, could you run with a smaller dataset? You are currently using a total of 47494 training images. To narrow down, could you train with only a few tfrecords files?

For example, if you have

$ ls 30d6a2e1-7b13-48c1-b25f-efb8e240a823/tfrecords/
tfrecords-fold-000-of-002-shard-00000-of-00010  
tfrecords-fold-000-of-002-shard-00001-of-00010  
tfrecords-fold-000-of-002-shard-00002-of-00010  
tfrecords-fold-000-of-002-shard-00003-of-00010
tfrecords-fold-000-of-002-shard-00004-of-00010  
tfrecords-fold-000-of-002-shard-00005-of-00010  

Then please back up this folder and keep only two tfrecord files (a rough command sketch follows the listing below).

$ ls 30d6a2e1-7b13-48c1-b25f-efb8e240a823/tfrecords/
tfrecords-fold-000-of-002-shard-00000-of-00010  
tfrecords-fold-000-of-002-shard-00001-of-00010  
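
Roughly, the backup step could look like this (the dataset directory follows the example listing above and is a placeholder for your real path):

$ cd 30d6a2e1-7b13-48c1-b25f-efb8e240a823/
$ mv tfrecords tfrecords_backup
$ mkdir tfrecords
$ cp tfrecords_backup/tfrecords-fold-000-of-002-shard-0000[01]-of-00010 tfrecords/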

Also, set the same in validation_data_source.
For example,

  data_sources {
    tfrecords_path: "/30d6a2e1-7b13-48c1-b25f-efb8e240a823/tfrecords/*"
    image_directory_path: "/30d6a2e1-7b13-48c1-b25f-efb8e240a823/"
  }
  validation_data_source {
    tfrecords_path: "/30d6a2e1-7b13-48c1-b25f-efb8e240a823/tfrecords/*"
    image_directory_path: "/30d6a2e1-7b13-48c1-b25f-efb8e240a823/"
  }

In addition, please set a lower batch size: batch_size_per_gpu: 4

Also, what is the resolution of your training images? Are they all the same resolution? I see enable_auto_resize: true in your training spec file.
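
If helpful, one way to list the distinct resolutions in the training set (the image path/pattern is a placeholder, and this assumes Pillow is available, as it normally is inside the TAO container):

$ python3 -c "from PIL import Image; import glob, collections; print(collections.Counter(Image.open(p).size for p in glob.glob('/path/to/training/images/*.png')))"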

Lastly, if the issue still happens, please wait for a while to check whether it can resume. Thanks.

I’ll try it when I return from a brief rest.

So, when the whole dataset is used, we can’t use AutoML?

No, the experiment is just to narrow down whether the dataset size results in the issue when the visualizer is enabled.
We need to figure out the gap between our setups, since I cannot reproduce the issue yet.

I just tried to use the same .tlt model as yours; the issue is still not reproduced.

model_config {
  pretrained_model_file: "/my-workdir/resnet34_peoplenet.tlt"
  num_layers: 34
  freeze_blocks: [0]
  use_batch_norm: true
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  arch: "resnet"
}

Is it possible to share a minimal dataset that reproduces the issue on your side?

Also, where are the training dataset and validation dataset located? Are they local or on a remote machine? If they are mounted from a remote machine, could you please put them locally instead and retry? This lets us rule out network bandwidth or speed issues.
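
A quick way to check whether the dataset path is a network mount (the path is a placeholder):

$ df -Th /path/to/dataset      # "nfs"/"nfs4" in the Type column indicates a network mount
$ mount | grep -i nfs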

To narrow down further, we are going to save only the loss in tensorboard. Please help run the test below.

Please run it in the debug pod. The yaml file is the same as the previous one; I just paste it here again.

$ cat debug.yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug
spec:
  restartPolicy: OnFailure
  containers:
  - name: "detectnetv2"
    image: "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5"
    command: ["sleep"]
    args: ["3600000"]
    resources:
      limits:
         nvidia.com/gpu: 2
    volumeMounts:
    - name: "my-workdir"
      mountPath: "/my-workdir"
  volumes:
  - name: my-workdir
    hostPath:
      path: /localhome/local-morganh/

Then, trigger the debug pod.
$ kubectl apply -f debug.yaml

Then enter the pod.

$ kubectl exec -it debug -- /bin/bash

Steps:

  1. Please make sure you can reproduce the issue.
  2. Back up the code:
cp /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py  /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py.bak
cp /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/model/detectnet_model.py /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/model/detectnet_model.py.bak
cp /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/tfhooks/validation_hook.py /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/tfhooks/validation_hook.py.bak
cp /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/tfhooks/utils.py /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/tfhooks/utils.py.bak
  3. Replace the code. Please use the files below.
    detectnet_model.py (37.6 KB)
    train.py (49.3 KB)
    utils.py (7.5 KB)
    validation_hook.py (5.2 KB)

  4. Make sure the following is set in the training spec file (a minimal spec sketch follows these steps):
    num_images in visualizer should be 0.
    checkpoint_interval should be the same as num_epochs in training_config.

  5. Run training.
    $ detectnet_v2 train -e /my-workdir/experiment_spec.txt -r /my-workdir/result -k key --gpus 2
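
For reference, a minimal sketch of how step 4 could look in the spec, assuming the usual detectnet_v2 layout where visualizer sits under training_config (the num_epochs value is only illustrative):

  training_config {
    num_epochs: 120
    checkpoint_interval: 120   # same value as num_epochs
    visualizer {
      enabled: true
      num_images: 0
    }
  }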

I’m back. Thanks for your answers; I will try to test all the steps during the day.

Yes, I always apply a resize augmentation to the whole dataset to avoid this kind of problem.

They are in the NFS folder structure, but stored locally.

To avoid making too many changes to the dataset, I started with the original dataset and implemented the suggested changes. The results are the same. Attached log:

tao5_visualizer_modificated_files.log (134.7 KB)

I will try to continue with the previously proposed steps to reduce the dataset and the batch size.

I tried the last suggestions in different steps:
Modified files + low batch:
tao5_visualizer_modificated_files_low_batch.log (129.0 KB)

Modified files + low batch + reduced tfrecords (only 2 for evaluation and 2 for training):
tao5_visualizer_modificated_files_low_batch_reducedTfrecords.log (141.9 KB)

Modified files + low batch + reduced tfrecords + no auto resize:
tao5_visualizer_modificated_files_low_batch_reducedTfrecords_NoAutoResize.log (126.6 KB)

Thanks for your detailed info.
For the above setting, could you run without amp?
i.e.,

root@debug:/workspace# detectnet_v2 train -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi_kubernetes.txt -r /workspace/tao-experiments/multilabel_test_TAO5/experiment_dir_unpruned_visualizer_lowbatch_reducedTfrecord_NoAutoResize/ -n resnet34_detector --gpus 2 -k tlt_encode