TAO action recognition net training extremely slow

I am trying to fine-tune a pre-trained ActionRecognitionNet model on my custom dataset through TAO. However, the fine-tuning process is extremely slow (>30 minutes per epoch). I also observed that GPU utilization is not constant and frequently idles at 0%. I would greatly appreciate any insights, suggestions, or assistance that the community can provide regarding this matter. Thank you in advance for your time and support.
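For reference, I monitored GPU utilization on the host while the training container was running; a minimal way to reproduce this observation with standard nvidia-smi commands (the 1-second sampling interval is just an example):

# Print per-GPU SM utilization once per second while training runs
nvidia-smi dmon -s u -d 1
# or simply refresh the full nvidia-smi view every second
watch -n 1 nvidia-smi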

• Hardware - GPU: RTX 3090 (24 GB), CPU: Intel i9, RAM: 64 GB
• Network Type - ActionRecognitionNet
I am launching the training via the following command:

docker run --rm --runtime=nvidia --gpus all --ipc=host --privileged --ulimit memlock=-1 --ulimit stack=67108864 -v /mnt/sdb/harry_files/share/actionRecognitionNet:/shared_volume -v /mnt/sdb/harry_files/share/datasets/tao_associates/data:/dataset 610ffeb5262c  action_recognition train -e /shared_volume/specs/finetune_mbd_speedtest.yaml  -r /shared_volume/results/a101_exp13 -k nvidia_tao model_config.rgb_pretrained_model_path=/shared_volume/models/pretrained/resnet18_2d_rgb_hmdb5_32.tlt model_config.rgb_pretrained_num_classes=5

• Training spec file:

output_dir: /shared_volume/results/a101_exp11
encryption_key: nvidia_tao
model_config:
  model_type: rgb
  backbone: resnet18
  rgb_seq_length: 32
  input_type: 2d
  sample_strategy: consecutive
  dropout_ratio: 0.0
train_config:
  optim:
    lr: 0.001
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [5, 15, 20]
    lr_decay: 0.1
  epochs: 100
  checkpoint_interval: 5
dataset_config:
  train_dataset_dir: /dataset/train
  val_dataset_dir: /dataset/val
  label_map:
    attach_cabin: 0
    attach_wheel: 1
    screw_chassis: 2
  output_shape:
  - 224
  - 224
  batch_size: 32
  workers: 8
  clips_per_video: 15
  augmentation_config:
    train_crop_type: no_crop
    horizontal_flip_prob: 0.5
    rgb_input_mean: [0.5]
    rgb_input_std: [0.5]
    val_center_crop: False

Did you ever try the official notebook to check if there is the same issue?

Hi Morganh

Earlier, I tried fine-tuning a model using the steps outlined in the official demo provided by NVIDIA (Developing and Deploying Your Custom Action Recognition Application Without Any AI Expertise Using NVIDIA TAO and NVIDIA DeepStream | NVIDIA Technical Blog). However, instead of executing the cells within the official notebook, I directly invoked the containers as described in this guide (Working With the Containers - NVIDIA Docs).

The demo fine-tuning was significantly faster, completing the run within minutes, and there were no issues. Please let me know if there are any other details you would like me to share. Thank you in advance.

Could you increase rgb_seq_length and batch_size?
For example,
rgb_seq_length: 64
batch_size: 64

Also, please increase workers as well. For example,
workers: 32

I have increased rgb_seq_length to 64, batch_size to 64, and workers to 32. However, there is still no improvement in the training speed; it remains significantly slow. Do you think the data preprocessing/loading within TAO is causing a bottleneck?
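For what it's worth, a rough way to sanity-check the raw frame-read side on my setup would be something like the sketch below (it assumes the dataset stores pre-extracted frames as .png/.jpg images under the train directory mounted at /dataset inside the container; paths and extensions may need adjusting):

# Time reading a fixed sample of frames straight from the dataset mount.
# If this alone is slow, the loader is likely I/O-bound rather than GPU-bound.
find /dataset/train -type f \( -name '*.png' -o -name '*.jpg' \) | head -n 2000 > /tmp/sample_frames.txt
time xargs -a /tmp/sample_frames.txt cat > /dev/null
# For cold-cache numbers, drop the page cache first (on the host):
#   sync; echo 3 | sudo tee /proc/sys/vm/drop_caches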

Could you share the full command and full training log?

full command:

docker run --rm --runtime=nvidia --gpus all --ipc=host --privileged --ulimit memlock=-1 --ulimit stack=67108864 -v /mnt/sdb/harry_files/share/actionRecognitionNet:/shared_volume -v /mnt/sdc2/ADAS_SW/Harry/datasets/assembly101/mini_split5:/dataset 610ffeb5262c  action_recognition train -e /shared_volume/specs/train_rgb_2d_finetune_test.yaml  -r /shared_volume/results/a101_exp_debug -k nvidia_tao model_config.rgb_pretrained_model_path=/shared_volume/models/pretrained/resnet18_2d_rgb_hmdb5_32.tlt model_config.rgb_pretrained_num_classes=5 > trainlogs_debug_forums.txt


Training log:
trainlogs_debug_forums.txt (46.4 KB)

Can you share the output of $ nvidia-smi on your host machine?

Attaching the output of $ nvidia-smi for your reference:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 77%   60C    P8    34W / 350W |   4409MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    416208      G   /usr/lib/xorg/Xorg                120MiB |
|    0   N/A  N/A    416353      G   /usr/bin/gnome-shell               13MiB |
|    0   N/A  N/A    427255      C   python                           4237MiB |
|    0   N/A  N/A    430845      G   ...8/usr/lib/firefox/firefox       11MiB |
+-----------------------------------------------------------------------------+

Please update to the 525 driver and retry.

Uninstall:
sudo apt purge nvidia-driver-470
sudo apt autoremove
sudo apt autoclean

Install:
sudo apt install nvidia-driver-525

Apologies for the delay. I want to update you that I have successfully upgraded the driver from version 470 to version 525. Despite this upgrade, the issue we were facing earlier remains unresolved, and unfortunately, the training process is still excessively slow.

Could you please check with the official notebook whether the issue can be reproduced in your environment? Thanks a lot.

Download the notebook from TAO Toolkit Quick Start Guide - NVIDIA Docs, or from the getting started resource on NGC (TAO Toolkit Getting Started | NVIDIA NGC):

wget --content-disposition 'https://api.ngc.nvidia.com/v2/resources/nvidia/tao/tao-getting-started/versions/4.0.2/files/notebooks/tao_launcher_starter_kit/action_recognition_net/actionrecognitionnet.ipynb'

I am currently working in a server-like setup, which unfortunately doesn't allow me to run training through a notebook. Instead, I directly invoke the training container. I have successfully executed the demo fine-tuning process once again and have attached the log for your reference. The demo ran smoothly, without any problems and at a faster pace. Thank you for your support in this process. logs_demo_tao.txt (95.7 KB)

From logs_demo_tao.txt, it is loading trained weights from /shared_volume/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt. It is running a 3D model.

From trainlogs_debug_forums.txt, it is loading trained weights from /shared_volume/models/pretrained/resnet18_2d_rgb_hmdb5_32.tlt. It is running a 2D model.

Could you please run the 3D model against your own dataset as well? You can also refer to the training spec in the notebook. For example, set lr to 0.01.

Thank you for your support. I wanted to share my findings with you. After reducing rgb_seq_length from 32 to 3 for my dataset, I noticed a significant improvement in training speed. However, I have concerns about whether lowering rgb_seq_length might lead to a loss of important temporal information during inference, particularly for the 2D action recognition task on the assembly101 coarse action dataset.

The assembly101 dataset contains assembly-related actions such as screwing and mounting, where capturing fine-grained actions across time is crucial for accurate recognition. While the reduced rgb_seq_length seems to avoid bottlenecks in the TAO training pipeline, I worry it may compromise the accuracy of the model on this specific dataset.

I would appreciate your guidance, suggestions, or insights on determining an rgb_seq_length that keeps training efficient without sacrificing accuracy, so I can fine-tune the 2D action recognition model effectively for the assembly101 dataset.

Usually, reducing rgb_seq_length from 32 to 3 will result in lower accuracy. But we still need to check the training speed. Could you please do an apples-to-apples comparison using the same model?
You can download train_rgb_2d.yaml and run it against the dataset mentioned in the notebook. To download train_rgb_2d.yaml, please run:

$ wget --content-disposition 'https://api.ngc.nvidia.com/v2/resources/nvidia/tao/tao-getting-started/versions/4.0.2/files/notebooks/tao_launcher_starter_kit/action_recognition_net/specs/train_rgb_2d.yaml'

which is also from
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/resources/tao-getting-started/files?version=4.0.2

You can set the same lr.

Please find attached the logs of the above-mentioned training.
logs_demo_2d_rgb_tao.txt (666.8 KB)

Thanks for the result. So, when you train the HMDB51 dataset with a 2D or 3D model, the training speed is normal.
But when you run training against the assembly101 dataset with the same training parameters (such as rgb_seq_length: 32), the training becomes slow, correct?

Thank you for your continued support. I would like to clarify that during the previous training on the HMDB51 dataset, rgb_seq_length was kept at its original value of 3 from the configuration file. When I increased rgb_seq_length to 32, I observed a considerable reduction in training speed. Now, as I am about to fine-tune the same model on the assembly101 dataset, I am seeking the optimal rgb_seq_length that strikes a balance between achieving high accuracy and not significantly slowing down training. Thank you once again in advance.

For the trade-off between rgb_seq_length and accuracy, I suggest running experiments against part of the dataset to find the optimal value.
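A minimal sketch of such an experiment sweep (assuming a reduced copy of the dataset is mounted at /dataset, and that model_config.rgb_seq_length can be overridden on the command line in the same way as the other overrides used above; please verify the override name against the TAO docs for your version):

# Hypothetical sweep: launch one short fine-tuning run per candidate sequence length
for SEQ in 8 16 32; do
  docker run --rm --runtime=nvidia --gpus all --ipc=host \
    -v /mnt/sdb/harry_files/share/actionRecognitionNet:/shared_volume \
    -v /path/to/assembly101_subset:/dataset \
    610ffeb5262c action_recognition train \
    -e /shared_volume/specs/finetune_mbd_speedtest.yaml \
    -r /shared_volume/results/seq_${SEQ} -k nvidia_tao \
    model_config.rgb_seq_length=${SEQ}
done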
