Why does action recognition 3D accuracy get much worse when rgb_seq_length is set to a value other than 3?

• Hardware (Nano)
JetPack 4.5.1
• Network Type (action recognition 3D)
• TAO Version (3.21.11)
• Training spec file(action_recognition_net/specs/train_rgb_3d_finetune.yaml)
• How to reproduce the issue ?

I downloaded cv_samples_v1.3.0.zip with
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao/cv_samples/versions/v1.3.0/zip -O cv_samples_v1.3.0.zip
and followed the steps in action_recognition_net/actionrecognitionnet.ipynb to download hmdb51_org.rar, then process and split the HMDB51 train/test dataset. I picked the following 10 classes for training and testing:

clap:  0
drink: 1
punch: 2
push:  3
run:   4
shake_hands: 5
sit:   6 
smoke: 7
turn:  8
wave:  9

I then trained the model in actionrecognitionnet.ipynb with the default config action_recognition_net/specs/train_rgb_3d_finetune.yaml (only label_map changed):

!tao action_recognition train \
                  -e $SPECS_DIR/action_recognition_net/specs/train_rgb_3d_finetune.yaml \
                  -r $RESULTS_DIR/rgb_3d_ptm \
                  -k $KEY \
                  model_config.rgb_pretrained_model_path=$RESULTS_DIR/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt \
                  model_config.rgb_pretrained_num_classes=5

I got the following accuracy data:

*******************************
clap          30.0
drink         30.0
punch         70.0
push          86.67
run           83.33
shake_hands   63.33
sit           40.0
smoke         30.0
turn          40.0
wave          16.67
*******************************
Total accuracy: 49.0
Average class accuracy: 49.0

from this evaluation command:

!tao action_recognition evaluate \
                    -e $SPECS_DIR/action_recognition_net/specs/evaluate_rgb.yaml \
                    -k $KEY \
                    model=$RESULTS_DIR/rgb_3d_ptm/rgb_only_model.tlt  \
                    batch_size=1 \
                    test_dataset_dir=$DATA_DIR/test \
                    video_eval_mode=center
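
For reference, the two summary lines in the evaluation output can be reproduced from the per-class numbers. This is a minimal sketch, assuming every class has the same number of test clips. The per-class values are all multiples of 1/30, which suggests 30 clips per class, but that is an inference and not something the tool reports. Under that assumption, total accuracy and average class accuracy coincide. The function names are illustrative.

```python
# Per-class accuracy (%) copied from the evaluation output above.
per_class = {
    "clap": 30.0, "drink": 30.0, "punch": 70.0, "push": 86.67, "run": 83.33,
    "shake_hands": 63.33, "sit": 40.0, "smoke": 30.0, "turn": 40.0, "wave": 16.67,
}


def average_class_accuracy(acc):
    """Unweighted mean of the per-class accuracies."""
    return sum(acc.values()) / len(acc)


def total_accuracy(acc, clips_per_class):
    """Overall accuracy = correct clips / all clips.

    clips_per_class maps class name -> number of test clips (assumed values).
    """
    correct = sum(acc[name] / 100.0 * clips_per_class[name] for name in acc)
    return 100.0 * correct / sum(clips_per_class.values())


balanced = {name: 30 for name in per_class}  # assumption: 30 clips per class
print(round(average_class_accuracy(per_class), 1))
print(round(total_accuracy(per_class, balanced), 1))
```

If the test split were unbalanced, the two numbers would differ, and the average class accuracy would be the fairer summary.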

After I changed rgb_seq_length to 32 in train_rgb_3d_finetune.yaml and trained the model again, I got much worse accuracy:

*******************************
clap          0.0
drink         0.0
punch         0.0
push          6.667
run           100.0
shake_hands   0.0
sit           0.0
smoke         0.0
turn          0.0
wave          0.0
*******************************
Total accuracy: 10.667
Average class accuracy: 10.667

I also trained the model with rgb_seq_length: 32 but without the pretrained model resnet18_3d_rgb_hmdb5_32.tlt:

!tao action_recognition train \
                  -e $SPECS_DIR/action_recognition_net/specs/train_rgb_3d_finetune.yaml \
                  -r $RESULTS_DIR/rgb_3d_ptm \
                  -k $KEY

and again got poor accuracy:

*******************************
clap          0.0
drink         0.0
punch         0.0
push          0.0
run           86.67
shake_hands   0.0
sit           0.0
smoke         0.0
turn          0.0
wave          10.0
*******************************
Total accuracy: 9.667
Average class accuracy: 9.667
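
One way to read these numbers: with 10 balanced classes, uniform random guessing scores about 10%. A total near 10% with almost all of it coming from one class suggests the model is predicting mostly a single class, rather than being slightly degraded. The sketch below encodes that check. The helper names and both thresholds are illustrative assumptions, not part of TAO.

```python
def chance_accuracy(num_classes):
    """Expected accuracy (%) of uniform random guessing on balanced classes."""
    return 100.0 / num_classes


def looks_collapsed(per_class, margin=5.0, dominant=50.0):
    """Heuristic: mean accuracy within `margin` points of chance while at
    least one class scores `dominant` percent or more.

    Both thresholds are arbitrary illustrative values.
    """
    mean = sum(per_class.values()) / len(per_class)
    near_chance = abs(mean - chance_accuracy(len(per_class))) <= margin
    return near_chance and max(per_class.values()) >= dominant


# rgb_seq_length=32, trained without the pretrained model (table above).
seq32_no_ptm = {
    "clap": 0.0, "drink": 0.0, "punch": 0.0, "push": 0.0, "run": 86.67,
    "shake_hands": 0.0, "sit": 0.0, "smoke": 0.0, "turn": 0.0, "wave": 10.0,
}
# Default rgb_seq_length run from the first table.
default_run = {
    "clap": 30.0, "drink": 30.0, "punch": 70.0, "push": 86.67, "run": 83.33,
    "shake_hands": 63.33, "sit": 40.0, "smoke": 30.0, "turn": 40.0, "wave": 16.67,
}

print(chance_accuracy(len(seq32_no_ptm)))
print(looks_collapsed(seq32_no_ptm))
print(looks_collapsed(default_run))
```

A result flagged this way usually points at a pipeline or configuration problem rather than at hyperparameters that merely need tuning.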

I also tried setting rgb_seq_length to 8 and 16 and got similarly poor accuracy.

I checked the images under each class directory, and there are enough images for training with rgb_seq_length=32. Why is the accuracy so poor? Is there any other parameter I need to tune accordingly? Thanks.
train_rgb_3d_finetune.yaml (901 Bytes)

Could you try the following:
- enlarge the batch size
- increase the number of epochs
- increase the learning rate
- add some augmentation

For example,

rgb_seq_length: 32

train_config:
  optim:
    lr: 0.01
    momentum: 0.9
    weight_decay: 0.0005
    lr_steps: [30, 60, 80]
    lr_decay: 0.1
  epochs: 100

  ...
  batch_size: 64

  ...
  augmentation_config:
    train_crop_type: random_crop
    horizontal_flip_prob: 0.5
    rgb_input_mean: [0.5]
    rgb_input_std: [0.5]
    val_center_crop: True
    crop_smaller_edge: 256
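
To see what the optimizer settings above imply over 100 epochs, here is a minimal sketch of a multi-step schedule. It assumes the rate is multiplied by lr_decay once each epoch listed in lr_steps is reached, as in PyTorch's MultiStepLR. The thread does not confirm that TAO applies exactly this rule, and the function name is hypothetical.

```python
def multistep_lr(base_lr, lr_steps, lr_decay, epoch):
    """Learning rate at a 0-based `epoch` under assumed multi-step decay:
    multiply by `lr_decay` for every milestone in `lr_steps` already reached."""
    milestones_passed = sum(1 for step in lr_steps if epoch >= step)
    return base_lr * (lr_decay ** milestones_passed)


# Values taken from the example config: lr 0.01, lr_steps [30, 60, 80], lr_decay 0.1.
for epoch in (0, 29, 30, 60, 80, 99):
    rate = multistep_lr(0.01, [30, 60, 80], 0.1, epoch)
    print(f"epoch {epoch:3d}: lr = {rate:.0e}")
```

Under this assumption the last 20 epochs run at a rate 1000 times smaller than the initial one, so most of the learning happens before epoch 60.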

Hi Morganh,

Following the config you pasted above, I trained actionrecognitionnet again and got the following accuracy data:

*******************************
clap          0.0
drink         0.0
punch         96.67
push          0.0
run           13.33
shake_hands   0.0
sit           0.0
smoke         0.0
turn          0.0
wave          3.333
*******************************
Total accuracy: 11.333
Average class accuracy: 11.333

It is strange that most classes still have zero accuracy.
Since the training and network source code is not available, I cannot tell why. Please let me know if you find anything new.

I’ll also try rgb_seq_length=8 and 16.

As the dataset is large, I cannot upload the 10-class data. You can follow the steps in
actionrecognitionnet.ipynb to download hmdb51_org.rar and extract only the 10 classes with

!unrar x $HOST_DATA_DIR/videos/clap.rar $HOST_DATA_DIR/raw_data
!unrar x $HOST_DATA_DIR/videos/drink.rar $HOST_DATA_DIR/raw_data
!unrar x $HOST_DATA_DIR/videos/punch.rar $HOST_DATA_DIR/raw_data
!unrar x $HOST_DATA_DIR/videos/push.rar $HOST_DATA_DIR/raw_data
!unrar x $HOST_DATA_DIR/videos/run.rar $HOST_DATA_DIR/raw_data
!unrar x $HOST_DATA_DIR/videos/shake_hands.rar $HOST_DATA_DIR/raw_data
!unrar x $HOST_DATA_DIR/videos/sit.rar $HOST_DATA_DIR/raw_data
!unrar x $HOST_DATA_DIR/videos/smoke.rar $HOST_DATA_DIR/raw_data
!unrar x $HOST_DATA_DIR/videos/turn.rar $HOST_DATA_DIR/raw_data
!unrar x $HOST_DATA_DIR/videos/wave.rar $HOST_DATA_DIR/raw_data

then build the 10-class dataset following the processing and splitting steps in actionrecognitionnet.ipynb, and run a training to see the accuracy.
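
The ten extraction commands differ only in the class name, so they can be generated from a list. This sketch only builds the argument lists. The function name and the placeholder fallback path are hypothetical, and it assumes the same videos/ and raw_data/ layout under HOST_DATA_DIR that the commands above use.

```python
import os

CLASSES = ["clap", "drink", "punch", "push", "run",
           "shake_hands", "sit", "smoke", "turn", "wave"]


def unrar_commands(host_data_dir, classes=CLASSES):
    """Build one `unrar x <archive> <dest>` argument list per class."""
    dest = os.path.join(host_data_dir, "raw_data")
    return [
        ["unrar", "x", os.path.join(host_data_dir, "videos", name + ".rar"), dest]
        for name in classes
    ]


if __name__ == "__main__":
    # Falls back to a placeholder path when HOST_DATA_DIR is not set.
    data_dir = os.environ.get("HOST_DATA_DIR", "/path/to/data")
    for cmd in unrar_commands(data_dir):
        # To actually extract, pass `cmd` to subprocess.run(cmd, check=True).
        print(" ".join(cmd))
```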

Thanks for the info. I will check further.

I trained with rgb_seq_length=8 and 16 and got the following accuracy:

rgb_seq_length: 8

*******************************
clap          0.0
drink         0.0
punch         13.33
push          43.33
run           70.0
shake_hands   0.0
sit           0.0
smoke         3.333
turn          10.0
wave          3.333
*******************************

Total accuracy: 14.333
Average class accuracy: 14.333

rgb_seq_length: 16

*******************************
clap          0.0
drink         0.0
punch         0.0
push          60.0
run           10.0
shake_hands   0.0
sit           0.0
smoke         20.0
turn          0.0
wave          13.33
*******************************
Total accuracy: 10.333
Average class accuracy: 10.333

train_rgb_3d_finetune.yaml (961 Bytes)

Could you share the logs from training with

  • rgb_seq_length: 8
  • rgb_seq_length: 16
  • rgb_seq_length: 32

Also, please share your evaluate_rgb.yaml.
Thanks.

Hi Morganh,

Sorry, I’ve just found the cause: most classes had 0 accuracy because I forgot to change rgb_seq_length in evaluate_rgb.yaml. After changing it to match the training spec, the result improved:

*******************************
clap          33.33
drink         26.67
punch         46.67
push          90.0
run           80.0
shake_hands   66.67
sit           43.33
smoke         30.0
turn          26.67
wave          3.333
*******************************
Total accuracy: 44.667
Average class accuracy: 44.667

But some classes still have very low accuracy. Is this expected? The config and the train/evaluate logs are attached. Thanks.

evaluate_rgb.yaml (569 Bytes)
evaluate_rgb_seq_length=16.txt (6.6 KB)
train_with_rgb_seq_length=16.txt (246.5 KB)
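
The mismatch described in this post (training with one rgb_seq_length and evaluating with another) can be caught before evaluation by comparing the two spec files. The sketch below is a minimal, dependency-free check that reads the value with a regular expression instead of a YAML parser. The function names are hypothetical, and the two spec fragments are illustrative stand-ins, not the attached files.

```python
import re


def read_seq_length(spec_text, key="rgb_seq_length"):
    """Return the integer value of `key` found in YAML spec text, or None."""
    pattern = rf"^\s*{re.escape(key)}\s*:\s*(\d+)\s*(?:#.*)?$"
    match = re.search(pattern, spec_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def check_specs(train_text, eval_text):
    """Raise ValueError when the two specs disagree on rgb_seq_length."""
    train_len = read_seq_length(train_text)
    eval_len = read_seq_length(eval_text)
    if train_len != eval_len:
        raise ValueError(
            f"rgb_seq_length mismatch: train={train_len}, evaluate={eval_len}"
        )
    return train_len


# Illustrative fragments only; real specs contain more keys.
train_spec = "model_config:\n  model_type: rgb\n  rgb_seq_length: 32\n"
eval_spec = "model_config:\n  model_type: rgb\n  rgb_seq_length: 3\n"

try:
    check_specs(train_spec, eval_spec)
except ValueError as err:
    print(err)
```

In practice the two strings would come from reading train_rgb_3d_finetune.yaml and evaluate_rgb.yaml with open(...).read().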

Some classes have better accuracy than others. Could you train with all classes instead of 10?

OK, I’ll do it

Hi Morganh,

I trained actionrecognitionnet with the whole HMDB51 dataset and got the following accuracy:

*******************************
brush_hair    26.67
cartwheel     6.667
catch         70.0
chew          40.0
clap          30.0
climb         56.67
climb_stairs  20.0
dive          56.67
draw_sword    33.33
dribble       53.33
drink         16.67
eat           6.667
fall_floor    16.67
fencing       43.33
flic_flac     30.0
golf          83.33
handstand     30.0
hit           3.333
hug           63.33
jump          13.33
kick_ball     16.67
kick          10.0
kiss          86.67
laugh         33.33
pick          3.333
pour          66.67
pullup        86.67
punch         0.0
push          70.0
pushup        46.67
ride_bike     86.67
ride_horse    33.33
run           50.0
shake_hands   36.67
shoot_ball    36.67
shoot_bow     83.33
shoot_gun     10.0
sit           23.33
situp         53.33
smile         16.67
smoke         10.0
somersault    40.0
stand         20.0
swing_baseball 10.0
sword_exercise 10.0
sword         20.0
talk          63.33
throw         6.667
turn          3.333
walk          23.33
wave          0.0
*******************************
Total accuracy: 34.444
Average class accuracy: 34.444

train_rgb_3d_finetune.yaml is attached.
train_rgb_3d_finetune.yaml (1.8 KB)

Is this result close to your training result? Thanks.

I do not have these statistics on hand. Regarding your original question, could you check whether a smaller rgb_seq_length gives a better result?
Also, after syncing with the internal team: increasing rgb_seq_length may not have a positive effect.

OK, thanks, I’ll try rgb_seq_length=8. The actions in our own dataset span longer frame sequences: generally 16 or 32 frames, and at least 8.

Hi Morganh,

Sorry for the delay; I had more urgent work to finish.
I finished training actionrecognitionnet on the whole HMDB51 dataset with rgb_seq_length=8. The evaluation result is worse than the one with rgb_seq_length=32:

*******************************
brush_hair    43.33
cartwheel     3.333
catch         43.33
chew          20.0
clap          10.0
climb         60.0
climb_stairs  20.0
dive          40.0
draw_sword    40.0
dribble       46.67
drink         16.67
eat           13.33
fall_floor    13.33
fencing       13.33
flic_flac     20.0
golf          83.33
handstand     20.0
hit           3.333
hug           56.67
jump          6.667
kick_ball     20.0
kick          6.667
kiss          66.67
laugh         23.33
pick          0.0
pour          63.33
pullup        83.33
punch         0.0
push          33.33
pushup        40.0
ride_bike     73.33
ride_horse    33.33
run           16.67
shake_hands   33.33
shoot_ball    33.33
shoot_bow     83.33
shoot_gun     10.0
sit           26.67
situp         36.67
smile         36.67
smoke         13.33
somersault    30.0
stand         6.667
swing_baseball 6.667
sword_exercise 0.0
sword         23.33
talk          70.0
throw         3.333
turn          13.33
walk          23.33
wave          0.0
*******************************
Total accuracy: 29.085
Average class accuracy: 29.085
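
To compare runs like these class by class, the evaluation table can be parsed into a dictionary. This is a minimal sketch with a hypothetical function name. It assumes each data row is a class name made of letters and underscores followed by a number, and it tolerates a long class name that leaves no space before the number.

```python
import re

# Class name (letters/underscores), optional whitespace, then a number.
ROW = re.compile(r"^([A-Za-z_]+)\s*(\d+(?:\.\d+)?)$")


def parse_accuracy_table(text):
    """Parse `class accuracy` rows into {class_name: accuracy_percent}.

    Separator lines and the summary lines do not match ROW and are skipped.
    """
    result = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            result[match.group(1)] = float(match.group(2))
    return result


# A few rows from the rgb_seq_length=8 evaluation, one per line.
sample = "\n".join([
    "*******************************",
    "golf          83.33",
    "swing_baseball6.667",
    "sword_exercise0.0",
    "wave          0.0",
    "*******************************",
    "Total accuracy: 29.085",
])

table = parse_accuracy_table(sample)
print(table)
```

With two such dictionaries, a per-class difference such as `{k: a[k] - b[k] for k in a}` shows which actions gain or lose from a different sequence length.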

Thanks for the experiment. So we cannot conclude that a larger rgb_seq_length gives a worse result.