Clarification about ReIdentificatioNet pre-training

Hi,

For ReIdentificationNet, a pre-trained model is available in the NGC catalog as trainable_v1.1: resnet50_market1501_aicity156.tlt. However, I can’t find much information about the pre-training process.

In this great blog post https://developer.nvidia.com/blog/enhance-multi-camera-tracking-accuracy-by-fine-tuning-ai-models-with-synthetic-data/, they say the network was pre-trained with the SOLIDER technique on a dataset which:

includes a combination of NVIDIA proprietary datasets along with Open Images V5.

However, the NGC page only mentions Market-1501 + synthetic IDs. Therefore, I am wondering: on which dataset was the model pre-trained? Is it a combination of NVIDIA proprietary datasets along with Open Images, or is it Market-1501 + synthetic data?

Also, the blog mentions fine-tuning with 4470 real IDs. Does that mean the model tested there is different from deployable_v1.2 on NGC?

Thank you

The deployable_v1.2 model is trained with 14737 images of 751 real people from the Market-1501 dataset and 29533 images of 156 people (148 of which are synthetic) from the MTMC people tracking dataset of the 2023 AI City Challenge.

The training dataset is mentioned on the NGC model page. Refer to ReIdentificationNet | NVIDIA NGC.

Thanks for the answer. However, it is still not clear to me; your two replies mention the same information.

Therefore:

  • Is deployable_v1.2 fine-tuned from trainable_v1.1, or is it the exported version?
  • If they are the same, what pre-trained model was used to train trainable_v1.1?
  • What about the pre-trained model described in the blog?

Yes, it is the exported version. The models in the model card are trained on Market-1501 + synthetic datasets.

The pre-trained model mentioned in the blog, and in the notebook tao_tutorials/notebooks/tao_launcher_starter_kit/re_identification_net/reidentificationnet_swin.ipynb at main · NVIDIA/tao_tutorials · GitHub, is trained on unlabeled data: ~3M unlabeled image crops of people. ReIdentificationNet Transformer is the network trained on a combination of NVIDIA proprietary datasets along with Open Images V5 (~3M images).

Ok thanks for the reply.

I have another question regarding the training output displayed. For the training, I have:

│ Subset   │   # IDs │   # Images │
╞══════════╪═════════╪════════════╡
│ Train    │     162 │      41712 │

and I have configured:

num_classes: 162
batch_size: 128
val_batch_size: 64
num_workers: 4
num_instances: 4
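As an aside on this configuration: in re-identification training, batch_size and num_instances usually interact through PK sampling, where each batch contains P identities with K images each. A minimal sketch of that arithmetic, assuming the standard PK-sampling convention (I have not verified TAO's exact internals):

```python
batch_size = 128    # total images per batch
num_instances = 4   # K: images sampled per identity

# P: identities represented in each batch under PK sampling;
# batch_size is normally chosen divisible by num_instances
identities_per_batch = batch_size // num_instances
print(identities_per_batch)  # 32
```

So with these settings, each batch would mix 32 identities with 4 crops each, which is what the triplet/metric-learning losses typically expect.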

Therefore, we should have ceil(41712 / 128) = 326 batches per epoch. I am wondering why TAO is showing 570?
And why is each epoch split in two? I thought this was due to num_workers, but even setting it to 1 still shows a split-epoch output.

Thanks

It should be related to the validation dataset. The epoch loop contains two parts: a training part and a validation part.
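To make the arithmetic concrete, here is a minimal sketch of the step count contributed by the training split alone, assuming partial batches are kept (drop_last=False); any steps beyond this in the progress output would come from the validation part of the loop:

```python
import math

train_images = 41712
batch_size = 128

# steps per epoch from the training split, keeping the final partial batch
train_steps = math.ceil(train_images / batch_size)
print(train_steps)  # 326
```

If the progress bar counts validation batches in the same epoch total, the displayed number per epoch will be larger than this training-only figure.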