Re_identification_net in TAO 5.3.0 does not generate checkpoints

Hi, I recently upgraded my TAO CLI to the following version:

Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.3.0
published_date: 03/14/2024

I am trying to run the reidentificationnet_resnet.ipynb notebook:

tao model re_identification train \
                  -e $SPECS_DIR/experiment_market1501_resnet.yaml \
                  -r $RESULTS_DIR/Pilar_11cam_ReID \
                  -k $KEY

For some reason, training does not generate checkpoints inside the train folder. It correctly creates the train/lightning_logs/version_0 folder, with the hparams.yaml and events.out.tfevents.....server-training files, but not the checkpoints folder. The previous version did not have this problem, but it could not handle the 11 cameras in my custom dataset because of the re.compile(r'([-\d]+)_c(\d)') pattern in the source code, which accepts only one digit after the letter c. Below are the contents of my .tao_mounts.json file and the configuration file I am using.

{
    "Mounts": [
        {
            "source": "/home/minigo/Desktop/TAO_toolkit/tao-getting-started_v5.3.0/Pilar_11cam_ReID",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/minigo/Desktop/TAO_toolkit/tao-getting-started_v5.3.0/Pilar_11cam_ReID/data/reidentificationnet",
            "destination": "/data"
        },
        {
            "source": "/home/minigo/Desktop/TAO_toolkit/tao-getting-started_v5.3.0/Pilar_11cam_ReID/data/reidentificationnet/model",
            "destination": "/model"
        },
        {
            "source": "/home/minigo/Desktop/TAO_toolkit/tao-getting-started_v5.3.0/notebooks/tao_launcher_starter_kit/re_identification_net/specs",
            "destination": "/specs"
        },
        {
            "source": "/home/minigo/Desktop/TAO_toolkit/tao-getting-started_v5.3.0/Pilar_11cam_ReID/reidentificationnet",
            "destination": "/results"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        }
    }
}

experiment_market1501_resnet.yaml :

results_dir: "/results/Pilar_11cam_ReID"
encryption_key: nvidia_tao
model:
  backbone: resnet_50
  last_stride: 1
  pretrain_choice: imagenet
  pretrained_model_path: "/results/pretrained/reidentificationnet_vtrainable_v1.1/resnet50_market1501_aicity156.tlt" 
  input_channels: 3
  input_width: 128
  input_height: 256
  neck: bnneck
  feat_dim: 256
  neck_feat: after
  metric_loss_type: triplet
  with_center_loss: False
  with_flip_feature: False
  label_smooth: True
dataset:
  train_dataset_dir: "/data/Pilar_11cam_ReID/sample_train"
  test_dataset_dir: "/data/Pilar_11cam_ReID/sample_test"
  query_dataset_dir: "/data/Pilar_11cam_ReID/sample_query"
  num_classes: 58
  batch_size: 64
  val_batch_size: 128
  num_workers: 1
  pixel_mean: [0.485, 0.456, 0.406]
  pixel_std: [0.226, 0.226, 0.226]
  padding: 10
  prob: 0.5
  re_prob: 0.5
  sampler: softmax_triplet
  num_instances: 4
re_ranking:
  re_ranking: True
  k1: 20
  k2: 6
  lambda_value: 0.3
train:
  optim:
    name: Adam
    lr_steps: [40, 70]
    gamma: 0.1
    bias_lr_factor: 1
    weight_decay: 0.0005
    weight_decay_bias: 0.0005
    warmup_factor: 0.01
    warmup_iters: 10
    warmup_method: linear
    base_lr: 0.00035
    momentum: 0.9
    center_loss_weight: 0.0005
    center_lr: 0.5
    triplet_loss_margin: 0.3
  num_epochs: 120
  checkpoint_interval: 10
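
For reference, the single-digit camera limitation mentioned above can be reproduced outside TAO. This is only a minimal sketch: the first pattern is the one quoted from the source code, while the multi_digit pattern, the parse helper, and the example file name are illustrative assumptions, not TAO code.

```python
import re

# Pattern quoted in this post: exactly one digit is captured after "_c".
single_digit = re.compile(r'([-\d]+)_c(\d)')
# Hypothetical variant (an assumption): one or more digits after "_c".
multi_digit = re.compile(r'([-\d]+)_c(\d+)')


def parse(pattern, filename):
    """Return (person_id, camera_id) as ints, or None if the name does not match."""
    match = pattern.search(filename)
    if match is None:
        return None
    pid, camid = map(int, match.groups())
    return pid, camid


# Hypothetical Market-1501-style file name for camera 11.
name = "0001_c11s1_000151_00.jpg"
print(parse(single_digit, name))  # (1, 1): camera 11 is read as camera 1
print(parse(multi_digit, name))   # (1, 11)
```

With the single-digit pattern, cameras 10 and 11 are both read as camera 1, which matches the behavior I saw with the previous version.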

I would appreciate any guidance in this regard.

To narrow this down, I suggest opening a terminal, starting the TAO docker container, and running the training command inside it. Refer to Train re_identification_net with more than 10 cameras using TAO - #5 by Morganh. You can also use docker commit xxx to save the code change.
Try running 1 or 2 epochs to check whether checkpoints are saved.
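
After such a short run, one way to check is to scan the results folder for checkpoint-like files. The sketch below is only an illustration: the find_checkpoints helper is hypothetical, and the file suffixes it looks for are assumptions that may need adjusting to whatever your TAO version actually writes.

```python
import sys
from pathlib import Path

# Assumed checkpoint suffixes; adjust to match what your TAO version writes.
CHECKPOINT_SUFFIXES = {".pth", ".ckpt", ".tlt"}


def find_checkpoints(results_dir):
    """Recursively collect files under results_dir that have a checkpoint-like suffix."""
    root = Path(results_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in CHECKPOINT_SUFFIXES
    )


if __name__ == "__main__":
    # Pass the results directory (host side or inside the container) as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    found = find_checkpoints(target)
    if found:
        for path in found:
            print(path)
    else:
        print(f"No checkpoint-like files found under {target}")
```

If nothing is listed after 1 or 2 epochs with checkpoint_interval set to 1, the problem is in checkpoint saving itself rather than in the interval being longer than the run.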

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
