Hi, I recently upgraded my TAO CLI to the following version:
Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.3.0
published_date: 03/14/2024
I am trying to run the reidentificationnet_resnet.ipynb notebook:
tao model re_identification train \
-e $SPECS_DIR/experiment_market1501_resnet.yaml \
-r $RESULTS_DIR/Pilar_11cam_ReID \
-k $KEY
However, for some reason it does not generate the checkpoints inside the train folder. It correctly generates the train/lightning_logs/version_0 folder, with the hparams.yaml and events.out.tfevents.....server-training files, but not the checkpoints folder. The previous version did not give me this problem, but it could not handle the 11 cameras in my custom dataset because of the re.compile(r'([-\d]+)_c(\d)') pattern in the source code, which only captures one digit after the letter c. Here are the contents of my .tao_mounts.json file and the configuration file I am using.
{
"Mounts": [
{
"source": "/home/minigo/Desktop/TAO_toolkit/tao-getting-started_v5.3.0/Pilar_11cam_ReID",
"destination": "/workspace/tao-experiments"
},
{
"source": "/home/minigo/Desktop/TAO_toolkit/tao-getting-started_v5.3.0/Pilar_11cam_ReID/data/reidentificationnet",
"destination": "/data"
},
{
"source": "/home/minigo/Desktop/TAO_toolkit/tao-getting-started_v5.3.0/Pilar_11cam_ReID/data/reidentificationnet/model",
"destination": "/model"
},
{
"source": "/home/minigo/Desktop/TAO_toolkit/tao-getting-started_v5.3.0/notebooks/tao_launcher_starter_kit/re_identification_net/specs",
"destination": "/specs"
},
{
"source": "/home/minigo/Desktop/TAO_toolkit/tao-getting-started_v5.3.0/Pilar_11cam_ReID/reidentificationnet",
"destination": "/results"
}
],
"DockerOptions": {
"shm_size": "16G",
"ulimits": {
"memlock": -1,
"stack": 67108864
}
}
}
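To double-check where a container path such as the results directory ends up on the host, the mounts above can be resolved with a small script. This is only a minimal sketch that assumes the longest matching destination wins; the helper name to_host_path is hypothetical and not part of TAO.

```python
from pathlib import PurePosixPath

# Mount table copied from the .tao_mounts.json above (source, destination).
BASE = "/home/minigo/Desktop/TAO_toolkit/tao-getting-started_v5.3.0"
MOUNTS = [
    (f"{BASE}/Pilar_11cam_ReID", "/workspace/tao-experiments"),
    (f"{BASE}/Pilar_11cam_ReID/data/reidentificationnet", "/data"),
    (f"{BASE}/Pilar_11cam_ReID/data/reidentificationnet/model", "/model"),
    (f"{BASE}/notebooks/tao_launcher_starter_kit/re_identification_net/specs", "/specs"),
    (f"{BASE}/Pilar_11cam_ReID/reidentificationnet", "/results"),
]


def to_host_path(container_path, mounts=MOUNTS):
    """Hypothetical helper: map a container path to its host path.

    Assumes the mount with the longest matching destination applies.
    Returns None when no mount covers the path.
    """
    path = PurePosixPath(container_path)
    best = None
    for source, destination in mounts:
        dest = PurePosixPath(destination)
        if path == dest or dest in path.parents:
            if best is None or len(dest.parts) > len(best[1].parts):
                best = (PurePosixPath(source), dest)
    if best is None:
        return None
    return str(best[0] / path.relative_to(best[1]))


# Where should the train folder for results_dir "/results/Pilar_11cam_ReID" appear on the host?
print(to_host_path("/results/Pilar_11cam_ReID/train"))
```

With these mounts, /results/Pilar_11cam_ReID/train resolves to a train folder under Pilar_11cam_ReID/reidentificationnet/Pilar_11cam_ReID on the host, which is where I would expect the checkpoints to show up.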
experiment_market1501_resnet.yaml:
results_dir: "/results/Pilar_11cam_ReID"
encryption_key: nvidia_tao
model:
  backbone: resnet_50
  last_stride: 1
  pretrain_choice: imagenet
  pretrained_model_path: "/results/pretrained/reidentificationnet_vtrainable_v1.1/resnet50_market1501_aicity156.tlt"
  input_channels: 3
  input_width: 128
  input_height: 256
  neck: bnneck
  feat_dim: 256
  neck_feat: after
  metric_loss_type: triplet
  with_center_loss: False
  with_flip_feature: False
  label_smooth: True
dataset:
  train_dataset_dir: "/data/Pilar_11cam_ReID/sample_train"
  test_dataset_dir: "/data/Pilar_11cam_ReID/sample_test"
  query_dataset_dir: "/data/Pilar_11cam_ReID/sample_query"
  num_classes: 58
  batch_size: 64
  val_batch_size: 128
  num_workers: 1
  pixel_mean: [0.485, 0.456, 0.406]
  pixel_std: [0.226, 0.226, 0.226]
  padding: 10
  prob: 0.5
  re_prob: 0.5
  sampler: softmax_triplet
  num_instances: 4
re_ranking:
  re_ranking: True
  k1: 20
  k2: 6
  lambda_value: 0.3
train:
  optim:
    name: Adam
    lr_steps: [40, 70]
    gamma: 0.1
    bias_lr_factor: 1
    weight_decay: 0.0005
    weight_decay_bias: 0.0005
    warmup_factor: 0.01
    warmup_iters: 10
    warmup_method: linear
    base_lr: 0.00035
    momentum: 0.9
    center_loss_weight: 0.0005
    center_lr: 0.5
    triplet_loss_margin: 0.3
  num_epochs: 120
  checkpoint_interval: 10
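For reference, the camera-ID limitation mentioned earlier can be reproduced outside TAO. This is a minimal sketch: the file name is made up in the Market-1501 naming style, parse_ids is a hypothetical helper, and the multi-digit pattern is only my assumption of what a fix might look like, not code taken from the TAO source.

```python
import re

# Pattern quoted from the source code: captures a single digit after "_c".
single_digit = re.compile(r'([-\d]+)_c(\d)')
# Hypothetical variant (an assumption, not from TAO): one or more digits.
multi_digit = re.compile(r'([-\d]+)_c(\d+)')


def parse_ids(pattern, filename):
    """Return (person_id, camera_id) parsed from a file name, or None."""
    match = pattern.search(filename)
    if match is None:
        return None
    pid, camid = map(int, match.groups())
    return pid, camid


# Made-up file name for person 1 seen by camera 11.
name = "0001_c11s1_000151_00.jpg"
print(parse_ids(single_digit, name))  # (1, 1): camera 11 is read as camera 1
print(parse_ids(multi_digit, name))   # (1, 11)
```

With the single-digit pattern, cameras 10 and 11 would both be read as camera 1, which is why the previous version could not separate my 11 cameras.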
I would appreciate any guidance in this regard.