DS 7.0
dGPU or GeForce RTX 3060 in a laptop
TAO toolkit.
I’m looking for some hints. I was trying to follow the tutorial TAO Toolkit Quick Start Guide - NVIDIA Docs, but I must have made some mistakes.
I first tried to run the getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/detectnet_v2/detectnet_v2.ipynb notebook, to no avail. In the end I dropped that and followed the Jupyter script step by step on the command line.
These are my steps:
- Created a folder tao-experiments on my host system, created a data subdir underneath and copied the separately downloaded model there (Step 2 D).
- I set up all requirements, built a conda environment, activated it and set up my environment:
export NUM_GPUS=1
export USER_EXPERIMENT_DIR=/workspace/tao-experiments
export LOCAL_EXPERIMENT_DIR=/home/ubuntu/tao-experiments
export DATA_DOWNLOAD_DIR=/workspace/tao-experiments/data
export LOCAL_PROJECT_DIR=/home/ubuntu/tao-experiments
export LOCAL_DATA_DIR=/home/ubuntu/tao-experiments/data
export VIRTUALENVWRAPPER_PYTHON=/home/ubuntu/anaconda3/envs/launcher/bin/python
export LOCAL_SPECS_DIR=/home/ubuntu/tao-experiments/detectnet_v2/specs
export SPECS_DIR=/workspace/tao-experiments/detectnet_v2/specs
My ~/.tao_mounts.json at that time looked like this:
{
    "Mounts": [
        {
            "source": "/home/ubuntu/tao-experiments",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/ubuntu/tao-experiments/detectnet_v2/specs",
            "destination": "/workspace/tao-experiments/detectnet_v2/specs"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}
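To double-check that the container-side variables ($USER_EXPERIMENT_DIR, $DATA_DOWNLOAD_DIR, $SPECS_DIR) and the host-side ones ($LOCAL_...) point at the same places, I find it helpful to translate paths by hand. Below is a minimal Python sketch of that translation. The helper container_to_host is my own illustration and not part of the TAO launcher; it only assumes that the most specific (longest) matching destination wins, as with nested bind mounts.

```python
from pathlib import PurePosixPath

# Mount table copied from the ~/.tao_mounts.json shown above.
MOUNTS = [
    {"source": "/home/ubuntu/tao-experiments",
     "destination": "/workspace/tao-experiments"},
    {"source": "/home/ubuntu/tao-experiments/detectnet_v2/specs",
     "destination": "/workspace/tao-experiments/detectnet_v2/specs"},
]


def container_to_host(container_path, mounts=MOUNTS):
    """Return the host path behind a container path, or None if unmounted.

    Illustrative helper (not a TAO API): the longest matching
    destination is used, mirroring nested bind mounts.
    """
    path = PurePosixPath(container_path)
    best_src, best_dest = None, None
    for mount in mounts:
        dest = PurePosixPath(mount["destination"])
        if path == dest or dest in path.parents:
            if best_dest is None or len(dest.parts) > len(best_dest.parts):
                best_src, best_dest = PurePosixPath(mount["source"]), dest
    if best_dest is None:
        return None
    return str(best_src / path.relative_to(best_dest))


# $DATA_DOWNLOAD_DIR and $SPECS_DIR as exported above.
print(container_to_host("/workspace/tao-experiments/data"))
print(container_to_host("/workspace/tao-experiments/detectnet_v2/specs"))
```

With the values from my environment, the two printed paths should match $LOCAL_DATA_DIR and $LOCAL_SPECS_DIR respectively.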
- Step 2C failed immediately with a permission error while attempting to create a directory; which directory was not reported. I found a post here which suggested removing the "user": "1000:1000" entry, and it worked then.
So step 2C passed now:
tao model detectnet_v2 dataset_convert \
-d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt \
-o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval \
-r $USER_EXPERIMENT_DIR/
and these commands showed useful results.
ls -rlt $LOCAL_DATA_DIR/tfrecords/kitti_trainval/
cat $LOCAL_SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt
I then trained the model with the kitti dataset for about 7 h successfully:
tao model detectnet_v2 train -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt \
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
-n resnet18_detector \
--gpus $NUM_GPUS
and the suggested check command listed the trained model:
(launcher) ubuntu@simulator:~/tao-experiments$ ls -lh $LOCAL_EXPERIMENT_DIR/experiment_dir_unpruned/weights
total 43M
-rw-r--r-- 1 root root 43M May 26 02:39 resnet18_detector.hdf5
- Step 5: Evaluate the trained model. This was OK:
tao model detectnet_v2 evaluate -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt \
-m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/resnet18_detector.hdf5
- Step 6: Prune the trained model - worked
mkdir -p $LOCAL_EXPERIMENT_DIR/experiment_dir_pruned
ls $LOCAL_EXPERIMENT_DIR/experiment_dir_pruned
// Result
resnet18_nopool_bn_detectnet_v2_pruned.hdf5
tao model detectnet_v2 prune \
-m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/resnet18_detector.hdf5 \
-o $USER_EXPERIMENT_DIR/experiment_dir_pruned/resnet18_nopool_bn_detectnet_v2_pruned.hdf5 \
-eq union \
-pth 0.0000052
Finally
ls -rlt $LOCAL_EXPERIMENT_DIR/experiment_dir_pruned/
// Result
total 34092
-rw-r--r-- 1 root root 34903192 May 26 07:34 resnet18_nopool_bn_detectnet_v2_pruned.hdf5
- Then I came to step 7 (Retrain the pruned model) and didn’t expect any more problems, but there was one:
tao model detectnet_v2 train -e $SPECS_DIR/detectnet_v2_retrain_resnet18_kitti.txt \
-r $USER_EXPERIMENT_DIR/experiment_dir_retrain \
-n resnet18_detector_pruned \
--gpus $NUM_GPUS
All of a sudden this step failed with: Pretrained model file not found: /workspace/tao-experiments/detectnet_v2/experiment_dir_pruned/resnet18_nopool_bn_detectnet_v2_pruned.hdf5
The strange thing: this model file exists, but not where the tao app expects it. And if I’m not wrong, it was created while executing step 6 of the Jupyter script, Prune the trained model:
mkdir -p $LOCAL_EXPERIMENT_DIR/experiment_dir_pruned
experiment_dir_pruned is clearly created below $LOCAL_EXPERIMENT_DIR, so why is it resolved under $LOCAL_EXPERIMENT_DIR/detectnet_v2/experiment_dir_pruned?
.
├── data
├── detectnet_v2
├── experiment_dir_pruned
├── experiment_dir_retrain
├── experiment_dir_unpruned
└── status.json
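To make the mismatch explicit, here is a small Python sketch that compares the path the prune step wrote to (its -o argument) with the path from the error message, both taken relative to $USER_EXPERIMENT_DIR. It only does string-level path arithmetic; the variable names are mine, and it does not claim to know where the expected path comes from.

```python
from pathlib import PurePosixPath

# $USER_EXPERIMENT_DIR as exported in my environment.
USER_EXPERIMENT_DIR = PurePosixPath("/workspace/tao-experiments")

# Path written by the prune step (-o argument in step 6).
written = (USER_EXPERIMENT_DIR / "experiment_dir_pruned"
           / "resnet18_nopool_bn_detectnet_v2_pruned.hdf5")

# Path reported by the "Pretrained model file not found" error in step 7.
expected = PurePosixPath(
    "/workspace/tao-experiments/detectnet_v2/experiment_dir_pruned/"
    "resnet18_nopool_bn_detectnet_v2_pruned.hdf5")

written_rel = written.relative_to(USER_EXPERIMENT_DIR)
expected_rel = expected.relative_to(USER_EXPERIMENT_DIR)

# Components that appear in the expected path but not in the written one.
extra = [part for part in expected_rel.parts if part not in written_rel.parts]

print("written :", written_rel)
print("expected:", expected_rel)
print("extra   :", extra)
```

The only difference is the additional detectnet_v2 component, so whatever produces the expected path seems to assume an experiment root one level deeper than my $USER_EXPERIMENT_DIR. That is my reading, not something I have confirmed.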
In the end I altered ~/.tao_mounts.json and added a special mapping for this directory:
{
    "source": "/home/ubuntu/tao-experiments/experiment_dir_pruned",
    "destination": "/workspace/tao-experiments/detectnet_v2/experiment_dir_pruned"
}
Training passed then, but I don’t think this is the correct fix. So what did I do wrong?
- I was able to perform step 8 “Evaluate the retrained model” without problems:
tao model detectnet_v2 evaluate -e $SPECS_DIR/detectnet_v2_retrain_resnet18_kitti.txt \
-m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.hdf5
- Step 9 “Visualize inferences” also worked in the end:
mkdir -p $LOCAL_DATA_DIR/test_samples
cp $LOCAL_DATA_DIR/testing/image_2/00000* $LOCAL_DATA_DIR/test_samples
tao model detectnet_v2 inference -e $SPECS_DIR/detectnet_v2_inference_kitti_tlt.txt \
-r $USER_EXPERIMENT_DIR/tlt_infer_testing \
-i $DATA_DOWNLOAD_DIR/test_samples
But this script complained as well at first:
AssertionError: Pretrained model not found at /workspace/tao-experiments/detectnet_v2/experiment_dir_retrain/weights/resnet18_detector_pruned.hdf5
Luckily tao model detectnet_v2 comes to the rescue and reveals that the file is really not there:
root@e3d1eb437087:/workspace# ls /workspace/tao-experiments/detectnet_v2/experiment_dir_retrain/weights/resnet18_detector_pruned.hdf5
ls: cannot access '/workspace/tao-experiments/detectnet_v2/experiment_dir_retrain/weights/resnet18_detector_pruned.hdf5': No such file or directory
But it is here:
root@e3d1eb437087:/workspace# ls /workspace/tao-experiments/experiment_dir_retrain/weights/resnet18_detector_pruned.hdf5
/workspace/tao-experiments/experiment_dir_retrain/weights/resnet18_detector_pruned.hdf5
So it is the same thing: earlier scripts placed their output in one directory, while the following scripts expect the results one step further down the directory tree.
I applied the same kind of mapping as before by adding this to ~/.tao_mounts.json:
{
    "source": "/home/ubuntu/tao-experiments/experiment_dir_retrain",
    "destination": "/workspace/tao-experiments/detectnet_v2/experiment_dir_retrain"
}
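Since both errors name a path under /workspace/tao-experiments/detectnet_v2, one thing I could check is whether the spec files contain hard-coded container paths. The Python sketch below scans a spec text for quoted absolute paths and reports the first directory below $USER_EXPERIMENT_DIR. The SPEC_TEXT excerpt is hypothetical (I have not pasted my real spec files here), and both helper functions are my own.

```python
import re
from pathlib import PurePosixPath

# $USER_EXPERIMENT_DIR as exported in my environment.
EXPERIMENT_ROOT = PurePosixPath("/workspace/tao-experiments")

# Hypothetical excerpt; the path is the one from the step 7 error message.
SPEC_TEXT = '''
model_config {
  pretrained_model_file: "/workspace/tao-experiments/detectnet_v2/experiment_dir_pruned/resnet18_nopool_bn_detectnet_v2_pruned.hdf5"
}
'''


def quoted_paths(text):
    """Return all double-quoted absolute paths found in the text."""
    return re.findall(r'"(/[^"]+)"', text)


def first_dir_below(path, root=EXPERIMENT_ROOT):
    """Return the first path component below root, or None if outside root."""
    try:
        rel = PurePosixPath(path).relative_to(root)
    except ValueError:
        return None
    return rel.parts[0] if rel.parts else None


for found in quoted_paths(SPEC_TEXT):
    print(first_dir_below(found), "<-", found)
```

If the real spec files show detectnet_v2 as the first component, the paths in them do not agree with the directories my commands wrote to.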
The question is: what am I overlooking, and why do I have to alter the mapping?
The final step, 10. Model export, again didn’t need any help. I created experiment_final_dir as a subdir under /home/ubuntu/tao-experiments and it produced the model, labels and config without problems.