Hi there, I am a beginner with the TAO Toolkit. I am following the instructions in the OCDNet notebook and have correctly 1. set up the environment variables and mapped the drives, 2. installed the TAO launcher, and 3. set up the training spec (it is actually used as provided).
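For reference, my environment setup cell looks roughly like the one below (the HOST_* variable names are just how I refer to the host paths here; the values are the same paths that appear in my mount.json further down, and SPECS_DIR/RESULTS_DIR are the container-side paths used in the train command):

%env HOST_DATA_DIR=/home/cc/Documents/tao-ocd-dir/data/ocdnet
%env HOST_SPECS_DIR=/home/cc/Documents/tao_tutorials/notebooks/tao_launcher_starter_kit/ocdnet/specs
%env HOST_RESULTS_DIR=/home/cc/Documents/tao-ocd-dir/ocdnet/results
# container-side paths referenced by the !tao commands
%env DATA_DIR=/data/ocdnet
%env SPECS_DIR=/specs
%env RESULTS_DIR=/results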
Please provide the following information when requesting support.
• Hardware: RTX 3060ti
• Network Type: ocdnet_vtrainable_resnet18_v1.0
• TLT Version:
Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.5.0
published_date: 08/26/2024
• Training spec file (if available, please share here):
model:
  load_pruned_graph: False
  pruned_graph_path: '/results/prune/pruned_0.1.pth'
  pretrained_model_path: '/data/ocdnet/ocdnet_deformable_resnet18.pth'
  backbone: deformable_resnet18
train:
  results_dir: /results/train
  num_gpus: 1
  num_epochs: 30
  #resume_training_checkpoint_path: '/results/train/resume.pth'
  checkpoint_interval: 1
  validation_interval: 1
  trainer:
    clip_grad_norm: 5.0
  optimizer:
    type: Adam
    args:
      lr: 0.001
  lr_scheduler:
    type: WarmupPolyLR
    args:
      warmup_epoch: 3
  post_processing:
    type: SegDetectorRepresenter
    args:
      thresh: 0.3
      box_thresh: 0.55
      max_candidates: 1000
      unclip_ratio: 1.5
  metric:
    type: QuadMetric
    args:
      is_output_polygon: false
dataset:
  train_dataset:
    data_path: ['/data/ocdnet/train']
    args:
      pre_processes:
        - type: IaaAugment
          args:
            - {'type':Fliplr, 'args':{'p':0.5}}
            - {'type': Affine, 'args':{'rotate':[-10,10]}}
            - {'type':Resize, 'args':{'size':[0.5,3]}}
        - type: EastRandomCropData
          args:
            size: [640,640]
            max_tries: 50
            keep_ratio: true
        - type: MakeBorderMap
          args:
            shrink_ratio: 0.4
            thresh_min: 0.3
            thresh_max: 0.7
        - type: MakeShrinkMap
          args:
            shrink_ratio: 0.4
            min_text_size: 8
      img_mode: BGR
      filter_keys: [img_path, img_name, text_polys, texts, ignore_tags, shape]
      ignore_tags: ['*', '###']
    loader:
      batch_size: 20
      pin_memory: true
      num_workers: 12
  validate_dataset:
    data_path: ['/data/ocdnet/test']
    args:
      pre_processes:
        - type: Resize2D
          args:
            short_size:
              - 1280
              - 736
            resize_text_polys: true
      img_mode: BGR
      filter_keys:
      ignore_tags: ['*', '###']
    loader:
      batch_size: 1
      pin_memory: false
      num_workers: 1
My mount.json is as follows:
{
    "Mounts": [
        {
            "source": "/home/cc/Documents/tao-ocd-dir",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/cc/Documents/tao-ocd-dir/data/ocdnet",
            "destination": "/data/ocdnet"
        },
        {
            "source": "/home/cc/Documents/tao_tutorials/notebooks/tao_launcher_starter_kit/ocdnet/specs",
            "destination": "/specs"
        },
        {
            "source": "/home/cc/Documents/tao-ocd-dir/ocdnet/results",
            "destination": "/results"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "network": "host"
    }
}
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
When I ran the cell below for the first time, TAO started pulling a lot of containers, but my notebook crashed, so I am not sure whether it pulled all the containers it needed.
!tao model ocdnet train \
    -e $SPECS_DIR/train.yaml \
    results_dir=$RESULTS_DIR \
    model.pretrained_model_path=$RESULTS_DIR/pretrained_ocdnet/ocdnet_vtrainable_resnet18_v1.0/ocdnet_deformable_resnet18.pth
But when I re-entered the notebook and ran this command again, it no longer downloaded any containers and just printed the warning and error below:
2024-10-17 00:33:21,551 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-10-17 00:33:21,593 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-10-17 00:33:21,601 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
Error response from daemon: No such container: bddc7ffc435e4e9cea1521c61dd1aeb52c8151e0795e5d7c8c3c65d02460b150
2024-10-17 00:33:22,387 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
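In case it helps with diagnosing this, I can check on the host whether the 5.5.0-pyt image actually finished pulling and whether any TAO containers are left over, using plain Docker commands like these (standard Docker CLI, nothing TAO-specific):

# list locally available TAO Toolkit images (shows whether the pull completed)
!docker images nvcr.io/nvidia/tao/tao-toolkit
# list any containers (running or exited) created from the 5.5.0-pyt image
!docker ps -a --filter ancestor=nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt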
What should I do?