Could you please upload your latest experiment_classify.yaml? Thanks.
experiment_classify.zip (967 Bytes)
File uploaded. Thanks.
Please change concat_type: grid
to concat_type: linear
and retry. Thanks.
OK. Actually, I had already tried concat_type: linear before.
It failed with the results below.
Train model
2023-11-16 08:40:41,341 [TAO Toolkit] [INFO] root 160: Registry: [‘nvcr.io’]
2023-11-16 08:40:41,625 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.1.0-pyt
2023-11-16 08:40:41,681 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 262:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the “user”:“UID:GID” in the
DockerOptions portion of the “/home/wistron/.tao_mounts.json” file. You can obtain your
users UID and GID by using the “id -u” and “id -g” commands on the
terminal.
2023-11-16 08:40:41,681 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
INFO: Created a temporary directory at /tmp/tmpa56ks5du
INFO: Writing /tmp/tmpa56ks5du/_remote_module_non_scriptable.py
INFO: generated new fontManager
sys:1: UserWarning:
‘experiment_classify.yaml’ is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
:107: UserWarning:
‘experiment_classify.yaml’ is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Train results will be saved at: /results/train
Experiment configuration:
model:
backbone:
type: fan_small_12_p4_hybrid
feat_downsample: false
pretrained_backbone_path: null
decode_head:
in_channels:
- 128
- 256
- 384
- 384
in_index:
- 0
- 1
- 2
- 3
feature_strides:
- 4
- 8
- 16
- 16
align_corners: false
decoder_params:
embed_dim: 256
classify:
train_margin_euclid: 2.0
eval_margin: 0.005
embedding_vectors: 5
embed_dec: 30
learnable_difference_modules: 4
difference_module: euclidean
dataset:
segment:
root_dir: ???
label_transform: norm
data_name: LEVIR
dataset: CNDataset
multi_scale_train: true
multi_scale_infer: false
num_classes: 2
img_size: 256
batch_size: 8
workers: 2
shuffle: true
image_folder_name: A
change_image_folder_name: B
list_folder_name: list
annotation_folder_name: label
augmentation:
random_flip:
vflip_probability: 0.5
hflip_probability: 0.5
enable: true
random_rotate:
rotate_probability: 0.5
angle_list:
- 90.0
- 180.0
- 270.0
enable: true
random_color:
brightness: 0.3
contrast: 0.3
saturation: 0.3
hue: 0.3
enable: true
with_scale_random_crop:
scale_range:
- 1.0
- 1.2
enable: true
with_random_blur: true
with_random_crop: true
mean:
- 0.5
- 0.5
- 0.5
std:
- 0.5
- 0.5
- 0.5
train_split: train
validation_split: val
test_split: test
predict_split: test
label_suffix: .png
color_map: null
classify:
train_dataset:
csv_path: /data/dataset_convert/train_combined.csv
images_dir: /data/images/
validation_dataset:
csv_path: /data/dataset_convert/valid_combined.csv
images_dir: /data/images/
test_dataset:
csv_path: /data/dataset_convert/valid_combined.csv
images_dir: /data/images/
infer_dataset:
csv_path: /data/dataset_convert/valid_combined.csv
images_dir: /data/images/
image_ext: .jpg
batch_size: 5
workers: 4
fpratio_sampling: 0.2
num_input: 1
input_map: null
grid_map:
x: 1
‘y’: 1
concat_type: linear
output_shape:
- 256
- 256
augmentation_config:
rgb_input_mean:
- 0.485
- 0.456
- 0.406
rgb_input_std:
- 0.229
- 0.224
- 0.225
num_classes: 2
train:
optim:
monitor_name: val_loss
optim: adamw
lr: 5.0e-05
policy: linear
momentum: 0.9
weight_decay: 0.01
num_epochs: 30
num_nodes: 1
val_interval: 1
checkpoint_interval: 1
pretrained_model_path: null
resume_training_checkpoint_path: null
results_dir: ${results_dir}/train
classify:
loss: contrastive
cls_weight:
- 1.0
- 10.0
segment:
loss: ce
weights:
- 0.5
- 0.5
- 0.5
- 0.8
- 1.0
tensorboard:
enabled: true
infrequent_logging_frequency: 1
evaluate:
num_gpus: 1
checkpoint: ${results_dir}/train/changenet_classify.pth
results_dir: null
vis_after_n_batches: 16
trt_engine: ${results_dir}/gen_trt_engine/changenet-classify.trt
batch_size: ${dataset.classify.batch_size}
inference:
num_gpus: 1
checkpoint: ${results_dir}/train/changenet_classify.pth
results_dir: null
vis_after_n_batches: 1
trt_engine: ${results_dir}/gen_trt_engine/changenet-classify.trt
batch_size: ${dataset.classify.batch_size}
gpu_id: 0
export:
results_dir: null
gpu_id: 0
checkpoint: ${results_dir}/train/changenet_classify.pth
onnx_file: ${results_dir}/export/changenet-classify.onnx
on_cpu: false
input_channel: 3
input_width: 128
input_height: 512
opset_version: 12
batch_size: ${dataset.classify.batch_size}
verbose: false
gen_trt_engine:
results_dir: null
gpu_id: 0
onnx_file: ${results_dir}/export/changenet-classify.onnx
trt_engine: ${results_dir}/gen_trt_engine/changenet-classify.trt
input_channel: 3
input_width: 128
input_height: 512
opset_version: 12
batch_size: ${dataset.classify.batch_size}
verbose: false
tensorrt:
data_type: FP32
workspace_size: 1024
min_batch_size: 1
opt_batch_size: 10
max_batch_size: 10
calibration:
cal_image_dir: ???
cal_cache_file: ???
cal_batch_size: 1
cal_batches: 1
encryption_key: ‘********’
results_dir: /results/train
num_gpus: 1
task: classify
:245: UserWarning: Log file already exists at /results/train/status.json
Number of output classes: 2
Total Parameters: 76331605
Trainable Parameters: 76331605
/usr/local/lib/python3.8/dist-packages/torchmetrics/utilities/prints.py:36: UserWarning: Torchmetrics v0.9 introduced a new argument class property called full_state_update that has
not been set for this class (AOIMetrics). The property determines if update by
default needs access to the full metric state. If this is not the case, significant speedups can be
achieved and we recommend setting this to False.
We provide an checking function
from torchmetrics.utilities import check_forward_full_state_property
that can be used to check if the full_state_update=True (old and potential slower behaviour,
default for now) or if full_state_update=False can be used safely.
warnings.warn(*args, **kwargs)
Tensorboard logging enabled.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Setting resume checkpoint to None
Results directory /results/train Checkpoint Interval 1
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:604: UserWarning: Checkpoint directory /results/train exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
0 | model | ChangeNetClassify | 76.3 M
1 | criterion | ContrastiveLoss | 0
2 | train_metrics | AOIMetrics | 0
3 | val_metrics | AOIMetrics | 0
76.3 M Trainable params
0 Non-trainable params
76.3 M Total params
305.326 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]Using 1 input types and linear type 1 X 1 for comparison
Number of steps for validation: 1930
Using 1 input types and linear type 1 X 1 for comparison
Sampling Defective components at 00.00:1 rate
Number of steps for training: 17305
Training: 0it [00:00, ?it/s]Starting Training Loop.
Epoch 0: 0%| | 0/19235 [00:00<?, ?it/s]invalid multinomial distribution (sum of probabilities <= 0)
Error executing job with overrides:
Traceback (most recent call last):
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/visual_changenet/scripts/train.py”, line 236, in main
raise e
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/visual_changenet/scripts/train.py”, line 218, in main
run_experiment(experiment_config=cfg,
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/visual_changenet/scripts/train.py”, line 199, in run_experiment
trainer.fit(model, ckpt_path=resume_ckpt or None)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py”, line 603, in fit
call._call_and_handle_interrupt(
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py”, line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py”, line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py”, line 1098, in _run
results = self._run_stage()
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py”, line 1177, in _run_stage
self._run_train()
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py”, line 1200, in _run_train
self.fit_loop.run()
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py”, line 199, in run
self.advance(*args, **kwargs)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/fit_loop.py”, line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py”, line 194, in run
self.on_run_start(*args, **kwargs)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py”, line 161, in on_run_start
_ = iter(data_fetcher) # creates the iterator inside the fetcher
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/fetching.py”, line 179, in iter
self._apply_patch()
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/fetching.py”, line 120, in _apply_patch
apply_to_collections(self.loaders, self.loader_iters, (Iterator, DataLoader), _apply_patch_fn)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/fetching.py”, line 156, in loader_iters
return self.dataloader_iter.loader_iters
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/supporters.py”, line 555, in loader_iters
self._loader_iters = self.create_loader_iters(self.loaders)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/supporters.py”, line 595, in create_loader_iters
return apply_to_collection(loaders, Iterable, iter, wrong_dtype=(Sequence, Mapping))
File “/usr/local/lib/python3.8/dist-packages/lightning_utilities/core/apply_func.py”, line 51, in apply_to_collection
return function(data, *args, **kwargs)
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py”, line 443, in iter
return self._get_iterator()
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py”, line 389, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py”, line 1086, in init
self._reset(loader, first_iter=True)
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py”, line 1119, in _reset
self._try_put_index()
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py”, line 1353, in _try_put_index
index = self._next_index()
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py”, line 625, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py”, line 247, in iter
batch = [next(sampler_iter) for _ in range(self.batch_size)]
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py”, line 247, in
batch = [next(sampler_iter) for _ in range(self.batch_size)]
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py”, line 203, in iter
rand_tensor = torch.multinomial(self.weights, self.num_samples, self.replacement, generator=self.generator)
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0: 0%| | 0/19235 [00:01<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 375) of binary: /usr/bin/python
Traceback (most recent call last):
File “/usr/local/bin/torchrun”, line 33, in
sys.exit(load_entry_point(‘torch==1.14.0a0+44dac51’, ‘console_scripts’, ‘torchrun’)())
File “/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py”, line 346, in wrapper
return f(*args, **kwargs)
File “/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py”, line 762, in main
run(args)
File “/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py”, line 753, in run
elastic_launch(
File “/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py”, line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File “/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py”, line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/visual_changenet/scripts/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-11-16_08:41:05
host : acd3402cffc5
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 375)
error_file: <N/A>
traceback : To enable traceback see: Error Propagation — PyTorch 2.1 documentation
Execution status: FAIL
2023-11-16 08:41:12,701 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.
OK, please download the dataset used in the following two notebooks and run with it:
https://github.com/NVIDIA/tao_tutorials/blob/95aca39c79cb9068593a6a9c3dcc7a509f4ad786/notebooks/tao_launcher_starter_kit/visual_changenet/visual_changenet_segmentation.ipynb and
https://github.com/NVIDIA/tao_tutorials/blob/95aca39c79cb9068593a6a9c3dcc7a509f4ad786/notebooks/tao_launcher_starter_kit/visual_changenet/visual_changenet_segmentation_MVTec.ipynb.
Do you mean I should download the dataset used in these two notebooks and then run the visual_changenet_classification notebook with it?
Yes, right.
I am training with the LEVIR-CD256 dataset you mentioned. I set image_ext: .png in experiment_classify.yaml, but training still looks for image files with the .jpg extension.
Failed log:
experiment_classify.yaml
part of val_combined.csv
The error comes from https://github.com/NVIDIA/tao_pytorch_backend/blob/1a94305efa8ac6425d655b00e21c1375a8d3302f/nvidia_tao_pytorch/cv/optical_inspection/dataloader/oi_dataset.py#L83.
Can you attach the full yaml file?
Sure.
experiment_classify.zip (988 Bytes)
It seems there is an additional issue in https://github.com/NVIDIA/tao_pytorch_backend/blob/1a94305efa8ac6425d655b00e21c1375a8d3302f/nvidia_tao_pytorch/cv/optical_inspection/dataloader/oi_dataset.py#L65.
Please open a terminal and run the training inside the docker container.
Steps:
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.1.0-pyt /bin/bash
Then, inside the container, copy the content of https://github.com/NVIDIA/tao_pytorch_backend/blob/1a94305efa8ac6425d655b00e21c1375a8d3302f/nvidia_tao_pytorch/cv/optical_inspection/dataloader/oi_dataset.py and save it to /usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/optical_inspection/dataloader/oi_dataset.py.
Then open the file and delete line 65, or comment it out as shown:
$ vim /usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/optical_inspection/dataloader/oi_dataset.py
#self.ext = '.jpg'
Then run the training again inside the container. Please note that the “tao” prefix is not needed at the beginning of the command.
# visual_changenet train xxx
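If editing by line number in vim feels fragile, the same change could be made by matching the text of the line instead. The following is only a sketch: the helper name comment_out_hardcoded_ext is hypothetical, and the pattern assumes the assignment looks like the self.ext = '.jpg' line quoted above, which may differ in spacing or quoting in the actual file.

```python
import re

# Pattern for the hard-coded assignment quoted in the thread; the exact
# spacing and quoting in the real file are an assumption.
HARDCODED_EXT = re.compile(r"^(\s*)(self\.ext\s*=\s*['\"]\.jpg['\"].*)$")


def comment_out_hardcoded_ext(source: str) -> str:
    """Return `source` with any uncommented `self.ext = '.jpg'` line commented out."""
    patched = []
    for line in source.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        newline = line[len(body):]  # preserve the original line ending
        match = HARDCODED_EXT.match(body)
        if match:
            # Keep the indentation, prefix the statement with a comment marker.
            body = f"{match.group(1)}# {match.group(2)}"
        patched.append(body + newline)
    return "".join(patched)


if __name__ == "__main__":
    sample = "class Demo:\n    def __init__(self):\n        self.ext = '.jpg'\n"
    print(comment_out_hardcoded_ext(sample))
```

To apply it, read oi_dataset.py inside the container, pass its text through the function, and write the result back. Lines that are already commented out are left untouched, so running it twice is harmless.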
OK. I will try it again. Thanks for your kind support.
Training with LEVIR-CD256 still fails, as shown in the following log (invalid multinomial distribution (sum of probabilities <= 0)).
root@6a722181a39e:/opt/nvidia/tools# visual_changenet train -e /specs/experiment_classify.yaml
INFO: Created a temporary directory at /tmp/tmphbwp2zwg
INFO: Writing /tmp/tmphbwp2zwg/_remote_module_non_scriptable.py
sys:1: UserWarning:
‘experiment_classify.yaml’ is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
:107: UserWarning:
‘experiment_classify.yaml’ is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Train results will be saved at: /results/train
Experiment configuration:
model:
backbone:
type: fan_small_12_p4_hybrid
feat_downsample: false
pretrained_backbone_path: null
decode_head:
in_channels:
- 128
- 256
- 384
- 384
in_index:
- 0
- 1
- 2
- 3
feature_strides:
- 4
- 8
- 16
- 16
align_corners: false
decoder_params:
embed_dim: 256
classify:
train_margin_euclid: 2.0
eval_margin: 0.005
embedding_vectors: 5
embed_dec: 30
learnable_difference_modules: 4
difference_module: euclidean
dataset:
segment:
root_dir: ???
label_transform: norm
data_name: LEVIR
dataset: CNDataset
multi_scale_train: true
multi_scale_infer: false
num_classes: 2
img_size: 256
batch_size: 8
workers: 2
shuffle: true
image_folder_name: A
change_image_folder_name: B
list_folder_name: list
annotation_folder_name: label
augmentation:
random_flip:
vflip_probability: 0.5
hflip_probability: 0.5
enable: true
random_rotate:
rotate_probability: 0.5
angle_list:
- 90.0
- 180.0
- 270.0
enable: true
random_color:
brightness: 0.3
contrast: 0.3
saturation: 0.3
hue: 0.3
enable: true
with_scale_random_crop:
scale_range:
- 1.0
- 1.2
enable: true
with_random_blur: true
with_random_crop: true
mean:
- 0.5
- 0.5
- 0.5
std:
- 0.5
- 0.5
- 0.5
train_split: train
validation_split: val
test_split: test
predict_split: test
label_suffix: .png
color_map: null
classify:
train_dataset:
csv_path: /data/list/train_combined.csv
images_dir: /data/
validation_dataset:
csv_path: /data/list/val_combined.csv
images_dir: /data/
test_dataset:
csv_path: /data/list/test_combined.csv
images_dir: /data/
infer_dataset:
csv_path: /data/list/test_combined.csv
images_dir: /data/
image_ext: .png
batch_size: 5
workers: 4
fpratio_sampling: 0.2
num_input: 1
input_map: null
grid_map:
x: 1
‘y’: 1
concat_type: linear
output_shape:
- 256
- 256
augmentation_config:
rgb_input_mean:
- 0.485
- 0.456
- 0.406
rgb_input_std:
- 0.229
- 0.224
- 0.225
num_classes: 2
train:
optim:
monitor_name: val_loss
optim: adamw
lr: 5.0e-05
policy: linear
momentum: 0.9
weight_decay: 0.01
num_epochs: 30
num_nodes: 1
val_interval: 1
checkpoint_interval: 1
pretrained_model_path: null
resume_training_checkpoint_path: null
results_dir: ${results_dir}/train
classify:
loss: contrastive
cls_weight:
- 1.0
- 10.0
segment:
loss: ce
weights:
- 0.5
- 0.5
- 0.5
- 0.8
- 1.0
tensorboard:
enabled: true
infrequent_logging_frequency: 1
evaluate:
num_gpus: 1
checkpoint: ${results_dir}/train/changenet_classify.pth
results_dir: null
vis_after_n_batches: 16
trt_engine: ${results_dir}/gen_trt_engine/changenet-classify.trt
batch_size: ${dataset.classify.batch_size}
inference:
num_gpus: 1
checkpoint: ${results_dir}/train/changenet_classify.pth
results_dir: null
vis_after_n_batches: 1
trt_engine: ${results_dir}/gen_trt_engine/changenet-classify.trt
batch_size: ${dataset.classify.batch_size}
gpu_id: 0
export:
results_dir: null
gpu_id: 0
checkpoint: ${results_dir}/train/changenet_classify.pth
onnx_file: ${results_dir}/export/changenet-classify.onnx
on_cpu: false
input_channel: 3
input_width: 128
input_height: 512
opset_version: 12
batch_size: ${dataset.classify.batch_size}
verbose: false
gen_trt_engine:
results_dir: null
gpu_id: 0
onnx_file: ${results_dir}/export/changenet-classify.onnx
trt_engine: ${results_dir}/gen_trt_engine/changenet-classify.trt
input_channel: 3
input_width: 128
input_height: 512
opset_version: 12
batch_size: ${dataset.classify.batch_size}
verbose: false
tensorrt:
data_type: FP32
workspace_size: 1024
min_batch_size: 1
opt_batch_size: 10
max_batch_size: 10
calibration:
cal_image_dir: ???
cal_cache_file: ???
cal_batch_size: 1
cal_batches: 1
encryption_key: ‘***’
results_dir: /results/train
num_gpus: 1
task: classify
:245: UserWarning: Log file already exists at /results/train/status.json
Number of output classes: 2
Total Parameters: 76331605
Trainable Parameters: 76331605
/usr/local/lib/python3.8/dist-packages/torchmetrics/utilities/prints.py:36: UserWarning: Torchmetrics v0.9 introduced a new argument class property called full_state_update that has
not been set for this class (AOIMetrics). The property determines if update by
default needs access to the full metric state. If this is not the case, significant speedups can be
achieved and we recommend setting this to False.
We provide an checking function
from torchmetrics.utilities import check_forward_full_state_property
that can be used to check if the full_state_update=True (old and potential slower behaviour,
default for now) or if full_state_update=False can be used safely.
warnings.warn(*args, **kwargs)
Tensorboard logging enabled.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Setting resume checkpoint to None
Results directory /results/train Checkpoint Interval 1
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:604: UserWarning: Checkpoint directory /results/train exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
0 | model | ChangeNetClassify | 76.3 M
1 | criterion | ContrastiveLoss | 0
2 | train_metrics | AOIMetrics | 0
3 | val_metrics | AOIMetrics | 0
76.3 M Trainable params
0 Non-trainable params
76.3 M Total params
305.326 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]Using 1 input types and linear type 1 X 1 for comparison
Number of steps for validation: 205
Using 1 input types and linear type 1 X 1 for comparison
Sampling Defective components at 00.00:1 rate
Number of steps for training: 1424
Training: 0it [00:00, ?it/s]Starting Training Loop.
Epoch 0: 0%| | 0/1629 [00:00<?, ?it/s]invalid multinomial distribution (sum of probabilities <= 0)
Error executing job with overrides:
Traceback (most recent call last):
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/visual_changenet/scripts/train.py”, line 236, in main
raise e
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/visual_changenet/scripts/train.py”, line 218, in main
run_experiment(experiment_config=cfg,
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/visual_changenet/scripts/train.py”, line 199, in run_experiment
trainer.fit(model, ckpt_path=resume_ckpt or None)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py”, line 603, in fit
call._call_and_handle_interrupt(
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py”, line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py”, line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py”, line 1098, in _run
results = self._run_stage()
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py”, line 1177, in _run_stage
self._run_train()
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py”, line 1200, in _run_train
self.fit_loop.run()
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py”, line 199, in run
self.advance(*args, **kwargs)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/fit_loop.py”, line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py”, line 194, in run
self.on_run_start(*args, **kwargs)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py”, line 161, in on_run_start
_ = iter(data_fetcher) # creates the iterator inside the fetcher
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/fetching.py”, line 179, in iter
self._apply_patch()
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/fetching.py”, line 120, in _apply_patch
apply_to_collections(self.loaders, self.loader_iters, (Iterator, DataLoader), _apply_patch_fn)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/fetching.py”, line 156, in loader_iters
return self.dataloader_iter.loader_iters
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/supporters.py”, line 555, in loader_iters
self._loader_iters = self.create_loader_iters(self.loaders)
File “/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/supporters.py”, line 595, in create_loader_iters
return apply_to_collection(loaders, Iterable, iter, wrong_dtype=(Sequence, Mapping))
File “/usr/local/lib/python3.8/dist-packages/lightning_utilities/core/apply_func.py”, line 51, in apply_to_collection
return function(data, *args, **kwargs)
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py”, line 443, in iter
return self._get_iterator()
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py”, line 389, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py”, line 1086, in init
self._reset(loader, first_iter=True)
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py”, line 1119, in _reset
self._try_put_index()
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py”, line 1353, in _try_put_index
index = self._next_index()
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py”, line 625, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py”, line 247, in iter
batch = [next(sampler_iter) for _ in range(self.batch_size)]
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py”, line 247, in
batch = [next(sampler_iter) for _ in range(self.batch_size)]
File “/usr/local/lib/python3.8/dist-packages/torch/utils/data/sampler.py”, line 203, in iter
rand_tensor = torch.multinomial(self.weights, self.num_samples, self.replacement, generator=self.generator)
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0: 0%| | 0/1629 [00:01<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 647) of binary: /usr/bin/python
Traceback (most recent call last):
File “/usr/local/bin/torchrun”, line 33, in
sys.exit(load_entry_point(‘torch==1.14.0a0+44dac51’, ‘console_scripts’, ‘torchrun’)())
File “/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py”, line 346, in wrapper
return f(*args, **kwargs)
File “/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py”, line 762, in main
run(args)
File “/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py”, line 753, in run
elastic_launch(
File “/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py”, line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File “/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py”, line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/visual_changenet/scripts/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-11-21_01:46:30
host : 6a722181a39e
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 647)
error_file: <N/A>
traceback : To enable traceback see: Error Propagation — PyTorch 2.1 documentation
Execution status: FAIL
Thanks for the info. I will use this dataset and try to reproduce on my side.
Hi Morganh,
Sorry, it works now. The failure was because all the data used only one class label. I randomly assigned the two labels “PASS” and “MISSING”, and then it worked. Now I will check whether our data has the same problem. Thank you for your help.
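For context, the RuntimeError in the logs is raised by torch.multinomial, which the traceback shows being called from the weighted sampler in torch/utils/data/sampler.py. The condition it reports can be illustrated without torch. This is only a sketch: normalize_weights is a hypothetical helper, not part of PyTorch or TAO, and how TAO derives the per-sample weights from the labels is not shown in this thread, so it is not modeled here.

```python
def normalize_weights(weights):
    """Turn non-negative sampling weights into probabilities.

    Mirrors the condition named in the error message: the weights cannot
    be normalized into a distribution when their sum is <= 0.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("invalid multinomial distribution (sum of probabilities <= 0)")
    return [w / total for w in weights]


if __name__ == "__main__":
    print(normalize_weights([1.0, 3.0]))
    try:
        normalize_weights([0.0, 0.0, 0.0])
    except ValueError as err:
        print("sampling is impossible:", err)
```

If every sample ends up with a weight of zero, there is nothing to draw from, which would be consistent with the "Sampling Defective components at 00.00:1 rate" line printed just before the failure.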
Great. Thanks for the info. Glad to know it is working now.
Hi Morganh,
Training with our dataset fails with the labels “OK” and “NG”, but succeeds with “PASS” and “MISSING”. Does the changenet_classifier model accept only the labels “PASS” and “MISSING”? And can it not do classification with more than two classes? Thanks.
Could you try “PASS” and “NG”?
Sure.
“PASS” and “NG” works. So, what could be the problem?
OK. Currently, the “PASS” label is needed.
It is due to https://github.com/NVIDIA/tao_pytorch_backend/blob/1a94305efa8ac6425d655b00e21c1375a8d3302f/nvidia_tao_pytorch/cv/optical_inspection/dataloader/oi_dataset.py#L145.
More info can be found in Data Annotation Format - NVIDIA Docs
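Given the findings above, it may help to check the label column of each CSV before starting a training run. The following is a minimal sketch using only the Python standard library. The function name check_labels and the column names label and input_path are assumptions for illustration; please confirm the actual column names against the Data Annotation Format documentation.

```python
import csv
import io
from collections import Counter

REQUIRED_LABEL = "PASS"  # per this thread, this label is currently required


def check_labels(csv_file, label_column="label"):
    """Count labels in an open CSV file and return (counts, problems).

    `label_column` is an assumed column name, not taken from the TAO source.
    """
    counts = Counter(row[label_column].strip() for row in csv.DictReader(csv_file))
    problems = []
    if len(counts) < 2:
        problems.append("fewer than two distinct labels")
    if REQUIRED_LABEL not in counts:
        problems.append(f"no rows labelled {REQUIRED_LABEL!r}")
    return counts, problems


if __name__ == "__main__":
    # In-memory stand-in for a file such as train_combined.csv.
    demo = io.StringIO("input_path,label\na,OK\nb,NG\n")
    print(check_labels(demo))
```

For a real file, open it with open(path, newline="") and pass the file object in. An empty problems list means the CSV has at least two labels and includes “PASS”, which matches the combinations reported as working in this thread.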